I’m trying to understand how to use Ray with a GPU. For a basic GPU test (see Example 1), I use CuPy to perform array operations on the GPU, repeating the operations in a for-loop. I compared this to a similar example that uses Ray with the GPU (see Example 2). As the elapsed times show, the CuPy + Ray version runs far more slowly than the CuPy-only version. Why is the CuPy + Ray example so slow?
Example 1 (CuPy only)
```python
import cupy as cp
import time

def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()
for x in range(1000):
    z = multiply(x)
toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])
```
Output from Example 1 is:

```
Elapsed time 10.93 s
[503. 503. 503. 503. 503.]
```
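As a sanity check on those printed values: the final loop iteration uses x = 999, and every element of the result is (1 × 999) / 2 + 3.5 = 503.0, which matches the output.

```python
# Final iteration of range(1000) is x = 999; each element of
# cp.ones(...) * x is 999, then divided by 2 and offset by 3.5
x = 999
print(x / 2 + 3.5)  # 503.0
```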
Example 2 (CuPy and Ray)
```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()
for x in range(1000):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)
toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```
Output from Example 2: I canceled the run because it was taking too long.

Here’s a modified version that uses `range(100)` instead:
```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()
for x in range(100):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)
toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```
Output from the modified version:

```
Elapsed time 154.28 s
[53. 53. 53. 53. 53.]
```
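For what it’s worth, I also wondered whether two things in my Ray loop are skewing the comparison: calling `ray.get` inside the loop blocks on each task before submitting the next, and each task returns the full 1000×1000×200 float64 array (roughly 1.6 GB), which Ray has to copy off the GPU and serialize per iteration. Below is a variant I have not benchmarked that submits all tasks before blocking and returns only a small NumPy slice; the slicing is my own addition, purely to test whether result transfer dominates the time.

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    z = (x_cupy * x) / 2 + 3.5
    # Return a tiny NumPy slice instead of the full ~1.6 GB array,
    # to check whether per-task result transfer dominates the cost
    return cp.asnumpy(z[0, 0, 0:5])

tic = time.perf_counter()
# Submit every task before blocking, then gather with one ray.get,
# so tasks can be scheduled without waiting on each other
refs = [multiply.remote(x) for x in range(100)]
results = ray.get(refs)
toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(results[-1])

ray.shutdown()
```

If this variant is fast, that would suggest the overhead in Example 2 comes from the synchronous `ray.get` per iteration and from moving the large result arrays, rather than from the GPU arithmetic itself.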