# Example of using CuPy and Ray on the GPU

I’m trying to understand how to use Ray with a GPU. As a basic GPU test (see Example 1), I use CuPy to perform array operations on the GPU, repeating them in a for loop. I compared the CuPy example to a similar example that uses Ray with the GPU (see Example 2). As the elapsed times show, the CuPy + Ray example is much slower than the CuPy-only example. Why is the CuPy + Ray example so slow?

## Example 1 (CuPy only)

```python
import cupy as cp
import time

def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(1000):
    z = multiply(x)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])
```

Output from Example 1 is:

```
Elapsed time 10.93 s
[503. 503. 503. 503. 503.]
```

## Example 2 (CuPy and Ray)

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(1000):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Example 2 did not finish; I canceled the run because it was taking too long. See the modified version below.

Here’s a modified version using `range(100)`:

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(100):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Output from the modified version:

```
Elapsed time 154.28 s
[53. 53. 53. 53. 53.]
```

Unrelated to CuPy: to take advantage of Ray’s parallelism, you’ll want to submit all of your remote tasks before calling `ray.get`, which is a blocking call.

See Antipattern: Calling ray.get in a loop — Ray v1.9.2 for a code example!

Based on the reply from @matthewdeng, I updated my example to work better with Ray. See the updated examples below. The CuPy + Ray example is still slower. Any suggestions on how to improve the CuPy + Ray example?

## Example 3 (CuPy only)

```python
import cupy as cp
import time

def summation(x):
    x_cupy = cp.ones((1000, 1000, 20))
    y = x_cupy * x
    z = y / 2 + 3.5
    s = cp.sum(z)
    return s

tic = time.perf_counter()

sums = []

for x in range(1000):
    s = summation(x)
    sums.append(s)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(sums[0])
print(sums[-1])
```

Elapsed time is 12.38 seconds.

## Example 4 (CuPy and Ray)

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.2)
def summation(x):
    x_cupy = cp.ones((1000, 1000, 20))
    y = x_cupy * x
    z = y / 2 + 3.5
    s = cp.sum(z)
    return s

tic = time.perf_counter()

lazy_sums = []

for x in range(1000):
    s = summation.remote(x)
    lazy_sums.append(s)

sums = ray.get(lazy_sums)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(sums[0])
print(sums[-1])

ray.shutdown()
```

Elapsed time is 17.2 seconds.

Is there any difference if you increase the parallelization?

```diff
- @ray.remote(num_gpus=0.2)
+ @ray.remote(num_gpus=0.1)
```

It could also be that the function itself is really fast, so the overhead of scheduling each task becomes more noticeable: Antipattern: Too fine-grained tasks — Ray v1.9.2

No, I do not see much of a difference going from `num_gpus=0.2` to `num_gpus=0.1`. If you have a better example, please share it. I’m just trying to develop an example that demonstrates using Ray with CuPy on a GPU.