I’m trying to understand how to use Ray with a GPU. As a baseline GPU test (see Example 1), I use CuPy to perform array operations on the GPU, repeating the operations in a for-loop. I compared this to a similar example that uses Ray with the GPU (see Example 2). As the elapsed times show, the CuPy + Ray example ran much more slowly than the CuPy-only example. Why is the CuPy + Ray example so slow?

## Example 1 (CuPy only)

```
import cupy as cp
import time


def multiply(x):
    # Allocate a large array on the GPU and do element-wise math
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z


tic = time.perf_counter()

for x in range(1000):
    z = multiply(x)

toc = time.perf_counter()
print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])
```

Output from Example 1 is:

```
Elapsed time 10.93 s
[503. 503. 503. 503. 503.]
```

## Example 2 (CuPy and Ray)

```
import cupy as cp
import ray
import time

ray.init(num_gpus=1)


@ray.remote(num_gpus=0.1)
def multiply(x):
    # Same GPU work as Example 1, but run as a Ray task
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z


tic = time.perf_counter()

for x in range(1000):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()
print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Example 2 never finished; I canceled the run because it was taking too long, so there is no output to show. See the modified version below.

Here’s a modified version using `range(100)`:

```
import cupy as cp
import ray
import time

ray.init(num_gpus=1)


@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z


tic = time.perf_counter()

# Same as Example 2, but with only 100 iterations
for x in range(100):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()
print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Output from the modified version:

```
Elapsed time 154.28 s
[53. 53. 53. 53. 53.]
```
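For context, each call to `multiply` returns a full 1000 × 1000 × 200 array of float64 values, so every `ray.get` in the loop has to transfer that array back to the driver. A quick sanity check on the data volume per task (plain NumPy dtype arithmetic, no GPU needed):

```python
import numpy as np

# Same shape and dtype as the arrays returned in the examples above
shape = (1000, 1000, 200)
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

size_bytes = shape[0] * shape[1] * shape[2] * itemsize
print(f'{size_bytes / 1e9:.1f} GB per returned array')  # → 1.6 GB
```

So each loop iteration in Example 2 moves roughly 1.6 GB through Ray's object store, which the CuPy-only version never has to do.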