# Example of using CuPy and Ray on the GPU

I’m trying to understand how to use Ray with a GPU. As a basic GPU test (see Example 1), I use CuPy to perform array operations on the GPU, repeating them in a for loop. I compared the CuPy example to a similar example that uses Ray with the GPU (see Example 2). As the elapsed times show, the CuPy + Ray example is much slower than the CuPy-only example. Why is the CuPy + Ray example so slow?

## Example 1 (CuPy only)

```python
import cupy as cp
import time

def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(1000):
    z = multiply(x)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])
```

Output from Example 1 is:

```
Elapsed time 10.93 s
[503. 503. 503. 503. 503.]
```

## Example 2 (CuPy and Ray)

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(1000):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Example 2 did not finish; I canceled the run because it was taking too long. See the modified version below.

Here’s a modified version using `range(100)`:

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(100):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Output from the modified version:

```
Elapsed time 154.28 s
[53. 53. 53. 53. 53.]
```

Unrelated to CuPy: to take advantage of Ray’s parallelism, you’ll want to submit all of your remote tasks before calling `ray.get`, which is a blocking call.

See Antipattern: Calling ray.get in a loop — Ray v1.9.2 for a code example!

Based on the reply from @matthewdeng, I updated my example to work better with Ray. See the updated examples below. The CuPy + Ray example is still slower. Any suggestions on how to improve the CuPy + Ray example?

## Example 3 (CuPy only)

```python
import cupy as cp
import time

def summation(x):
    x_cupy = cp.ones((1000, 1000, 20))
    y = x_cupy * x
    z = y / 2 + 3.5
    s = cp.sum(z)
    return s

tic = time.perf_counter()

sums = []

for x in range(1000):
    s = summation(x)
    sums.append(s)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(sums[0])
print(sums[-1])
```

Elapsed time is 12.38 seconds.

## Example 4 (CuPy and Ray)

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.2)
def summation(x):
    x_cupy = cp.ones((1000, 1000, 20))
    y = x_cupy * x
    z = y / 2 + 3.5
    s = cp.sum(z)
    return s

tic = time.perf_counter()

lazy_sums = []

for x in range(1000):
    s = summation.remote(x)
    lazy_sums.append(s)

sums = ray.get(lazy_sums)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(sums[0])
print(sums[-1])

ray.shutdown()
```

Elapsed time is 17.2 seconds.

Is there any difference if you increase the parallelization?

```diff
- @ray.remote(num_gpus=0.2)
+ @ray.remote(num_gpus=0.1)
```

It could also be that the function itself is really fast, so the overhead of scheduling each task becomes more noticeable: Antipattern: Too fine-grained tasks — Ray v1.9.2

No, I do not see much of a difference going from `num_gpus=0.2` to `num_gpus=0.1`. If you have a better example, please share it. I’m just trying to develop an example that demonstrates using Ray with CuPy on a GPU.