# Example of using CuPy and Ray on the GPU

I’m trying to understand how to use Ray with a GPU. As a basic GPU test (see Example 1), I use CuPy to perform array operations on the GPU, repeating them in a for loop. I compared the CuPy example to a similar example that uses Ray with the GPU (see Example 2). As the elapsed times show, the CuPy + Ray example is much slower than the CuPy-only example. Why is the CuPy + Ray example so slow?

## Example 1 (CuPy only)

```python
import cupy as cp
import time

def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(1000):
    z = multiply(x)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])
```

Output from Example 1 is:

```
Elapsed time 10.93 s
[503. 503. 503. 503. 503.]
```

## Example 2 (CuPy and Ray)

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(1000):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Example 2 did not finish; I canceled the run because it was taking too long. See the modified version below.

Here’s a modified version using `range(100)`:

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def multiply(x):
    x_cupy = cp.ones((1000, 1000, 200))
    y = x_cupy * x
    z = y / 2 + 3.5
    return z

tic = time.perf_counter()

for x in range(100):
    lazy_z = multiply.remote(x)
    z = ray.get(lazy_z)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(z[0][0][0:5])

ray.shutdown()
```

Output from the modified version:

```
Elapsed time 154.28 s
[53. 53. 53. 53. 53.]
```

Unrelated to CuPy: to take advantage of Ray’s parallelism, you’ll want to submit all of your remote tasks before calling `ray.get`, which is a blocking call.

See Antipattern: Calling ray.get in a loop — Ray v1.9.2 for a code example!

Based on the reply from @matthewdeng, I updated my example to work better with Ray. See the updated examples below. The CuPy + Ray example is still slower. Any suggestions on how to improve the CuPy + Ray example?

## Example 3 (CuPy only)

```python
import cupy as cp
import time

def summation(x):
    x_cupy = cp.ones((1000, 1000, 20))
    y = x_cupy * x
    z = y / 2 + 3.5
    s = cp.sum(z)
    return s

tic = time.perf_counter()

sums = []

for x in range(1000):
    s = summation(x)
    sums.append(s)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(sums[0])
print(sums[-1])
```

Elapsed time is 12.38 seconds.

## Example 4 (CuPy and Ray)

```python
import cupy as cp
import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.2)
def summation(x):
    x_cupy = cp.ones((1000, 1000, 20))
    y = x_cupy * x
    z = y / 2 + 3.5
    s = cp.sum(z)
    return s

tic = time.perf_counter()

lazy_sums = []

for x in range(1000):
    s = summation.remote(x)
    lazy_sums.append(s)

sums = ray.get(lazy_sums)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(sums[0])
print(sums[-1])

ray.shutdown()
```

Elapsed time is 17.2 seconds.

Is there any difference if you increase the parallelization?

```diff
- @ray.remote(num_gpus=0.2)
+ @ray.remote(num_gpus=0.1)
```

It could also be that the function itself is really fast, so the overhead of scheduling each task becomes more noticeable: Antipattern: Too fine-grained tasks — Ray v1.9.2

No, I do not see much of a difference going from `num_gpus=0.2` to `num_gpus=0.1`. If you have a better example, please share it. I’m just trying to develop an example that demonstrates using Ray with CuPy on a GPU.