Run Python function in parallel on GPU

I’m running the example shown below on a machine with 80 CPU cores and 4 GPUs. Each GPU is an Nvidia Tesla V100. The elapsed time for the example is 13 seconds.

import ray
import time

ray.init()

@ray.remote
def squared(x):
    time.sleep(1)
    y = x**2
    return y

tic = time.perf_counter()

lazy_values = [squared.remote(x) for x in range(1000)]
values = ray.get(lazy_values)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(f'{values[:5]} ... {values[-5:]}')

ray.shutdown()

I modified the example to run on the GPU as shown in the code below. The elapsed time for the GPU version is also 13 seconds.

import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.01)
def squared(x):
    time.sleep(1)
    y = x**2
    return y

tic = time.perf_counter()

lazy_values = [squared.remote(x) for x in range(1000)]
values = ray.get(lazy_values)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(f'{values[:5]} ... {values[-5:]}')

ray.shutdown()

I’m not seeing any speedup using the GPU compared to the CPU. How do I define num_gpus to take full advantage of the GPU’s parallelism? Is there something else I need to set up for Ray to properly use the GPU? Or is this just a bad example for running on a GPU?


Hey @wigging,

Ray itself will not utilize the GPUs, and it is left up to the application code to implement such optimizations.

If you haven’t seen it before, GPU Support — Ray v1.9.2 describes the behavior in more detail. Some of the "Note" callouts are particularly useful for clearing up common misconceptions!

Hi, Matthew.

Will CUDA threads also be restricted (divided) when specifying num_gpus=0.x for a worker? (I understand it divides the GPU memory.)

My use case is that I have a simulator written in C++, and it tries to invoke all CUDA threads for its calculations. In this case I cannot get parallelism with one GPU, as parallel processes sequentially request all the CUDA threads before the next process gets a turn. I wonder if replacing the processes with Ray workers would help increase parallelism in this case.

I think I understand what you are saying. In my example, the GPU is not actually used because the code in the body of the function is not written to run on the GPU. I would need to use some code written with CuPy in the function. Is that correct?

Also, if I change num_gpus=0.01 to num_gpus=0.1 in my example then I get a slower elapsed time of 100 seconds. So why does this affect the execution time if the GPU isn’t actually used?

Hey @LoveRay, AFAIK it won’t actually divide the memory or threads, similar to how requesting CPUs won’t actually restrict any task/actor to that number of CPUs.

I would need to use some code written with CuPy in the function. Is that correct?

Yep that sounds correct to me!
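For reference, here is a minimal sketch of what that could look like, assuming CuPy is installed and at least one GPU is visible to the worker (the function name squared_gpu is just illustrative):

import ray
import cupy as cp

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.01)
def squared_gpu(x):
    # cp.asarray copies the input to GPU memory; the squaring runs as a CUDA kernel
    y = cp.asarray(x) ** 2
    # copy the result back to host memory before returning it to the driver
    return cp.asnumpy(y)

values = ray.get([squared_gpu.remote(x) for x in range(1000)])

ray.shutdown()

Keep in mind that squaring a single number on the GPU won’t be faster than on the CPU; the per-task work has to be large enough (big arrays, heavy kernels) for the transfer and kernel-launch overhead to pay off.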

Also, if I change num_gpus=0.01 to num_gpus=0.1 in my example then I get a slower elapsed time of 100 seconds. So why does this affect the execution time if the GPU isn’t actually used?

Based on your setup from the original post, if you have 4 GPUs available then you can have 400 tasks scheduled at the same time when num_gpus=0.01 and only 40 tasks when num_gpus=0.1, so in the latter case tasks will spend more time waiting for resources to free up before being scheduled.

How did you determine that there can be a total of 400 tasks scheduled for 4 GPUs? Is each GPU limited to 100 tasks?

Ah, so you can think of a GPU as an arbitrary resource. Each task will reserve num_gpus=0.01 GPU resources, so 1 GPU resource will allow 100 tasks.

When num_gpus=0.1, each GPU resource would allow 10 tasks. Similarly, if you tried setting num_gpus=0.25, each GPU resource would only allow 4 tasks.
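If it helps, here is a small sketch of that bookkeeping, assuming the 4 GPUs from the original post; ray.cluster_resources() reports the resources Ray is scheduling against:

import ray

ray.init()

# The resources Ray knows about, e.g. {'CPU': 80.0, 'GPU': 4.0, ...} on the original machine
print(ray.cluster_resources())

# Back-of-the-envelope math for how many tasks can hold a GPU slice at once
gpus = ray.cluster_resources().get('GPU', 0)
for per_task in (0.01, 0.1, 0.25):
    print(f'num_gpus={per_task}: at most {round(gpus / per_task)} concurrent tasks')

ray.shutdown()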

Thanks for the reply. So I wonder whether, in my case, specifying less than one GPU for each worker would work.

My use case is that I have a simulator written in C++, and it tries to invoke all CUDA threads for its calculations. When using the normal multiprocessing package, I cannot get parallelism with one GPU, as parallel processes sequentially request all the CUDA threads before the next process gets a turn. I wonder if replacing the processes with Ray workers (num_gpus=0.x) would help increase parallelism in this case.

To my knowledge it wouldn’t, as Ray doesn’t interact with the GPUs directly and mostly just uses them for bookkeeping. Ray does set the CUDA_VISIBLE_DEVICES environment variable, though in this case if you’re sharing a single GPU across multiple tasks it wouldn’t provide any isolation.
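For what it’s worth, a rough sketch (assuming at least one GPU is registered with Ray) of what each worker is assigned; ray.get_gpu_ids() returns the GPU ids Ray reserved for the current task:

import os
import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.1)
def show_gpu_assignment():
    # Ray sets CUDA_VISIBLE_DEVICES for the worker process; with fractional GPUs
    # several workers share the same device id, so there is no hardware isolation
    return ray.get_gpu_ids(), os.environ.get('CUDA_VISIBLE_DEVICES')

print(ray.get(show_gpu_assignment.remote()))

ray.shutdown()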

I’ve never tried the case you mentioned myself - if it’s not too complicated you could perhaps test it out?

Hi, Matthew. Thanks for the reply. I understand your concern, and I tried to run it myself, but running the simulator on a Ray worker requires a CUDA context. With normal multiprocessing, I can just use:

import multiprocessing as mp

context = mp.get_context('spawn')  # NOTE: without the 'spawn' context this fails because the worker uses CUDA!
self.ps = context.Process(target=worker, args=(...))

And if I don’t specify the context as above, I receive errors like the following:

# can be solved with the 'spawn' context
CUDA error at /root/liuwj/Skeletal-Fluid/src/MeshLoad/MeshDatatGPUKernel.cu:37 code=101(cudaErrorInvalidDevice) "cudaGetLastError()"

When I replaced multiprocessing with Ray, exactly the same error message was produced, so I was not able to test this with Ray workers. I suspect it is a problem with the CUDA context, but I could not finish my test, so I raised a new thread here.