I need help! A remote task takes very long to execute in Ray 1.13 when CUDA is involved

How severe does this issue affect your experience of using Ray?
High: It blocks me from completing my task.

Hi there, I found that executing a remote task in Ray 1.13 takes very long when CUDA is involved.
I used the Ray client to submit a CUDA task and found that the execution took about 2 s, which is very slow, while if I removed the @ray.remote decorator the execution took far less than 2 s. I am on a local area network, so the network speed is fast. The remote task is much slower than the local call, so I need help to solve this problem. Thank you for your response.

Python: 3.10.5
Ray: 1.13 release
CUDA: 11.4
Torch: 1.12
Network: LAN

Here is my code:

import time
import ray
import torch

@ray.remote(num_gpus=1)
def test_cuda():
    cuda = torch.device('cuda:0')  # Default CUDA device
    x = torch.tensor([1., 2.], device=cuda)
    y = torch.tensor([1., 2.], device=cuda)
    z = x + y

ray.init(address="xx.xx.xx.xx:xxxx")

for i in range(10):
    start_detect = time.time()

    ref = test_cuda.remote()
    ray.get(ref)

    end_detect = time.time()
    print("Time used to detect - ", end_detect - start_detect)

Hello. Does anyone know why the remote function test_cuda() runs much slower than the same function without the @ray.remote decorator? What can I do to speed up the remote test_cuda() so that it runs as fast as, or at least not slower than, the undecorated function? Thank you for your help!

Hi @JUAN_CHEN,

I’ll try to reproduce it and get back to you.

Hi @JUAN_CHEN,

Instead of using the Ray client, could you try running the program directly on the head node and see how fast it is? You just need to change ray.init(address="xxx.xx") to ray.init(address="auto"). Could you also record the start and end time inside test_cuda to see the actual runtime of the function? Running a Ray remote function does come with overhead (e.g. shipping the function definition to the remote node, starting a remote worker process, etc.), so we need to confirm that it is this overhead that makes it slow.


Thank you @jjyao for your response. I followed your suggestion and changed my code to this:
import time
import ray
import torch

@ray.remote(num_gpus=1)
def test_cuda():
    start_cuda = time.time()
    cuda = torch.device('cuda:0')  # Default CUDA device
    x = torch.tensor([1., 2.], device=cuda)
    y = torch.tensor([1., 2.], device=cuda)
    z = x + y
    end_cuda = time.time()
    print("time inside cuda is ", end_cuda - start_cuda)

ray.init(address="auto")

for i in range(10):
    start_detect = time.time()

    ref = test_cuda.remote()
    ray.get(ref)
    # test_cuda()

    end_detect = time.time()
    print("Time used to detect - ", end_detect - start_detect)

print("end")

The result is:
(test_cuda pid=4004959) time inside cuda is 2.4671072959899902
Time used to detect - 3.4718945026397705
Time used to detect - 2.274562120437622
(test_cuda pid=4005014) time inside cuda is 1.4709711074829102
Time used to detect - 2.2733371257781982
(test_cuda pid=4005044) time inside cuda is 1.4566268920898438
Time used to detect - 2.3180081844329834
(test_cuda pid=4005074) time inside cuda is 1.5230095386505127
Time used to detect - 2.3149635791778564
(test_cuda pid=4005105) time inside cuda is 1.4816842079162598
Time used to detect - 2.2338507175445557
(test_cuda pid=4005135) time inside cuda is 1.439739465713501
Time used to detect - 2.2487499713897705
(test_cuda pid=4005167) time inside cuda is 1.4355778694152832
Time used to detect - 2.232961893081665
(test_cuda pid=4005198) time inside cuda is 1.4322755336761475
Time used to detect - 2.205812931060791
(test_cuda pid=4005229) time inside cuda is 1.4372191429138184

So you can see that the test_cuda() function itself runs for around 1.4 s, which is still too long.
If I remove the @ray.remote decorator, test_cuda() only takes around 0.00X s.

Since we need to share the CUDA device among multiple processes, I switched my code to an actor-based solution. The code is like this:
from glob import glob
import time
import ray
import torch

@ray.remote(num_gpus=1)
class Test_Cuda:
    def __init__(self):
        self.cuda_1 = None

    def Do(self):
        # if self.cuda_1 == None:
        self.cuda_1 = torch.device('cuda:0')  # Default CUDA device
        start_cuda = time.time()
        x = torch.tensor([1., 2.], device=self.cuda_1)
        y = torch.tensor([1., 2.], device=self.cuda_1)
        z = x + y
        end_cuda = time.time()
        print("time inside cuda is ", end_cuda - start_cuda)

ray.init(address="auto")

test_eng = Test_Cuda.remote()
for i in range(100):
    start_detect = time.time()

    ref = test_eng.Do.remote()
    ray.get(ref)

    end_detect = time.time()
    print("Time used to detect - ", end_detect - start_detect)

print("end")

I found that every Do() call runs in the same process, which makes it run much faster:

(Test_Cuda pid=4010482) time inside cuda is 7.389885425567627
Time used to detect - 8.325343370437622
Time used to detect - 0.002629995346069336
Time used to detect - 0.0019180774688720703
Time used to detect - 0.0018551349639892578
Time used to detect - 0.0019295215606689453
Time used to detect - 0.001873016357421875
Time used to detect - 0.0016896724700927734
Time used to detect - 0.0018720626831054688
Time used to detect - 0.0018792152404785156
Time used to detect - 0.0017347335815429688
end
(Test_Cuda pid=4010482) time inside cuda is 0.0002753734588623047
(Test_Cuda pid=4010482) time inside cuda is 0.00023245811462402344
(Test_Cuda pid=4010482) time inside cuda is 0.0002257823944091797
(Test_Cuda pid=4010482) time inside cuda is 0.0002238750457763672
(Test_Cuda pid=4010482) time inside cuda is 0.0002219676971435547
(Test_Cuda pid=4010482) time inside cuda is 0.0002124309539794922
(Test_Cuda pid=4010482) time inside cuda is 0.00024700164794921875
(Test_Cuda pid=4010482) time inside cuda is 0.00021386146545410156
(Test_Cuda pid=4010482) time inside cuda is 0.0002117156982421875

However, the actor solution is not ideal for us: we have 2 input cameras and 5 GPU nodes, so with actors we have to do the scheduling ourselves in our driver code instead of letting Ray do it.

Could you elaborate on why an actor is not desirable? What do you mean by doing the scheduling yourself? It seems running your code in the same process is fast (probably due to some cached setup); if that's the case, you could create a pool of actors and dispatch tasks to them?
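A rough sketch of that idea with ray.util.ActorPool, reusing the Test_Cuda actor from the code above (the pool size of 5 is only an assumption mirroring the 5 GPU nodes mentioned in this thread):

import ray
from ray.util import ActorPool

ray.init(address="auto")

# Assumption: Test_Cuda is the @ray.remote(num_gpus=1) actor class defined earlier in this thread.
actors = [Test_Cuda.remote() for _ in range(5)]
pool = ActorPool(actors)

# Dispatch one Do() call per item of work; the pool picks whichever actor is free.
work_items = range(10)  # placeholder for frames coming from the two cameras
for _ in pool.map_unordered(lambda actor, item: actor.Do.remote(), work_items):
    pass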

@jjyao, thanks for helping us with this issue. ActorPool is a solution for us; we can try that. Can you also help us identify why using a task (rather than an actor) is so slow? In our real project we use third-party libraries for object detection that use PyTorch. We tried both a task and an actor (kept in the same process): both take seconds to run each task, but if we take Ray out, the same code runs each task in milliseconds.

Can you verify that the code below takes seconds to run each task? This is the original question we posted.

import time
import ray
import torch

@ray.remote(num_gpus=1)
def test_cuda():
    cuda = torch.device('cuda:0')  # Default CUDA device
    x = torch.tensor([1., 2.], device=cuda)
    y = torch.tensor([1., 2.], device=cuda)
    z = x + y

ray.init(address="xx.xx.xx.xx:xxxx")

for i in range(10):
    start_detect = time.time()
    ref = test_cuda.remote()
    ray.get(ref)
    end_detect = time.time()
    print("Time used to detect - ", end_detect - start_detect)

I was able to reproduce the same result.

The first run of test_cuda() in a process takes around 1.5 seconds regardless of whether it uses ray.remote or not. However, the remaining runs in the same process are very fast (i.e. milliseconds). If you use a Ray task, then effectively each run is on a fresh process, so every run takes 1.5 seconds. If you don't use ray.remote, or if you use a Ray actor, then all runs are in the same process, so every run except the first one is fast.
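A quick way to see this (a rough sketch that just prints the worker pid from inside the task):

import os
import ray
import torch

ray.init(address="auto")

@ray.remote(num_gpus=1)
def which_worker():
    # Touch the GPU so this worker creates its own CUDA context.
    torch.tensor([1.0], device='cuda:0')
    return os.getpid()

# With tasks, the pid typically changes from call to call (a fresh worker each time),
# which matches the ~1.5 s per call seen above; an actor keeps the same pid.
print([ray.get(which_worker.remote()) for _ in range(3)])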

I’m not sure what PyTorch does to make subsequent runs in the same process faster (my guess is that there is some initialization overhead that only happens once). @cade, do you know?


+1 – Torch will initialize a CUDA context on first use of the GPU. CUDA context creation is expensive, as torch will load a bunch of kernels onto the device and also initialize the caching allocator. However, once this CUDA context has been created, the process can reuse it for the lifetime of the Torch program.
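Given that, one option (a rough sketch, not something from this thread) is to pay the CUDA-context cost once in the actor's __init__ so that even the first Do() call is fast:

import ray
import torch

@ray.remote(num_gpus=1)
class Test_Cuda:
    def __init__(self):
        # Touch the GPU once at actor startup so torch creates its CUDA context
        # (loading kernels, initializing the caching allocator) here rather than
        # on the first Do() call.
        self.cuda_1 = torch.device('cuda:0')
        torch.zeros(1, device=self.cuda_1)

    def Do(self):
        x = torch.tensor([1., 2.], device=self.cuda_1)
        y = torch.tensor([1., 2.], device=self.cuda_1)
        return (x + y).cpu()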


@jjyao, thanks for reproducing this.