How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
`ray.wait()` has an option, `fetch_local`. When it is set to `False`, I think Ray only checks that the object references are ready somewhere in the Ray cluster and doesn't transfer the data.
I tried to follow the C++ code, but I found that it still calls `Get()` on the objects. Here is what I found.
I noticed this because I tried Ray on a cluster that has both InfiniBand (IB) and ethernet. I ran the same script over each network, but the time spent in `ray.wait()` differed. I expected `ray.wait()` to take less time over IB because IB has higher bandwidth, but the results showed the opposite: ethernet was faster. I also timed `ray.get()`. It was faster with IB, which seems reasonable.
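One way to keep the two measurements separate is to time each phase on its own. The following is a minimal sketch of that pattern, not the exact script used here: `timed` is a hypothetical helper, and `time.sleep` stands in for the Ray calls so the snippet runs without a cluster.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed("wait", results):
    time.sleep(0.05)  # stand-in for the ray.wait(...) loop
with timed("get", results):
    time.sleep(0.01)  # stand-in for ray.get(...)

print(results)
```

`time.perf_counter()` is monotonic, so the measured intervals are not affected by system clock adjustments during a run.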
I tried another script and got some other weird results. I ran the script on two nodes, each of which has 32 cores. Each task requests `num_cpus=32`, so I can make sure both nodes are running a task:
```python
import sys
import time

import numpy as np
import ray


@ray.remote(num_cpus=32)  # one task occupies all 32 cores of a node
def set(data):
    h = data.shape
    # Put a second large array into the object store and return its ObjectRef.
    x = ray.put(np.ones(1024 * 1024 * 1024))
    time.sleep(15)
    return x


if __name__ == '__main__':
    # Start Ray.
    ray.init(address='auto')
    print(ray.cluster_resources())
    num_node = int(sys.argv[1])  # number of nodes, passed on the command line
    print('get time,wait time,remote time')
    for _ in range(10):
        remote_time = time.time()
        lix = [set.remote(np.ones(1024 * 1024 * 1024)) for _ in range(num_node)]
        wait_time = time.time()
        li = lix
        # Wait until every task has finished, without fetching results locally.
        while len(li) != 0:
            done, li = ray.wait(li, fetch_local=False)
        start_get = time.time()
        l = ray.get(lix)
        end_get = time.time()
        print(end_get - start_get, start_get - wait_time, wait_time - remote_time, len(l))
```
Here is the result:
With IB, the wait time is short but varies enormously, and the remote time is longer than with ethernet. With ethernet, the wait time and remote time are long but stable.
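To put a number on "varies enormously" versus "stable", the per-iteration times can be summarized by their mean and standard deviation. A small sketch using only the standard library follows; `summarize` is a hypothetical helper, and the sample values are made-up placeholders, not the measurements from this cluster.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return mean, stdev, stdev / mean


# Placeholder values for illustration only, NOT measured results.
jittery = [1.0, 5.0, 2.0, 8.0]
steady = [10.0, 10.5, 9.5, 10.0]

for name, samples in (("jittery", jittery), ("steady", steady)):
    mean, stdev, cv = summarize(samples)
    print(f"{name}: mean={mean:.2f}s stdev={stdev:.2f}s cv={cv:.2f}")
```

The coefficient of variation (standard deviation divided by mean) makes a short but noisy series comparable with a long but steady one.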
These results confuse me. Can someone explain them?