How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
OS: Win 10
CPU: AMD 64-Core
Env: Anaconda (latest version), "base" env
Ray: 2.0.0 for cp39, installed by downloading the wheel and then running pip install.
Hard drive: both the C and D drives have more than 128 GB available.
I kept num_cpus small, and I also made sure (by watching the Windows Task Manager resource viewer) that memory usage stays below 50% at all times.
My code snippet:
import multiprocessing
import ray

NUM_WORKERS = multiprocessing.cpu_count()

objref1 = ray.put(huge_array1)
objref2 = ray.put(huge_array2)
objref3 = ray.put(huge_array3)
objref4 = ray.put(huge_array4)

futures = []
for i in range(10000):
    futures.append(process_one_point_at_a_time.remote(i, objref1, objref2, objref3, objref4))
results = ray.get(futures)
Without raising any error, it runs slower and slower and eventually freezes once i reaches around 1000. At that point, the Windows Task Manager resource viewer shows CPU usage around 20% and memory usage around 30%.
Inside each task there is a Scikit-Learn training function, which has its own parallel mechanism (multithreading?), and I set n_jobs = 40 for it.
So my estimate was: Ray uses num_cpus = 20 and Scikit-Learn training uses n_jobs = 40; together they sum to 60, which is less than 64.
I timed it and found that one iteration takes around 70 seconds. So initially Ray runs very fast, issuing 20 tasks in parallel. Then it gets slower and slower, and finally it freezes at around i = 1000.
Does Ray not allow dispatching all 10,000 tasks at the same time?
Would it be better to submit the jobs in batches, for example 100 jobs per batch?
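To illustrate what I mean by batching, here is a minimal sketch of the pattern I have in mind. Plain functions stand in for the Ray calls, since my real code depends on my data; the Ray version is hedged into comments.

```python
def run_in_batches(submit, get_results, n_tasks, batch_size=100):
    """Submit tasks in fixed-size batches and wait for each batch to
    finish before issuing the next, so there are never thousands of
    pending futures at once."""
    results = []
    for start in range(0, n_tasks, batch_size):
        # Submit only this batch, then block until it completes.
        futures = [submit(i) for i in range(start, min(start + batch_size, n_tasks))]
        results.extend(get_results(futures))
    return results

# With Ray this would be roughly:
#   submit      = lambda i: process_one_point_at_a_time.remote(i, objref1, objref2, objref3, objref4)
#   get_results = ray.get
# Stand-in demo with plain functions:
print(run_in_batches(submit=lambda i: i * i, get_results=list, n_tasks=10, batch_size=4))
# → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Would this kind of batching be the recommended workaround, or is there a built-in way to bound the number of in-flight tasks?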
Because the job stalls, it is now even worse than a single process, which defeats the whole point of using Ray. I am beating my head against the wall.
Please help and shed some light on this!
Thanks a lot!