Help! Ray 2.0.0 on Windows 10 runs slower and slower and then hangs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

OS: Win 10
CPU: AMD 64-Core
Memory: 128GB
Env: Anaconda (latest version), “base” env
Python: 3.9.12
Ray: 2.0.0 for cp39. I installed it by downloading the wheel and running pip install.
Hard drive: both the C and D drives have more than 128GB available.

I kept num_cpus small, and I made sure (by watching the Windows Task Manager resource viewer) that memory usage stays below 50% at all times.

My code snippet:

import multiprocessing
import ray

NUM_WORKERS = multiprocessing.cpu_count()

# All the huge_arrays are numpy arrays; together they total around 15-20GB.
objref1 = ray.put(huge_array1)
objref2 = ray.put(huge_array2)
objref3 = ray.put(huge_array3)
objref4 = ray.put(huge_array4)

futures = []

for i in range(10000):
    futures.append(process_one_point_at_a_time.remote(i, objref1, objref2, objref3, objref4))

results = ray.get(futures)


  1. Without any error, it runs slower and slower and eventually freezes once i reaches about 1000. At that point, the Windows Task Manager resource viewer shows CPU usage around 20% and memory usage around 30%.

  2. Inside the loop there is a Scikit-Learn training function, which has its own parallel mechanism (multithreading?), and I set n_jobs=40 for it.
    So my estimate was: Ray uses num_cpus=20 and Scikit-Learn training uses n_jobs=40; together they sum to 60, which is less than 64.

  3. I timed it and found that one iteration takes around 70 seconds. So initially Ray runs very fast, issuing 20 tasks in parallel; then it gets slower and slower, and finally it freezes at around i = 1000.

It seems that Ray does not allow dispatching all 10,000 tasks at the same time?

Would it be better to send out jobs in batches, for example 100 jobs per batch?

Because the job stalls, it is now even worse than a single process, which defeats the whole point of using Ray. I am beating my head against the wall.

Please help and shed some light on this!

Thanks a lot!

Thanks for the info! It should not be a problem to run 10k tasks (or more) at a time. I tried running a version of your script that uses an empty process_one_point_at_a_time, but I did not see the same issue, so it may have something to do with what’s happening inside your function.

Can you say a bit more about the remote function? It would be even better if you could provide the code directly. Thanks!