I have a local cluster, 19 nodes including the head, with 436 total workers… I’ve confirmed they are alive and available… I have created a large queue of tasks, via a list of calls to some function.remote, and then am processing the results via ray.wait … when looking at the dashboard, it is only showing 76/436 being utilized… and really only 4 of 19 nodes… the rest are all showing 0/24 … I’m on Ray 1.2 … the function itself is chunking a .csv file and returning some match if available. Any suggestions to troubleshoot?
In Ray 1.2 (and up until 1.3), Ray prefers to saturate nodes one by one. That means tasks are unlikely to go to other nodes until the first N nodes are fully saturated.
In 1.3, you can change this behavior so that tasks are randomly load-balanced across all nodes in the cluster by starting the head node with a system config:
ray start --head --system-config='{"scheduler_loadbalance_spillback":true}'
Note that 1.3 will be released in roughly 2 weeks.
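To make the difference between the two placement policies concrete, here is a toy model in plain Python. It is not Ray's scheduler code, only a sketch of "fill nodes one by one" versus "pick a random node with free capacity". The function names are made up for illustration, and the uniform 19 nodes x 24 slots layout is a simplifying assumption (the cluster in question totals 436 workers, so its nodes are not all identical).

```python
import random
from typing import List


def pack(num_tasks: int, num_nodes: int, slots_per_node: int) -> List[int]:
    """Toy 'saturate nodes one by one' policy: first node with a free slot wins."""
    load = [0] * num_nodes
    for _ in range(num_tasks):
        for node in range(num_nodes):
            if load[node] < slots_per_node:
                load[node] += 1
                break
        # If every node is full, the task is simply not placed in this toy model.
    return load


def spread_randomly(num_tasks: int, num_nodes: int, slots_per_node: int,
                    seed: int = 0) -> List[int]:
    """Toy random load balancing: pick any node that still has a free slot."""
    rng = random.Random(seed)
    load = [0] * num_nodes
    for _ in range(num_tasks):
        free = [n for n in range(num_nodes) if load[n] < slots_per_node]
        if not free:
            break  # cluster is full in this toy model
        load[rng.choice(free)] += 1
    return load


# 76 concurrently running tasks, as reported on the dashboard.
packed = pack(76, 19, 24)
spread = spread_randomly(76, 19, 24)
print("packed:", packed)
print("busy nodes when packing:", sum(1 for x in packed if x > 0))
print("busy nodes when spreading:", sum(1 for x in spread if x > 0))
```

With packing, 76 running tasks land on exactly 4 nodes (24 + 24 + 24 + 4), which lines up with the "4 of 19 nodes" seen on the dashboard. The model says nothing about why only about 76 tasks are running at once; that part is still open.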
thanks for that update… but if the number of tasks is significantly greater than the number of workers, i.e. I submitted like 240k tasks to 436 workers… why does the dashboard seem to only show PIDs for 76-100 out of 436 workers … when I took a different tack… and instead moved the task to an Actor, and then created an ActorPool of ~400 workers… I then see the cluster being saturated with work on the dashboard… I’m just wondering is it just a dashboard thing… when submitting tasks, is it really not utilizing the cluster, or is it just a reporting issue?
Hmm, I see. I have a couple of questions:
- How long are your tasks?
- Can you also try running
ray status
and see if the number of nodes is correct?
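If it is easier to check from the driver script, something like the sketch below can complement `ray status`. It assumes `ray.nodes()` returns a list of dicts with an "Alive" flag and a "Resources" dict containing "CPU"; `summarize_nodes` is a hypothetical helper name, not part of Ray.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, float]:
    """Count alive/dead nodes and sum the CPUs advertised by the alive ones."""
    alive = [n for n in nodes if n.get("Alive")]
    total_cpus = sum(n.get("Resources", {}).get("CPU", 0) for n in alive)
    return {
        "alive_nodes": len(alive),
        "dead_nodes": len(nodes) - len(alive),
        "total_cpus": total_cpus,
    }


# On a machine that is part of the running cluster (assumes Ray is installed):
#   import ray
#   ray.init(address="auto")
#   print(summarize_nodes(ray.nodes()))
#   print(ray.cluster_resources())
```

For the cluster described above you would hope to see 19 alive nodes and a CPU total matching the 436 workers; anything lower would suggest some nodes never joined.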