Cluster not being utilized

Jeffrey_Waldron · March 27, 2021, 2:20pm

I have a local cluster of 19 nodes (including the head) with 436 total workers, and I’ve confirmed they are alive and available. I have created a large queue of tasks via a list of calls to some function.remote, and am processing the results via ray.wait. When I look at the dashboard, it only shows 76/436 workers being utilized, and really only 4 of 19 nodes; the rest all show 0/24. I’m on Ray 1.2. The function itself chunks a .csv file and returns a match if one is available. Any suggestions for troubleshooting?
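For reference, the submit-and-wait pattern described here might look roughly like the sketch below. It is only an illustration under assumptions: `match_in_chunk`, `drain`, `run_on_cluster`, and the `needle` parameter are hypothetical names, and the real function in this thread is not shown. The draining loop takes the wait/get functions as arguments so it can be read (and tried) without a cluster.

```python
import csv
import io


def match_in_chunk(chunk_text, needle):
    """Return the first CSV row in chunk_text that contains needle, else None."""
    for row in csv.reader(io.StringIO(chunk_text)):
        if needle in row:
            return row
    return None


def drain(pending, wait, get):
    """Collect non-None results as tasks finish.

    wait and get are expected to behave like ray.wait and ray.get:
    wait(pending, num_returns=1) returns (ready, not_ready).
    """
    matches = []
    while pending:
        ready, pending = wait(pending, num_returns=1)
        for ref in ready:
            result = get(ref)
            if result is not None:
                matches.append(result)
    return matches


def run_on_cluster(chunks, needle):
    """Submit one task per chunk to an already running cluster (hypothetical helper)."""
    import ray  # imported lazily so the helpers above work without Ray installed

    ray.init(address="auto")  # assumes a cluster started with `ray start`
    remote_match = ray.remote(match_in_chunk)
    pending = [remote_match.remote(chunk, needle) for chunk in chunks]
    return drain(pending, ray.wait, ray.get)
```

With this shape, all tasks are submitted up front and results are consumed one at a time as they complete.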

sangcho · March 27, 2021, 10:20pm

In Ray 1.2 (and until 1.3), Ray prefers to saturate nodes one by one. That means tasks are unlikely to go to other nodes until the first N nodes are fully saturated.

In 1.3, you can change this behavior to randomly load balance tasks across all nodes in the cluster by starting the head node with --system-config:


ray start --head --system-config='{"scheduler_loadbalance_spillback":true}'

sangcho · March 27, 2021, 10:21pm

Note that 1.3 will be released in about two weeks.

Jeffrey_Waldron · March 28, 2021, 1:38pm

Thanks for that update. But if the number of tasks is significantly greater than the number of workers (I submitted something like 240k tasks to 436 workers), why does the dashboard only seem to show a PID for 76-100 out of 436 workers? When I took a different tack and instead moved the task into an Actor, then created an ActorPool of ~400 workers, I did see the cluster being saturated with work on the dashboard. I’m just wondering whether this is only a dashboard thing: when submitting tasks, is it really not utilizing the cluster, or is it just a reporting issue?
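A minimal sketch of the Actor plus ActorPool variant mentioned here, assuming the work can be expressed as a method on a small class. `ChunkMatcher`, `run_with_actor_pool`, and the `needle` argument are hypothetical names chosen for illustration; only `ray.remote`, `ray.init`, and `ray.util.ActorPool` with `map_unordered` are actual Ray calls.

```python
class ChunkMatcher:
    """Plain class; wrapped with ray.remote below to become an actor."""

    def __init__(self, needle):
        self.needle = needle

    def match(self, chunk_text):
        """Return the first line with a comma-separated field equal to needle."""
        for line in chunk_text.splitlines():
            if self.needle in line.split(","):
                return line
        return None


def run_with_actor_pool(chunks, needle, num_actors=400):
    """Spread chunks over a fixed pool of actors (hypothetical helper)."""
    import ray  # imported lazily so ChunkMatcher can be used without Ray
    from ray.util import ActorPool

    ray.init(address="auto")  # assumes a cluster started with `ray start`
    actor_cls = ray.remote(ChunkMatcher)
    pool = ActorPool([actor_cls.remote(needle) for _ in range(num_actors)])
    # map_unordered yields results as individual actors finish their chunk.
    results = pool.map_unordered(
        lambda actor, chunk: actor.match.remote(chunk), chunks
    )
    return [r for r in results if r is not None]
```

Unlike plain tasks, the actors here are created up front, so where they are placed is decided once at creation rather than per task.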

sangcho · March 30, 2021, 7:23pm

Hmm, I see. I have a couple of questions:

How long do your tasks run?
Can you also try running ray status and check that the number of nodes is correct?
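Besides `ray status`, the same node check can be done from Python. The sketch below is an illustration, assuming the dicts returned by `ray.nodes()` carry `Alive`, `NodeManagerAddress`, and `Resources` keys as they do in Ray 1.x; `summarize_nodes` and `summarize_live_cluster` are hypothetical helper names.

```python
def summarize_nodes(nodes):
    """Return (alive_count, total_cpus, cpus_per_address) for ray.nodes()-style dicts.

    Nodes are keyed by address, so two nodes sharing one address would be merged.
    """
    per_node = {}
    for node in nodes:
        if not node.get("Alive"):
            continue  # skip nodes the cluster considers dead
        address = node.get("NodeManagerAddress", "unknown")
        per_node[address] = node.get("Resources", {}).get("CPU", 0)
    return len(per_node), sum(per_node.values()), per_node


def summarize_live_cluster():
    """Connect to a running cluster and summarize it (hypothetical helper)."""
    import ray  # imported lazily so summarize_nodes works without Ray

    ray.init(address="auto")  # assumes a cluster started with `ray start`
    return summarize_nodes(ray.nodes())
```

For the cluster in this thread, one would expect 19 alive nodes; a lower count would point to nodes that never joined rather than to a scheduling or dashboard issue.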
