Some workers are not assigned to any task

Hi, I use 'ray start' to launch a cluster of 4 nodes. All nodes appear on the dashboard and their statuses are refreshed correctly. However, when I launch about 100 tasks to the cluster, there are one or two nodes that no tasks are assigned to, and during this time these hosts have no other work (CPU usage is very low). From the worker nodes' logs I didn't see any errors printed, nor did I use any placement groups.

Any idea how I should investigate?

Thanks,
-BS

Hi @blshao84, tasks are scheduled by Ray in a way that minimizes cross-node communication and leverages object store locality. Thus, if you have enough resources (e.g. 256 CPUs) and only schedule 100 tasks, this may happen. This usually shouldn't be a problem, as each task by default only uses one CPU, so for your workload it usually doesn't matter whether a task runs on a node where every other CPU is occupied or is the only task running on its node.

Obviously this is not true when other factors, such as network bandwidth, come into play.

What you can do here is use placement groups with a SPREAD strategy. This tries to spread the tasks as evenly as possible across nodes, so it should be exactly what you're looking for.
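For reference, here is a minimal sketch of that approach, assuming a hypothetical 4-node cluster with one 1-CPU bundle reserved per node (the exact .options() keyword for attaching tasks to a placement group depends on your Ray version; newer releases use a PlacementGroupSchedulingStrategy instead):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Reserve one 1-CPU bundle per node (4 is an assumed node count).
pg = placement_group([{"CPU": 1}] * 4, strategy="SPREAD")
ray.get(pg.ready())  # block until all bundles have been placed

@ray.remote(num_cpus=1)
def work(i):
    # placeholder for the real task body
    return i

# Tasks attached to the placement group are spread across the reserved bundles.
futures = [work.options(placement_group=pg).remote(i) for i in range(100)]
results = ray.get(futures)
```

Note that tasks attached to a placement group only run inside its reserved bundles, so the bundle sizes bound the parallelism you get.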

If this is not what you're looking for, or you need help setting this up, it would be great if you could share a bit more about your use case.

Thanks Kai, I will definitely try the placement groups, but in my case I have 100 tasks and the cluster only has 32 CPUs. I will send more info after I've tried placement groups.

Thanks,
-BS

Can you also print the output of ray.cluster_resources()?

Hi @rliaw, this is what I got. Nodes '10.23.237.188' and '10.23.186.153' have 4 CPUs each and are the problematic nodes that don't get any tasks. Let me know if you need more info.

{'CPU': 32.0, 'memory': 68120886889.0, 'object_store_memory': 31923500235.0, 'node:10.23.237.188': 1.0, 'node:10.23.113.84': 1.0, 'node:10.23.190.245': 1.0, 'node:10.23.186.153': 1.0, 'node:10.23.219.58': 1.0}

Are your tasks very short?

No, each task takes about 10-20 seconds. It turned out to be some weird host issue. After I removed those two problematic hosts from the cluster and added two new ones, everything works fine. All hosts are able to receive tasks.

The remaining question here is why I couldn't find any errors in the logs.


I guess we don't really have good error messages for connectivity issues like this. Do you have any suggestions for what we could've done better here? (Like showing that this host is not well connected to the head node, or something like that?)

I have confirmed it is a connectivity issue: after I stopped the firewall, all hosts were able to process tasks. However, I'm still unable to find out which port was blocked.

I noticed there's a debug_state.txt in the logs; maybe some connectivity information could be dumped there. For example, it could check all the ports Ray workers require. Or, even better, it could pass this info to the dashboard. What really stumped me in the first place is that everything looked fine from the dashboard, yet there was still a connectivity issue.
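Just to illustrate the kind of check that could be dumped there, here is a minimal sketch; the head node address and the port list below are assumptions (the ports a node actually needs depend on your Ray version and ray start options):

```python
import socket

# Assumed head node address (taken from the cluster resources above) and a
# hypothetical list of default Ray ports: GCS/Redis, dashboard, Ray client.
HEAD_NODE = "10.23.113.84"
PORTS_TO_CHECK = [6379, 8265, 10001]

def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in PORTS_TO_CHECK:
    state = "open" if port_reachable(HEAD_NODE, port) else "BLOCKED or unreachable"
    print(f"{HEAD_NODE}:{port} -> {state}")
```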

Hmm, did you set worker-port-range for each of your Ray nodes?

No, I didn't. The issue itself was caused by my firewall settings, but the problem is that there wasn't any error in either the dashboard or the logs.