Some workers are not assigned to any task

Hi, I use 'ray start' to launch a cluster of 4 nodes. All nodes appear on the dashboard and their statuses are refreshed correctly. However, when I launch about 100 tasks to the cluster, there are one or two nodes that no tasks are assigned to, and during this time these hosts have no other work (CPU usage is very low). From the worker nodes' logs I didn't see any errors printed, nor did I use any placement groups.

Any idea how I should investigate?

Thanks,
-BS

Hi @blshao84, tasks are scheduled by Ray in a way that minimizes cross-node communication and leverages object store locality. Thus, if you have enough resources (e.g. 256 CPUs) and only schedule 100 tasks, this may happen. This usually shouldn't be a problem, as each task by default only uses one CPU, so for your workload it usually doesn't matter whether a task runs on a node where every other CPU is occupied or is the only task running on its node.

Obviously this is not true when other factors, such as network bandwidth, come into play.

What you can do here is use placement groups with a SPREAD strategy. This tries to spread the tasks as evenly as possible across nodes, so it should be exactly what you're looking for.
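For reference, here is a minimal sketch of that approach, assuming a hypothetical 4-node cluster with one 1-CPU bundle reserved per node (the exact .options() keyword for attaching tasks to a placement group depends on your Ray version; newer releases use a PlacementGroupSchedulingStrategy instead):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Reserve one 1-CPU bundle per node (4 is an assumed node count).
pg = placement_group([{"CPU": 1}] * 4, strategy="SPREAD")
ray.get(pg.ready())  # block until all bundles have been placed

@ray.remote(num_cpus=1)
def work(i):
    # placeholder for the real task body
    return i

# Tasks attached to the placement group are spread across the reserved bundles.
futures = [work.options(placement_group=pg).remote(i) for i in range(100)]
results = ray.get(futures)
```

Note that tasks attached to a placement group only run inside its reserved bundles, so the bundle sizes bound the parallelism you get.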

If this is not what you're looking for, or you need help setting this up, it would be great if you could share a bit more about your use case.

Thanks Kai, I will definitely try the placement groups, but in my case I have 100 tasks and the cluster only has 32 CPUs. I will send more info after I've tried placement groups.

Thanks,
-BS

Can you also print the output of ray.cluster_resources()?

Hi @rliaw, this is what I got. Nodes '10.23.237.188' and '10.23.186.153' have 4 CPUs each and are the problematic nodes that don't get any tasks. Let me know if you need more info.

{'CPU': 32.0, 'memory': 68120886889.0, 'object_store_memory': 31923500235.0, 'node:10.23.237.188': 1.0, 'node:10.23.113.84': 1.0, 'node:10.23.190.245': 1.0, 'node:10.23.186.153': 1.0, 'node:10.23.219.58': 1.0}

Are your tasks very short?

No, each task takes about 10-20 seconds. It turned out to be some weird host issue. After I removed those two problematic hosts from the cluster and added two new ones, everything works fine. All hosts are able to receive tasks.

The remaining question here is why I couldn't find any errors in the logs.


I guess we don't really have good error messages for connectivity issues like this. Do you have any suggestions for what we could've done better here? (Like showing that this host is not well connected to the head node, or something like that?)

I have confirmed it is a connectivity issue: after I stopped the firewall, all hosts were able to process tasks. However, I'm still unable to find out which port was blocked.

I noticed there's a debug_state.txt in the logs; maybe some connectivity information could be dumped there. For example, it could check all the ports Ray workers require. Or, even better, it could pass this info to the dashboard. What really stumped me in the first place is that everything looked fine from the dashboard, yet there was still a connectivity issue.
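Just to illustrate the kind of check that could be dumped there, here is a minimal sketch; the head node address and the port list below are assumptions (the ports a node actually needs depend on your Ray version and ray start options):

```python
import socket

# Assumed head node address (taken from the cluster resources above) and a
# hypothetical list of default Ray ports: GCS/Redis, dashboard, Ray client.
HEAD_NODE = "10.23.113.84"
PORTS_TO_CHECK = [6379, 8265, 10001]

def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in PORTS_TO_CHECK:
    state = "open" if port_reachable(HEAD_NODE, port) else "BLOCKED or unreachable"
    print(f"{HEAD_NODE}:{port} -> {state}")
```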

Hmm, did you set worker-port-range for each of your Ray nodes?

No, I didn't. The issue itself was caused by my firewall settings, but the problem is that there wasn't any error in either the dashboard or the logs.