Local Ray cluster won't send any tasks to worker node

I am experimenting with Ray and have set up a cluster on my LAN, connecting two laptops. Laptop A is the head node, and laptop B is a worker node.

They are connected correctly, as far as I can tell, and the dashboard lists both computers with 8 worker processes each (reflecting the 8 CPU cores in each machine).

However, when I run my Python script with ray.init(params), it exhibits strange behavior: Ray sends all of the tasks to laptop A (the head node), which is fully utilized, while laptop B (the worker node) sits completely idle. Nothing gets sent there.
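(For reference, my driver script is structured roughly like the sketch below; the head-node address and the task body are placeholders, not my actual workload.)

```python
import time
import ray

# Rough sketch of the driver script; the address is a placeholder for my head node.
ray.init(address="192.168.4.171:6379")

@ray.remote
def busy_work(i):
    time.sleep(1)  # stand-in for the real CPU-bound task
    return i

# Submit more tasks than one laptop has cores, so the scheduler has a
# reason to spill work onto the worker node.
results = ray.get([busy_work.remote(i) for i in range(32)])
print(len(results))
```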

I checked the logs, and in gcs_server.out it looks like the worker processes are exiting (note that the IP address here is the IP of the worker node):

[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = 878648274e71834811460c710aa974af98c0707478430210a9a6b288, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker 878648274e71834811460c710aa974af98c0707478430210a9a6b288 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:06,617 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:06,618 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:17,113 I 14831 5032034] (gcs_server) gcs_server.cc:188: GcsNodeManager: 
- RegisterNode request count: 3
- DrainNode request count: 1
- GetAllNodeInfo request count: 273
- GetInternalConfig request count: 5

What could be causing this kind of error? How can I get Ray to send tasks to the worker nodes?

@sangcho / @Chen_Shen, perhaps you might know how to answer this one?

Do you have any logs for the workers on laptop B? Usually there will be some errors telling you why the worker failed (and that seems to be the root cause in your case).

https://docs.ray.io/en/latest/ray-core/configure.html#ray-ports

Did you make sure all ports are properly open?
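If it helps, here is a rough way to check reachability from laptop B to the head node (the IP and the port list are placeholders; see the linked page for the full set of ports, and note that the head also needs to reach the worker's ports):

```python
import socket

# Hedged sketch: quick TCP reachability check from the worker laptop to the head.
HEAD_IP = "192.168.4.171"  # placeholder -- substitute your head node's IP
PORTS_TO_CHECK = [6379, 8265, 10001]  # default GCS, dashboard, Ray client ports

for port in PORTS_TO_CHECK:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        result = s.connect_ex((HEAD_IP, port))
        print(f"{HEAD_IP}:{port} -> {'open' if result == 0 else 'unreachable'}")
```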

Can you give me this information:

Number of CPUs per node
What’s your workload?

@sangcho I am Sohail, and I am working with cancer clinical research data. We have high computation requirements, and I am trying to set up a local Ray cluster but am not able to. Could you pair program with me and help me set up a local cluster, please?

Sure. @sohail_4233, are you in the Ray Slack channel? Can you ping @sangcho there?

Hi @jalustig, I’m going to mark this as resolved. If you would like more guidance, feel free to respond here or open another question.

Hello, can you tell me if you were able to solve this issue? I am facing the same problem where my worker node is not picking up tasks, but it definitely shows up in the dashboard.

What’s your Ray cluster configuration (number of nodes, node size, CPU/GPU/memory requirements)?

======== Autoscaler status: 2024-08-06 12:15:10.592439 ========

Node status

Active:
1 node_e5be676c8993e860e462ae5eeeeee167f84c4328743d841e70755b8f
1 node_4a7d7a73c6ea21822e15fc92122eb465895d152fa0a084bfcb401187
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Total Usage:
3.0/16.0 CPU
0B/14.68GiB memory
55.87MiB/6.76GiB object_store_memory

Total Demands:
(no resource demands)

Node: e5be676c8993e860e462ae5eeeeee167f84c4328743d841e70755b8f (node_e5be676c8993e860e462ae5eeeeee167f84c4328743d841e70755b8f)
Usage:
3.0/8.0 CPU
0B/6.58GiB memory
55.87MiB/3.29GiB object_store_memory

Node: 4a7d7a73c6ea21822e15fc92122eb465895d152fa0a084bfcb401187 (node_4a7d7a73c6ea21822e15fc92122eb465895d152fa0a084bfcb401187)
Usage:
0.0/8.0 CPU
0B/8.10GiB memory
0B/3.47GiB object_store_memory

I get this status when I run ray status -v.
Now, when I run my script, only the first node shows any CPU usage. No CPU is used on the worker node.

The status shows that the head has 3 CPUs’ worth of workload and the worker is idle. I think Ray is biased toward scheduling as much of the workload as possible on the same node to avoid unnecessary network traffic. Try disabling task execution on the head node by passing --num-cpus=0 to ray start for the head, which forces the scheduler to use the worker. This may also surface additional issues with the configuration.
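A quick way to confirm where tasks actually land (a rough sketch, not your workload; each task just reports the IP of the node it ran on):

```python
import ray

ray.init(address="auto")  # connect to the running cluster from the head node

@ray.remote
def where_am_i():
    # Report which node this task was scheduled on.
    return ray.util.get_node_ip_address()

# Fire enough tasks to exceed the head's CPUs and count how many land on each node.
ips = ray.get([where_am_i.remote() for _ in range(32)])
print({ip: ips.count(ip) for ip in set(ips)})
```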
