Local Ray cluster won't send any tasks to worker node

I am experimenting with Ray and have set up a cluster on my LAN, connecting two laptops. Laptop A is the head node, and laptop B is a worker node.

They are connected correctly, as far as I can tell, and if you go to the dashboard, it lists both computers with 8 worker processes each (reflecting the 8 CPU cores in each machine).

However, when I run my Python script with ray.init(params), it exhibits strange behavior. Ray sends all of the tasks to laptop A (the head node), which is fully utilized, while laptop B (the worker node) sits completely idle. Nothing gets sent there.

I checked the logs, and in gcs_server.out I found that the worker processes seem to be exiting (note that the IP address here is the IP of the worker node):

[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = 878648274e71834811460c710aa974af98c0707478430210a9a6b288, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker 878648274e71834811460c710aa974af98c0707478430210a9a6b288 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:06,617 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:06,618 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:17,113 I 14831 5032034] (gcs_server) gcs_server.cc:188: GcsNodeManager: 
- RegisterNode request count: 3
- DrainNode request count: 1
- GetAllNodeInfo request count: 273
- GetInternalConfig request count: 5

What could be causing this kind of error? How can I get Ray to send tasks to the worker nodes?
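One way to confirm where tasks actually land is to run many short tasks and count them per hostname. The following is only a minimal sketch: the names `count_by_host`, `where_am_i`, and `report_task_placement` are hypothetical, and it assumes the script is started on a machine that is already part of the running cluster, so that `ray.init(address="auto")` attaches to it. It may also be worth ruling out that the driver is starting its own separate local Ray instance instead of joining the existing cluster, since the parameters passed to ray.init are not shown above.

```python
import socket
import time
from collections import Counter


def count_by_host(hostnames):
    """Return {hostname: number_of_tasks} for a list of hostnames."""
    return dict(Counter(hostnames))


def report_task_placement(num_tasks=200, address="auto"):
    """Run short tasks on an existing cluster and count them per host.

    Assumption: a Ray cluster is already running and reachable from
    the machine where this function is called.
    """
    import ray  # imported lazily so count_by_host works without Ray

    ray.init(address=address)

    @ray.remote
    def where_am_i():
        # Sleep briefly so the tasks overlap and the scheduler has a
        # reason to spread them beyond a single node.
        time.sleep(0.01)
        return socket.gethostname()

    hostnames = ray.get([where_am_i.remote() for _ in range(num_tasks)])
    placement = count_by_host(hostnames)
    for host, count in sorted(placement.items()):
        print(f"{host}: {count} tasks")
    return placement
```

Calling `report_task_placement()` from a driver script should print one line per machine that executed tasks. If only laptop A's hostname shows up, that matches the behavior described above.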

@sangcho / @Chen_Shen , perhaps you might know how to answer this one?

Do you have any logs for the workers on laptop B? Usually there will be some errors there that explain why the workers failed (and that seems to be the root cause in your case).
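To pull the relevant lines out of the worker logs quickly, something like the sketch below could be run on laptop B. It assumes Ray is using its default log location, /tmp/ray/session_latest/logs. The function name `find_error_lines` and the keyword list are illustrative choices only, not part of Ray.

```python
from pathlib import Path

# Assumption: Ray's default log directory; adjust if a custom temp dir is used.
DEFAULT_LOG_DIR = "/tmp/ray/session_latest/logs"
# Illustrative keywords; extend as needed.
KEYWORDS = ("Traceback", "Error", "error", "failed")


def find_error_lines(log_dir=DEFAULT_LOG_DIR, keywords=KEYWORDS, max_per_file=20):
    """Return {file_name: [matching lines]} for files directly in log_dir."""
    hits = {}
    for path in sorted(Path(log_dir).glob("*")):
        if not path.is_file():
            continue
        matched = []
        # errors="replace" avoids crashing on non-UTF-8 bytes in logs.
        with path.open(errors="replace") as handle:
            for line in handle:
                if any(keyword in line for keyword in keywords):
                    matched.append(line.rstrip())
                    if len(matched) >= max_per_file:
                        break
        if matched:
            hits[path.name] = matched
    return hits
```

Printing the result of `find_error_lines()` gives a short list of suspicious lines per log file, which is usually easier to paste into a thread than the full logs.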

https://docs.ray.io/en/latest/ray-core/configure.html#ray-ports

Did you make sure all ports are properly open?
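A quick way to test reachability between the two laptops is a plain TCP connection attempt. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and which ports to check depends on how the cluster was started (6379 is Ray's default head node port, but it may have been changed).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, running `port_is_open("<head-node-ip>", 6379)` on laptop B checks whether the head node's port is reachable from the worker. The same check in the other direction, from laptop A against the ports laptop B uses, shows whether a firewall is blocking traffic back to the worker.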

Can you give me the following information?

Num of cpus per node
What’s your workload?

@sangcho I am Sohail, and I am working on cancer clinical research data. We have high computation requirements, and I am trying to set up a local Ray cluster but have not been able to. Could you pair with me and help me set up a local cluster, please?

Sure. @sohail_4233 are you in a Ray slack channel? Can you ping @sangcho there?

Hi @jalustig, I’m going to mark this as resolved. If you would like more guidance, feel free to respond here or open another question.