Local Ray cluster won't send any tasks to node

I am experimenting with Ray and have set up a local cluster using two laptops. Laptop A is the head, and laptop B is a worker node.

I have found that the head node allocates all tasks to itself (laptop A), and laptop B (the worker node) does no work. In the dashboard, laptop B appears correctly attached to the cluster and lists 8 workers (corresponding to its 8 CPU cores). However, they are all idle.

In the logs, I saw the following messages:

[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = 878648274e71834811460c710aa974af98c0707478430210a9a6b288, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker 878648274e71834811460c710aa974af98c0707478430210a9a6b288 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:06,617 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:06,618 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:17,113 I 14831 5032034] (gcs_server) gcs_server.cc:188: GcsNodeManager: 
- RegisterNode request count: 3
- DrainNode request count: 1
- GetAllNodeInfo request count: 273
- GetInternalConfig request count: 5
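
To get a sense of how widespread these exits are, the "Reporting worker exit" lines can be tallied per address and exit type. This is only a minimal sketch that assumes the gcs_server log format shown above; the regex and the function name `count_worker_exits` are my own illustration, not part of Ray.

```python
import re
from collections import Counter

# Matches the "Reporting worker exit" warnings in the format pasted above.
# Assumption: other Ray versions may word this log line differently.
EXIT_RE = re.compile(
    r"Reporting worker exit, worker id = (?P<worker>[0-9a-f]+), "
    r"node id = (?P<node>[0-9a-f]+), address = (?P<addr>[\d.]+), "
    r"exit_type = (?P<exit_type>[A-Z_]+)"
)


def count_worker_exits(log_text):
    """Count reported worker exits, grouped by (address, exit_type)."""
    counts = Counter()
    for match in EXIT_RE.finditer(log_text):
        counts[(match.group("addr"), match.group("exit_type"))] += 1
    return counts
```

Passing the contents of the gcs_server log to `count_worker_exits` returns a `Counter` keyed by `(address, exit_type)`, which shows which machine the failing workers belong to.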

What could be causing the workers to exit like this? The cluster is supposed to distribute tasks across all of the workers.


I have the same issue. Are you still seeing it? I have no clue how to solve this, and I can't find any help on it.