Local Ray cluster won't send any tasks to worker node

jalustig · March 13, 2022, 3:10am

I am experimenting with Ray and have set up a local cluster using two laptops. Laptop A is the head, and laptop B is a worker node.

I have found that the head node is allocating all of the tasks to itself (laptop A), and laptop B (the worker node) is doing none of them. In the dashboard, I see that laptop B is correctly attached to the cluster and lists 8 workers (corresponding to its 8 CPU cores). However, they are all idle.
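One way to confirm where tasks actually run, independent of the dashboard, is to submit a batch of small tasks that each report the hostname they executed on. The following is only a sketch: `where_am_i`, `tasks_per_host`, and `probe_cluster` are hypothetical helper names, the task count and sleep time are arbitrary, and it assumes the script is run on a machine that has already joined the cluster so that `ray.init(address="auto")` can find it.

```python
import socket
import time
from collections import Counter


def where_am_i(_):
    """Return the hostname of the machine that executes this call."""
    # Sleep briefly so the tasks overlap; very short tasks may all be
    # scheduled on the node that submitted them.
    time.sleep(0.5)
    return socket.gethostname()


def tasks_per_host(hostnames):
    """Count how many tasks ran on each host."""
    return Counter(hostnames)


def probe_cluster(num_tasks=64):
    """Submit num_tasks probe tasks and report where each one ran."""
    import ray  # imported lazily so the helpers above work without Ray

    ray.init(address="auto")  # attach to the already-running cluster
    remote_fn = ray.remote(where_am_i)
    hosts = ray.get([remote_fn.remote(i) for i in range(num_tasks)])
    return tasks_per_host(hosts)
```

Calling `probe_cluster()` on the head node returns a `Counter` keyed by hostname. If only laptop A's hostname appears, the worker node really is receiving nothing, which matches what the dashboard shows.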

In the logs, I saw the following messages:

[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = 878648274e71834811460c710aa974af98c0707478430210a9a6b288, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker 878648274e71834811460c710aa974af98c0707478430210a9a6b288 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:06,617 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:06,618 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:17,113 I 14831 5032034] (gcs_server) gcs_server.cc:188: GcsNodeManager: 
- RegisterNode request count: 3
- DrainNode request count: 1
- GetAllNodeInfo request count: 273
- GetInternalConfig request count: 5

What would be causing the workers to exit in this manner? The cluster is supposed to distribute tasks across all of the workers.
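To see at a glance which machine the exiting workers belong to, the "Reporting worker exit" lines can be tallied per address and exit type. This is a minimal sketch using only the standard library. The regular expression is written against the log lines quoted above and may need adjusting for other Ray versions, and `summarize_worker_exits` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern for the gcs_worker_manager "Reporting worker exit" lines shown above.
EXIT_RE = re.compile(
    r"Reporting worker exit, worker id = (?P<worker>[0-9a-f]+), "
    r"node id = (?P<node>[0-9a-f]+), "
    r"address = (?P<address>[\d.]+), "
    r"exit_type = (?P<exit_type>[A-Z_]+)"
)


def summarize_worker_exits(lines):
    """Count reported worker exits per (address, exit_type) pair."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            counts[(match["address"], match["exit_type"])] += 1
    return counts


# Two lines copied from the log above; only the first is a worker-exit report.
sample = [
    "[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported.",
    "[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0",
]
exit_counts = summarize_worker_exits(sample)
for (address, exit_type), count in exit_counts.items():
    print(address, exit_type, count)
```

To run it over the full log, pass an open file object instead of `sample`. The GCS server log is usually under the Ray session directory on the head node (for example `/tmp/ray/session_latest/logs/gcs_server.out`, though that path is an assumption about a default install). Comparing the reported address with each laptop's IP shows which machine's workers are dying.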

sohail_4233 · May 22, 2022, 11:45am

I have the same issue. Are you still seeing it? I have no clue how to solve this, and I can't find any help regarding it.
