I am experimenting with Ray and have set up a cluster on my LAN, connecting two laptops. Laptop A is the head node, and laptop B is a worker node.
They are connected correctly, as far as I can tell, and if you go to the dashboard, it lists both computers with 8 worker processes each (reflecting the 8 CPU cores in each machine).
However, when I run my Python script with ray.init(params), it behaves strangely: Ray sends all of the tasks to laptop A (the head node), which is fully utilized, while laptop B (the worker node) sits completely idle. Nothing gets sent there.
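To make "nothing gets sent there" concrete, here is a minimal sketch of a placement check. It is not the original script. It assumes the cluster was started with `ray start` and that the script runs on the head node, so `ray.init(address="auto")` can attach to it. The names `where_am_i`, `summarize`, and the `--run` flag are made up for illustration.

```python
import socket
import sys
import time
from collections import Counter


def summarize(hosts):
    """Count how many tasks ran on each hostname."""
    return dict(Counter(hosts))


def main():
    # Imported here so the helper above can be used without Ray installed.
    import ray

    # Assumption: a cluster is already running; "auto" attaches to it
    # instead of starting a fresh single-node instance.
    ray.init(address="auto")

    @ray.remote
    def where_am_i():
        # Sleep briefly so tasks overlap and the scheduler has a reason
        # to place some of them on other nodes.
        time.sleep(0.5)
        return socket.gethostname()

    # What the cluster itself reports about each node.
    for node in ray.nodes():
        print(node["NodeManagerAddress"], node["Alive"], node["Resources"])

    # Launch more tasks than one machine has cores and see where they land.
    hosts = ray.get([where_am_i.remote() for _ in range(64)])
    print(summarize(hosts))


# Hypothetical flag: only contact the cluster when explicitly asked to.
if __name__ == "__main__" and "--run" in sys.argv:
    main()
```

If the printed summary contains only laptop A's hostname even though both nodes are listed as alive, that matches the behavior described above.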
I checked the logs, and in gcs_server.out I found that the worker processes seem to be exiting (note that the IP address here is the IP of the worker node):
[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,623 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker ef085802f93c789d953121ccb442b9fab59d2cbdd7c8932440fcd8af on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = 878648274e71834811460c710aa974af98c0707478430210a9a6b288, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:05,628 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker 878648274e71834811460c710aa974af98c0707478430210a9a6b288 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:06,617 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-03-12 20:38:06,618 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker badcc1527402efe18714fb166b320a6e5246b365207a955c2faa180c on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0
[2022-03-12 20:38:17,113 I 14831 5032034] (gcs_server) gcs_server.cc:188: GcsNodeManager:
- RegisterNode request count: 3
- DrainNode request count: 1
- GetAllNodeInfo request count: 273
- GetInternalConfig request count: 5
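To see how widespread these exits are, the log can be tallied by address and exit type. The following is a rough sketch that assumes the "Reporting worker exit" lines keep the format shown above. The function names and the regular expression are illustrative, and the log path is passed in by the caller (on a default install it is usually under /tmp/ray/session_latest/logs/, but that is an assumption about the setup).

```python
import re
from collections import Counter

# Matches the "Reporting worker exit" lines in the format pasted above.
EXIT_RE = re.compile(
    r"Reporting worker exit, worker id = (?P<worker>[0-9a-f]+), "
    r"node id = (?P<node>[0-9a-f]+), "
    r"address = (?P<addr>[\d.]+), "
    r"exit_type = (?P<exit_type>[A-Z_]+)"
)


def count_worker_exits(lines):
    """Return a Counter keyed by (address, exit_type)."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            counts[(match.group("addr"), match.group("exit_type"))] += 1
    return counts


def count_worker_exits_in_file(path):
    """Same tally, reading a gcs_server.out file given by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_worker_exits(handle)


# Two lines copied from the excerpt above; only the first is an exit report.
SAMPLE = [
    "[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_worker_manager.cc:37: Reporting worker exit, worker id = e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6, node id = 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c, address = 192.168.4.172, exit_type = SYSTEM_ERROR_EXIT0. Unintentional worker failures have been reported.",
    "[2022-03-12 20:38:05,609 W 14831 5032034] (gcs_server) gcs_actor_manager.cc:828: Worker e2d215e6bb40149a6636f62b61dddac334940b2c88b400d8e0f53fd6 on node 0b8582a5cb1972716c8750228e6c2491487772ead484b985080f684c exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0",
]

if __name__ == "__main__":
    for (addr, exit_type), n in count_worker_exits(SAMPLE).items():
        print(addr, exit_type, n)
```

If every exit is attributed to the worker node's address, that points at the workers on laptop B failing shortly after they start, rather than the scheduler ignoring the node.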
What could be causing this kind of error? How can I get Ray to send tasks to the worker nodes?