Hey guys! I am trying to use Ray to scale a computation up to a larger number of nodes on a cluster. All the Ray actors run for a while, and then one of them dies somehow, which causes the whole program to shut down. I am pretty sure the program is not running out of memory (I don’t see any OOM killer entries in dmesg).
(raylet, ip=172.29.58.146) [2022-10-05 02:57:21,559 E 279276 279315] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
(pid=gcs_server) [2022-10-05 02:57:22,411 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:23,414 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:24,419 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:25,422 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:26,426 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:27,439 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:28,437 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:29,442 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:30,446 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:31,450 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:32,454 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:33,458 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:34,460 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:35,464 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:36,469 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:37,473 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:38,477 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:39,484 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:40,488 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:41,492 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:42,498 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:43,502 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:44,510 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:45,515 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:46,519 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:47,524 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:48,530 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-10-05 02:57:49,534 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
2022-10-05 02:57:49,972 WARNING worker.py:1404 -- The node with node id: 199d89f99d2e6c7d08bdbb19c9fa50ca3907926fac2687fc76f626ea and address: 172.29.58.146 and node name: 172.29.58.146 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Traceback (most recent call last):
  File "mlm_training.py", line 12, in <module>
    head.train()
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/distributed_bolt.py", line 70, in train
    trainer.train(epoch, batch_id, self.learning_rate)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/trainer.py", line 51, in train
    self._calculate_gradients(batch_id)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/trainer.py", line 64, in _calculate_gradients
    ray.get(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1833, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Could anyone point me to where I should look to figure out what’s happening?