Ray Actor Dying unexpectedly

Hey guys! I am trying to use Ray to scale computation up to a larger number of nodes on a cluster. What happens is that all the Ray actors run for a while, and then one of them dies somehow, which causes the whole program to shut down. I am pretty sure the program is not running out of memory (I don't see any OOM killer log in dmesg).

(raylet, ip=172.29.58.146) [2022-10-05 02:57:21,559 E 279276 279315] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
(pid=gcs_server) [2022-10-05 02:57:22,411 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [... the same "Failed to get the resource load: GrpcUnavailable" message repeats once per second ...]
(pid=gcs_server) [2022-10-05 02:57:49,534 E 1570095 1570095] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
2022-10-05 02:57:49,972 WARNING worker.py:1404 -- The node with node id: 199d89f99d2e6c7d08bdbb19c9fa50ca3907926fac2687fc76f626ea and address: 172.29.58.146 and node name: 172.29.58.146 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Traceback (most recent call last):
  File "mlm_training.py", line 12, in <module>
    head.train()
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/distributed_bolt.py", line 70, in train
    trainer.train(epoch, batch_id, self.learning_rate)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/trainer.py", line 51, in train
    self._calculate_gradients(batch_id)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/trainer.py", line 64, in _calculate_gradients
    ray.get(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1833, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Could anyone point out where I should look to figure out what's happening?

Hi @pratkpranav, how many nodes/actors are you running in the cluster? You might have hit some Ray limitations.

In your particular case, it's very likely the GCS is overloaded. If so, there may be some relevant logs in /tmp/ray/session_latest/logs/gcs_server.out.

Hey! Thanks for the reply. I am running my workload on 64 nodes (one Ray actor per node), each with 4 vCPUs.

I looked at the benchmarks; the only limit we might be hitting is the object store. Right now, during communication each node puts about 4 GiB of data into the object store in total, but not all at once: the work proceeds in 64 cycles, and in each cycle each node puts roughly 70 MiB into the object store.
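
For context, here is a minimal sketch of that per-cycle pattern (the actor class, data sizes, and method names are made up for illustration; this is not the actual thirdai code):

import numpy as np
import ray

ray.init()  # or ray.init(address="auto") on the real cluster

@ray.remote
class Worker:
    def compute_gradients(self, batch_id):
        # Hypothetical stand-in for the real per-batch computation;
        # each call returns roughly 70 MiB of data.
        return np.random.rand(70 * 1024 * 1024 // 8)

workers = [Worker.remote() for _ in range(64)]

# 64 cycles; in each cycle every worker puts ~70 MiB into the object store.
for batch_id in range(64):
    grad_refs = [w.compute_gradients.remote(batch_id) for w in workers]
    grads = ray.get(grad_refs)  # ~64 x 70 MiB of objects in flight per cycle
    # ... aggregate gradients and broadcast updated parameters here ...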

I looked at gcs_server.out but am not able to find anything meaningful there.
I am adding all the logs from my last run here too, with the following configuration:

Setting RAY_BACKEND_LOG_LEVEL=debug (see the sketch just below this list).
I noticed I was using Ray version 1.13.0, so I also upgraded to 2.0.0.
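
For reference, a minimal sketch of how such a log level can be set for a single-node run (the assumption here is that the variable only needs to be in the environment of the Ray backend processes when they start):

import os
import ray

# Assumption: for a local/dev run, exporting the variable before ray.init()
# is enough, because init() launches the backend processes (raylet, GCS) as
# children that inherit this environment. On a multi-node cluster it has to
# be exported before `ray start` on every node instead.
os.environ["RAY_BACKEND_LOG_LEVEL"] = "debug"
ray.init()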

This is the error output I am getting now: Error Output - Pastebin.com
This is dashboard_agent.log (on 172.29.58.148, the node that failed): dashboard_agent.log on worker IP which failed - Pastebin.com
This is dashboard_agent.log (on the head node): dashboard_agent.log on head worker which failed - Pastebin.com
raylet.out output (on 172.29.58.148, the node that failed): raylet.out output on failed node - Pastebin.com
python-core-worker log (on 172.29.58.148, the node that failed): python core worker log - Pastebin.com
python-core-driver log (on the head node): https://pastebin.com/9RVJmHRC
gcs_server.out log: gcs_server output log - Pastebin.com

I am not able to figure out much from these logs. Where should I be looking? Could you please have a look at them too?

Hi @pratkpranav,

Looking at the logs, it seems what happened is that the agent process crashed and brought down the raylet process.

raylet.out:

[2022-10-05 10:30:47,880 W 288188 288227] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip 172.29.58.148. id 424238335
[2022-10-05 10:30:47,880 E 288188 288227] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
[2022-10-05 10:30:47,880 D 288188 288227] (raylet) logging.cc:323: Uninstall signal handlers.

agent.log:

2022-10-05 08:11:31,109	INFO runtime_env_agent.py:410 -- Runtime env already created successfully. Env: {"env_vars": {"OMP_NUM_THREADS": "4"}}, context: {"command_prefix": [], "env_vars": {"OMP_NUM_THREADS": "4"}, "py_executable": "/usr/bin/python3", "resources_dir": null, "container": {}, "java_jars": []}
2022-10-05 10:30:47,443	ERROR agent.py:217 -- Raylet is terminated: ip=172.29.58.148, id=d2a5c564a71c468ddacdab4a7f8e1c29c69b626acc3d5f1457731b28. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
...

So the sequence is: the agent first detected that the raylet was dead at 10:30:47,443. Interestingly, the raylet.out log on the crashed node starts at 10:30:47,818, which is after the agent detected the raylet failure.

I wonder if you have more logs, or other raylet-related files, on the node where the raylet failed?

Hi @Chen_Shen, Thanks for replying.

Sorry, but the logs from that run have already been deleted from the cluster. I was able to work around the issue with the following changes.

The initial warning about the node being marked dead seems to come from the default heartbeat timeout of 30 s. The warnings stopped as soon as I increased the timeout to 1000 s (as per [gcp] Node mistakenly marked dead: increase heartbeat timeout? · Issue #16945 · ray-project/ray · GitHub).
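
For reference, a rough sketch of how such a timeout can be raised (this assumes the num_heartbeats_timeout system config discussed in that issue, with the default 100 ms heartbeat period, so 10000 missed heartbeats ≈ 1000 s; the exact knob may differ between Ray versions):

import ray

# Assumption: _system_config only takes effect when this process starts the
# head node; on a cluster launched with `ray start`, the same system config
# has to be supplied when starting the head node instead.
ray.init(
    _system_config={
        "num_heartbeats_timeout": 10000,  # ~1000 s at the default 100 ms heartbeat period
    }
)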

Also, about the program ending abruptly: I had max_retries set to 2 for each actor, but I was manually deleting the reference to each object (the objects were big) using del obj_ref. As soon as I stopped doing that, the program no longer went down, although warnings did keep popping up from one node or another. My guess is that without a live object reference, object reconstruction during retries cannot happen.
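
To illustrate the change, here is a sketch with a hypothetical task rather than the actual training actors:

import ray

ray.init()

@ray.remote(max_retries=2)  # retries let Ray recompute results lost to a node failure
def compute_chunk(batch_id):
    # Hypothetical stand-in for the real per-batch work.
    return [batch_id] * 1_000

# What I was doing before: dropping the reference immediately after use.
# Once the ObjectRef is gone, Ray has no handle left for that object.
ref = compute_chunk.remote(0)
value = ray.get(ref)
del ref

# What I do now: keep the refs alive until the whole cycle is finished, so a
# retried task can reconstruct any object that was lost when a node died.
refs = [compute_chunk.remote(i) for i in range(4)]
values = ray.get(refs)
# The refs simply go out of scope at the end of the cycle.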

I still get this warning once or twice while running the program; maybe I should increase the warning timeout too?

Hey @Chen_Shen
It's really weird, but it looks like I am getting this issue again. I have added the logs from the worker that failed in my last run here: Dropbox - log_vm - Simplify your life

Hmm, yeah, this is pretty weird. The only guess I have is that the agent's raylet health-check logic has a flaw and it falsely concluded the raylet was dead when it wasn't.

Is there a way to repro this issue on our end?

This script here fails similarly: [Ray Core] Ray agent getting killed unexpectedly · Issue #29412 · ray-project/ray · GitHub

Not exactly the same, though. It doesn't give the warning; it just fails. Let me see if I can produce a closer reproduction script.