Worker gets killed unexpectedly

Hello, I have a Ray cluster that operates normally and auto-scales worker nodes correctly, per the rules in the YAML file used to launch the cluster. I would also like to “attach” additional worker nodes manually to the running cluster. For instance, I would like to attach my local machine to the cluster, so the cluster can deploy Serve endpoints or other workloads to my local machine alongside what runs on its auto-scaled nodes.
To do this, I run:
ray start --address=<head_node_ip>:6379 --block
The process starts off OK and the worker connects to the head node, but eventually (after 30 seconds or so) the worker stops/crashes. It is very difficult to find any information about this crash on the head node dashboard; the only information I get from the head node is:
Unexpected termination: health check failed due to missing too many heartbeats
If I turn on verbose output on the worker, I get the following crash information:

/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x20fcef) [0x58d64b79ecef] main
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x768433892d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x768433892e40] __libc_start_main
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x2731a7) [0x58d64b8021a7]
2025-08-07 06:53:51,676 INFO (runtime_env_agent) main.py:210 -- Raylet is dead! Exiting Runtime Env Agent. addr: <WORKER_IP>, port: 42490
_check_parent_via_pipe: The parent is dead.
2025-08-07 06:53:51,677 INFO (dashboard_agent) agent.py:211 -- Terminated Raylet: ip=<WORKER_IP>, node_id=e14d1dc3d205d02a130d137daf35b25e1b9a2a570ceddea6259d12dd. _check_parent_via_pipe: The parent is dead.
Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/process_watcher.py", line 83, in report_raylet_error_logs
    with open(log_path, "r", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'
Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
2025-08-07 06:53:51,677 ERROR (dashboard_agent) process_watcher.py:112 -- Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/process_watcher.py", line 83, in report_raylet_error_logs
    with open(log_path, "r", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'
2025-08-07 06:53:51,678 ERROR (dashboard_agent) process_watcher.py:115 -- Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
2025-08-07 06:53:51,680 INFO (runtime_env_agent) main.py:232 -- SystemExit! 0

If anyone could shed some light on this issue, I’d appreciate it.

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.48.0
  • Python version: 3.12.9
  • OS: linux ubuntu
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant): uv

3. What happened vs. what you expected:

  • Expected: the worker node registers with the cluster and sends heartbeats
  • Actual: the worker node begins the registration process but gets killed after 30-45 seconds

All ports / IP addresses are reachable from both machines; the security group is wide open:

0-65535 → 0.0.0.0/0

Hello!

The error “health check failed due to missing too many heartbeats” means the head node is not receiving heartbeats from your manually attached worker, so it marks the worker as dead and removes it from the cluster. This sometimes happens when manually attaching nodes to a Ray cluster managed by the autoscaler, especially if the node’s environment or configuration doesn’t match the autoscaler’s expectations, or if there are resource-contention or network issues.
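To see what the head node is deciding, you can watch the node’s state flip from ALIVE to DEAD and look at the GCS logs, since the GCS is the component that runs the health checks. A rough sketch, assuming the default /tmp/ray log location and the state CLI that ships with Ray 2.x:

# on the head node: check the state of the manually attached node
ray list nodes --detail

# the GCS log should record the missed health checks that led to the node being marked dead
grep -i "health check" /tmp/ray/session_latest/logs/gcs_server.out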

I recommend that you double-check that all firewalls (including OS-level, cloud security groups, and any group policies) allow all TCP traffic between all nodes, and that the correct Python executables are whitelisted. If using a virtual environment, ensure both the base and venv Python executables are allowed. This was the root cause in a nearly identical scenario, where the worker node was marked dead after 30 seconds due to missed heartbeats, and the logs were inaccessible until the correct firewall rules were set for all Python executables involved: Remote worker nodes only alive for 30 seconds - #4 by bananajoe182
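For example, on AWS with Ubuntu hosts, rules along these lines would open node-to-node TCP traffic. This is only a sketch; <SG_ID>, <HEAD_IP>, and <WORKER_IP> are placeholders, and your security-group and ufw setup may look different:

# AWS security group: allow all TCP from the other node (a self-referencing SG rule works too)
aws ec2 authorize-security-group-ingress --group-id <SG_ID> --protocol tcp --port 0-65535 --cidr <WORKER_IP>/32

# OS-level firewall on the worker (ufw): allow TCP from the head node
sudo ufw allow proto tcp from <HEAD_IP>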

According to the Ray cluster FAQ and the Ray on-prem cluster guide, you should also ensure the worker’s Ray version, Python version, and environment match the head node’s. Let me know if those two guides help.
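A quick way to check this is to print the versions on both machines and compare, e.g. (just a sketch):

# run on the head node and on the manually attached worker, then compare the output
ray --version        # should match exactly
python --version     # major.minor should match
pip freeze | grep -i "^ray"   # the installed Ray wheel should be identical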

If you’re still having this issue, please let me know and we can try to debug further! :slight_smile:

Everything spawned in my cluster (head node, auto-scaled workers, and manual workers) shares the exact same Docker image (everything runs in pre-configured rayproject/ray-gpu containers), so a difference in configuration is highly unlikely, unless the autoscaler magically configures something on the worker node that I am not able to detect.

For the ports, all communication that I can test works perfectly fine; everything is wide open (for debugging), so it should be fine. But it’s hard to debug the “random” ports the Ray head node opens to talk to the worker.
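One thing I might try is pinning the ports when starting the nodes, so I know exactly which ones to open instead of chasing the random ones. A rough sketch with made-up port numbers (the flags exist on ray start, but I’d double-check them against 2.48.0):

# on the manually attached worker: pin the raylet ports and the worker port range
ray start --address=<head_node_ip>:6379 --block \
  --node-manager-port=6380 \
  --object-manager-port=6381 \
  --min-worker-port=10002 \
  --max-worker-port=10999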

I’ll debug some more and report back.

more info:

[2025-08-07 14:32:39,172 D 83 83] (raylet) worker_pool.cc:1190: Idle workers: 0, idle workers that are eligible to kill: 0, num desired workers : 2
[2025-08-07 14:32:39,363 D 83 83] (raylet) subscriber.cc:377: Long polling request has been replied from 542229e7d1afcd3dd7c6664aed2778eda0a4530e354e268e307b9560
[2025-08-07 14:32:39,363 D 83 83] (raylet) subscriber.cc:358: Make a long polling request to 542229e7d1afcd3dd7c6664aed2778eda0a4530e354e268e307b9560
[2025-08-07 14:32:39,363 I 83 83] (raylet) accessor.cc:784: Received notification for node, IsAlive = 0 node_id=bc994716047879ac6b8154eec36b0b0cbcc35edd3c20162c152bf7b2
[2025-08-07 14:32:39,363 D 83 83] (raylet) node_manager.cc:808: [NodeRemoved] Received callback from node id  node_id=bc994716047879ac6b8154eec36b0b0cbcc35edd3c20162c152bf7b2
[2025-08-07 14:32:39,421 C 83 83] (raylet) node_manager.cc:821: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xea292a) [0x56ba7ceba92a] ray::operator<<()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xea4d19) [0x56ba7cebcd19] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x30878b) [0x56ba7c32078b] ray::raylet::NodeManager::NodeRemoved()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x61d735) [0x56ba7c635735] ray::gcs::NodeInfoAccessor::HandleNotification()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x7aad98) [0x56ba7c7c2d98] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x7a2e57) [0x56ba7c7bae57] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x7a3deb) [0x56ba7c7bbdeb] boost::asio::detail::executor_op<>::do_complete()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xe7ebab) [0x56ba7ce96bab] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xe81139) [0x56ba7ce99139] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xe81652) [0x56ba7ce99652] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x20fcef) [0x56ba7c227cef] main

Not sure why this is happening; the worker Docker container is not overloaded… Any ideas?

Hmmm… can you try this and let me know if it changes anything? Set the environment variables RAY_health_check_period_ms=500, RAY_health_check_timeout_ms=1000, and RAY_health_check_failure_threshold=10 before starting Ray. They mention it a bit in this comment, so I think it might be worth a shot: [Ray cluster] Worker node is disappearing after some seconds · Issue #45179 · ray-project/ray · GitHub
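Concretely, something like this (only a sketch; the health checks are run by the GCS on the head node, so the variables matter most there, but exporting them on the worker shouldn’t hurt):

# on the head node, before starting Ray
export RAY_health_check_period_ms=500
export RAY_health_check_timeout_ms=1000
export RAY_health_check_failure_threshold=10
ray start --head --port=6379

# on the manually attached worker
export RAY_health_check_failure_threshold=10
ray start --address=<head_node_ip>:6379 --block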

In the meantime, I will continue to dig and see if I can find anything else out for you!