Worker gets killed unexpectedly

Hello, I have a Ray cluster that operates normally and auto-scales worker nodes correctly, per the rules in the YAML file used to launch the cluster. I would also like to “attach” additional worker nodes manually to the running cluster. For instance, I would like to attach my local machine to the cluster, so the cluster can deploy Serve endpoints or other workloads to my local machine alongside what runs on its auto-scaled nodes.
To do this, I run:
ray start --address=<head_node_ip>:6379 --block
The process starts off OK and the worker connects to the head node, but eventually (after 30 seconds or so) the worker stops/crashes. It is very difficult to find any information about this crash on the head node dashboard; the only information I get from the head node is:
Unexpected termination: health check failed due to missing too many heartbeats
If I turn on verbose output on the worker, I get the following crash information:

/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x20fcef) [0x58d64b79ecef] main
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x768433892d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x768433892e40] __libc_start_main
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x2731a7) [0x58d64b8021a7]
2025-08-07 06:53:51,676 INFO (runtime_env_agent) main.py:210 -- Raylet is dead! Exiting Runtime Env Agent. addr: <WORKER_IP>, port: 42490
_check_parent_via_pipe: The parent is dead.
2025-08-07 06:53:51,677 INFO (dashboard_agent) agent.py:211 -- Terminated Raylet: ip=<WORKER_IP>, node_id=e14d1dc3d205d02a130d137daf35b25e1b9a2a570ceddea6259d12dd. _check_parent_via_pipe: The parent is dead.
Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/process_watcher.py", line 83, in report_raylet_error_logs
    with open(log_path, "r", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'
Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
2025-08-07 06:53:51,677 ERROR (dashboard_agent) process_watcher.py:112 -- Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/process_watcher.py", line 83, in report_raylet_error_logs
    with open(log_path, "r", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'
2025-08-07 06:53:51,678 ERROR (dashboard_agent) process_watcher.py:115 -- Raylet is terminated. Failed to read Raylet logs at /tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out: [Errno 2] No such file or directory: '/tmp/ray/session_2025-08-07_06-46-09_598628_232/logs/raylet.out'!
2025-08-07 06:53:51,680 INFO (runtime_env_agent) main.py:232 -- SystemExit! 0

If anyone could shed some light on this issue, I’d appreciate it.

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.48.0
  • Python version: 3.12.9
  • OS: linux ubuntu
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant): uv

3. What happened vs. what you expected:

  • Expected: the worker node registers with the cluster and sends heartbeats
  • Actual: the worker node begins the registration process but gets killed after 30-45 seconds

All ports / IP addresses are reachable from both machines; the security group is wide open:

0-65535 → 0.0.0.0/0

Hello!

The error “health check failed due to missing too many heartbeats” means the head node is not receiving heartbeats from your manually attached worker, so it marks the worker as dead and removes it from the cluster. This sometimes happens when manually attaching nodes to a Ray cluster managed by the autoscaler, especially if the node’s environment or configuration doesn’t match the autoscaler’s expectations, or if there are resource-contention or network issues.
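To see what the head node is deciding, you can watch the node’s state flip from ALIVE to DEAD and look at the GCS logs, since the GCS is the component that runs the health checks. A rough sketch, assuming the default /tmp/ray log location and the state CLI that ships with Ray 2.x:

# on the head node: check the state of the manually attached node
ray list nodes --detail

# the GCS log should record the missed health checks that led to the node being marked dead
grep -i "health check" /tmp/ray/session_latest/logs/gcs_server.out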

I recommend that you double-check that all firewalls (including OS-level, cloud security groups, and any group policies) allow all TCP traffic between all nodes, and that the correct Python executables are whitelisted. If using a virtual environment, ensure both the base and venv Python executables are allowed. This was the root cause in a nearly identical scenario, where the worker node was marked dead after 30 seconds due to missed heartbeats, and the logs were inaccessible until the correct firewall rules were set for all Python executables involved: Remote worker nodes only alive for 30 seconds - #4 by bananajoe182
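For example, on AWS with Ubuntu hosts, rules along these lines would open node-to-node TCP traffic. This is only a sketch; <SG_ID>, <HEAD_IP>, and <WORKER_IP> are placeholders, and your security-group and ufw setup may look different:

# AWS security group: allow all TCP from the other node (a self-referencing SG rule works too)
aws ec2 authorize-security-group-ingress --group-id <SG_ID> --protocol tcp --port 0-65535 --cidr <WORKER_IP>/32

# OS-level firewall on the worker (ufw): allow TCP from the head node
sudo ufw allow proto tcp from <HEAD_IP>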

According to the Ray cluster FAQ and the Ray on-prem cluster guide, you should also ensure the worker’s Ray version, Python version, and environment match the head node’s. Let me know if those two guides help.
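A quick way to check this is to print the versions on both machines and compare, e.g. (just a sketch):

# run on the head node and on the manually attached worker, then compare the output
ray --version        # should match exactly
python --version     # major.minor should match
pip freeze | grep -i "^ray"   # the installed Ray wheel should be identical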

If you’re still having this issue, please let me know and we can try to debug further! :slight_smile:

Everything spawned in my cluster (head node, auto-scaled workers, and manual workers) shares the exact same Docker image (everything runs in pre-configured rayproject/ray-gpu containers), so a difference in configuration is highly unlikely, unless the autoscaler magically configures something on the worker node that I am not able to detect.

For the ports, all communication that I can test works perfectly fine; everything is wide open (for debugging), so it should be fine. But it’s hard to debug the “random” ports the Ray head node opens to talk to the worker.
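One thing I might try is pinning the ports when starting the nodes, so I know exactly which ones to open instead of chasing the random ones. A rough sketch with made-up port numbers (the flags exist on ray start, but I’d double-check them against 2.48.0):

# on the manually attached worker: pin the raylet ports and the worker port range
ray start --address=<head_node_ip>:6379 --block \
  --node-manager-port=6380 \
  --object-manager-port=6381 \
  --min-worker-port=10002 \
  --max-worker-port=10999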

I’ll debug some more and report back.

more info:

[2025-08-07 14:32:39,172 D 83 83] (raylet) worker_pool.cc:1190: Idle workers: 0, idle workers that are eligible to kill: 0, num desired workers : 2
[2025-08-07 14:32:39,363 D 83 83] (raylet) subscriber.cc:377: Long polling request has been replied from 542229e7d1afcd3dd7c6664aed2778eda0a4530e354e268e307b9560
[2025-08-07 14:32:39,363 D 83 83] (raylet) subscriber.cc:358: Make a long polling request to 542229e7d1afcd3dd7c6664aed2778eda0a4530e354e268e307b9560
[2025-08-07 14:32:39,363 I 83 83] (raylet) accessor.cc:784: Received notification for node, IsAlive = 0 node_id=bc994716047879ac6b8154eec36b0b0cbcc35edd3c20162c152bf7b2
[2025-08-07 14:32:39,363 D 83 83] (raylet) node_manager.cc:808: [NodeRemoved] Received callback from node id  node_id=bc994716047879ac6b8154eec36b0b0cbcc35edd3c20162c152bf7b2
[2025-08-07 14:32:39,421 C 83 83] (raylet) node_manager.cc:821: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xea292a) [0x56ba7ceba92a] ray::operator<<()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xea4d19) [0x56ba7cebcd19] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x30878b) [0x56ba7c32078b] ray::raylet::NodeManager::NodeRemoved()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x61d735) [0x56ba7c635735] ray::gcs::NodeInfoAccessor::HandleNotification()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x7aad98) [0x56ba7c7c2d98] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x7a2e57) [0x56ba7c7bae57] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x7a3deb) [0x56ba7c7bbdeb] boost::asio::detail::executor_op<>::do_complete()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xe7ebab) [0x56ba7ce96bab] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xe81139) [0x56ba7ce99139] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0xe81652) [0x56ba7ce99652] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet(+0x20fcef) [0x56ba7c227cef] main

Not sure why this is happening; the worker Docker container is not overloaded… Any ideas?

Hmmm… can you try this and let me know if it changes anything? Set the environment variables RAY_health_check_period_ms=500, RAY_health_check_timeout_ms=1000, and RAY_health_check_failure_threshold=10 before starting Ray. They mention it a bit in this comment, so I think it might be worth a shot: [Ray cluster] Worker node is disappearing after some seconds · Issue #45179 · ray-project/ray · GitHub
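Concretely, something like this (only a sketch; the health checks are run by the GCS on the head node, so the variables matter most there, but exporting them on the worker shouldn’t hurt):

# on the head node, before starting Ray
export RAY_health_check_period_ms=500
export RAY_health_check_timeout_ms=1000
export RAY_health_check_failure_threshold=10
ray start --head --port=6379

# on the manually attached worker
export RAY_health_check_failure_threshold=10
ray start --address=<head_node_ip>:6379 --block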

In the meantime, I will continue to dig and see if I can find anything else out for you!