Worker gets killed unexpectedly

christina · August 7, 2025, 7:48pm

Hello!

The error “health check failed due to missing too many heartbeats” means the head node has stopped receiving heartbeats from your manually attached worker, so it marks the worker as dead and removes it from the cluster. This can happen when nodes are manually attached to a Ray cluster managed by the autoscaler, especially if the node’s environment or configuration doesn’t match what the autoscaler expects, or if there is resource contention or a network problem between the nodes.

I recommend double-checking that every firewall (OS-level, cloud security groups, and any group policies) allows TCP traffic between all nodes, and that the correct Python executables are whitelisted. If you’re using a virtual environment, make sure both the base and venv Python executables are allowed. This was the root cause in a nearly identical case, where the worker node was marked dead after 30 seconds due to missed heartbeats and the logs were inaccessible until firewall rules were set for all Python executables involved: Remote worker nodes only alive for 30 seconds - #4 by bananajoe182
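As a quick way to test the firewall theory, here is a minimal sketch you could run on the worker to see whether it can open TCP connections to the head node. The port numbers are Ray's usual defaults (6379 for the GCS, 8265 for the dashboard, 10001 for the Ray client server) and are an assumption about your setup; `can_connect` and `check_ports` are just illustrative helper names, not part of Ray.

```python
import socket
import sys

# Assumed defaults; change these if your cluster was started with other ports.
DEFAULT_PORTS = {
    "GCS (head)": 6379,
    "dashboard": 8265,
    "Ray client server": 10001,
}


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports=DEFAULT_PORTS, timeout=2.0):
    """Map each named port to whether it is reachable on host."""
    return {name: can_connect(host, port, timeout) for name, port in ports.items()}


# Usage: python check_ports.py <head-node-address>
if __name__ == "__main__" and len(sys.argv) > 1:
    for name, ok in check_ports(sys.argv[1]).items():
        print(f"{name}: {'reachable' if ok else 'blocked or closed'}")
```

Keep in mind this only covers a few well-known ports in one direction. Ray also uses additional ports on each node, some of which may be assigned dynamically, so a passing check here doesn’t prove the firewall is fully open, but a failing one is a strong hint.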

According to the Ray cluster FAQ and the Ray on-prem cluster guide, you should also make sure the worker’s Ray version, Python version, and environment match the head node’s. Let me know if those two guides help.
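To compare the two environments, something like the sketch below could be run on both the head and the worker, then diffed. It reads the installed Ray version from package metadata, so it doesn’t need a running cluster. The function names and the choice of fields are my own for illustration, not anything Ray provides.

```python
import json
import platform
import sys
from importlib import metadata


def environment_fingerprint():
    """Collect the versions that should match between head and worker."""
    try:
        ray_version = metadata.version("ray")
    except metadata.PackageNotFoundError:
        ray_version = None  # Ray is not installed for this interpreter
    return {
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "ray_version": ray_version,
    }


def mismatches(head, worker, keys=("python_version", "ray_version")):
    """Return {key: (head_value, worker_value)} for every key that differs."""
    return {
        k: (head.get(k), worker.get(k))
        for k in keys
        if head.get(k) != worker.get(k)
    }


if __name__ == "__main__":
    # Run on each node and compare the printed JSON.
    print(json.dumps(environment_fingerprint(), indent=2))
```

The `python_executable` and `in_virtualenv` fields are there because they tell you which interpreter paths the firewall rules need to cover when a venv is involved.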

If you’re still having this issue, please let me know and we can debug further!
