Hello!
The error “health check failed due to missing too many heartbeats” means the head node is not receiving heartbeats from your manually attached worker, so it marks the worker as dead and removes it from the cluster. This happens sometimes when manually attaching nodes to a Ray cluster managed by the autoscaler, especially if the node’s environment or configuration doesn’t match the autoscaler’s expectations, or if there are resource/contention/network/etc issues.
I recommend that you double-check that all firewalls (including OS-level, cloud security groups, and any group policies) allow all TCP traffic between all nodes, and that the correct Python executables are whitelisted. If using a virtual environment, ensure both the base and venv Python executables are allowed. This was the root cause in a nearly identical scenario, where the worker node was marked dead after 30 seconds due to missed heartbeats, and the logs were inaccessible until the correct firewall rules were set for all Python executables involved: Remote worker nodes only alive for 30 seconds - #4 by bananajoe182
According to Ray cluster FAQ, and Ray on-prem cluster guide, you should ensure the worker’s Ray version, Python version, and environment match the head node too. Let me know if those 2 guides help.
If you’re still having this issue please lmk and we can try to debug further! ![]()