I’m new to Ray, so I’m trying to wrap my head around issues with missing heartbeats. A lot of people seem to be reporting this, but the solutions I’ve found don’t work for me.
The first thing I did was set up the head node. It is a Windows 10 machine, with Ray running under Ubuntu in WSL. I start it with this command:
ray start --head --port=50000 --num-cpus=5
This is working fine. Next, I took another Windows 10 machine and set it up as a worker node, once again using Ubuntu in WSL. The reason I’m using Ubuntu is that Ray apparently doesn’t support multi-node clusters on Windows. Or maybe it does, but it could be buggy? Regardless, running on Ubuntu/WSL is working fine on these two machines, which I think shows the default heartbeat config is working.
Finally I wanted a third machine, so I installed a similar setup on my laptop. The difference is that I just recently updated to Windows 11. Other than that, I don’t see a difference. Once again, Ubuntu in WSL. This time, however, after about 30 seconds, the node dies with:
Unexpected termination: health check failed due to missing too many heartbeats
I noticed there’s a different network adapter on this machine compared to the other machines, something to do with a Hyper-V firewall, which Windows 11 has apparently introduced. I’ve tried all sorts of things to disable the firewall or allow connections through, but to no avail. It is letting some traffic through, since the node shows up in the dashboard as active for the first 30 seconds.
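To narrow down whether the firewall is blocking only some connections, a small TCP reachability check could be run from each machine against the other. The following is a minimal sketch, not something taken from the Ray docs: `can_connect` and `check_ports` are hypothetical helper names, and the host and ports are supplied on the command line. It only tests whether a TCP connection can be opened, and the only port I know for sure is the 50000 I passed to --port.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports, timeout: float = 3.0) -> dict:
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: can_connect(host, port, timeout) for port in ports}


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: python3 check_ports.py <host> <port> [<port> ...]")
    else:
        target = sys.argv[1]
        wanted = [int(p) for p in sys.argv[2:]]
        for port, ok in check_ports(target, wanted).items():
            print(f"{target}:{port} -> {'open' if ok else 'blocked or closed'}")
```

For example, from the laptop’s WSL shell this could be run as `python3 check_ports.py <head-ip> 50000` (the script name is just a placeholder). Running the same check from the head node against the laptop’s address, on whatever ports the worker is listening on, would show whether connections in the reverse direction get through, which is my guess at what the health check depends on.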
Since there was a RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1
option, I decided to try that in order to bypass WSL. That works fine. So, clearly I have some kind of issue related to WSL on this machine that doesn’t happen with the Windows 10 machines.
I’m wondering if anyone has any tips to configure my system? I’m assuming this is firewall related.
For the record, I’ve followed the Hyper-V firewall page on Microsoft Learn and run the following:
Set-NetFirewallHyperVVMSetting -Name '{40E0AC32-46A5-438A-A0B2-2B479E8F2E90}' -DefaultInboundAction Allow
and
Set-NetFirewallHyperVVMSetting -Name '{40E0AC32-46A5-438A-A0B2-2B479E8F2E90}' -Enabled False
Neither seems to help.