Ray (Tune) v2.8 - Instability with workers on GCP

denmarc · March 7, 2024, 2:59pm

I recently updated my code to depend on Ray v2.8, and ever since I have had difficulties running a tuning job on GCP. Sometimes all workers launch but the head node appears not to communicate with them, as they remain unused; other times, some workers launch while others don’t; most of the time, no workers launch at all.

As for exceptions, sometimes I get:

The node with node id: XXX and address: 10.138.0.16 and node name: 10.138.0.16 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
	(1) raylet crashes unexpectedly (OOM, preempted node, etc.)
	(2) raylet has lagging heartbeats due to slow network or busy workload.

Sometimes I get SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC or SSL: WRONG_VERSION_NUMBER, and other times a message stating that the worker node is running an older version of Ray: the one installed in the image I use as a template for the worker instances. As far as I understand the process, package files should be synced from the head node as soon as a worker node is launched, so I did not expect this. Which of these errors appears seems to be completely random.
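To narrow down the version mismatch, something like the following could be run on the head node and on each worker instance to compare what is actually installed. This is only a minimal sketch: the helper names `installed_version` and `versions_match` are hypothetical, and it reads the installed package metadata through the Python standard library rather than calling any Ray API.

```python
from importlib import metadata
from typing import Optional
import platform


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_match(head: Optional[str], worker: Optional[str]) -> bool:
    """Hypothetical helper: True only when both versions are known and identical."""
    return head is not None and head == worker


if __name__ == "__main__":
    # Run on every node and compare the printed lines by hand.
    print(
        f"host={platform.node()} "
        f"python={platform.python_version()} "
        f"ray={installed_version('ray')}"
    )
```

If a worker reports the version baked into the template image rather than the one on the head node, that would at least confirm that the sync step did not run on that instance.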

If I had to guess, I would say the issue relates to communication between the head and the worker nodes: that syncing seems to be skipped at times, which would explain the other issues.

When depending on v2.5, even though the first trial would always fail (then restart and run as expected), the tuning job would run and complete with no issues.

I know my complaint is quite generic, but I find it difficult to come up with a reproducible example, so I would like to know whether these are known and potentially solvable issues.
