Ray cluster Vertex AI: raylet has lagging heartbeats due to slow network or busy workload

jfecunha · October 14, 2024, 4:52pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hi,

I am using Ray on GCP Vertex AI. I created the Ray cluster with the Vertex AI Ray SDK, as described in the GCP docs. Because I depend on private repositories, I build a custom Docker image with:

Python 3.10
Ray 2.33
TensorFlow, etc.
private packages

The head and worker nodes are initialized from this Docker image, which is stored in GCP Artifact Registry.

When I launch a cluster with GPU nodes as shown below, the GPU node dies and an error is logged in events_event_GCS.log:

import ray
import vertex_ray
from google.cloud import aiplatform
from vertex_ray import Resources

# Define a default CPU cluster, machine_type is n1-standard-16, 1 head node and 1 worker node
head_node_type = Resources()
worker_node_types = [Resources()]

# Or define a GPU cluster.
head_node_type = Resources(
  machine_type="n1-standard-16",
  node_count=1,
  custom_image="MY_CUSTOM_IMAGE"
)

worker_node_types = [Resources(
  machine_type="n1-standard-16",
  node_count=2,  # Must be >= 1
  accelerator_type="NVIDIA_TESLA_T4",
  accelerator_count=1,
  custom_image="MY_CUSTOM_IMAGE",
)]

# Initialize Vertex AI to retrieve projects for downstream operations.
aiplatform.init()

# Create the Ray cluster on Vertex AI
CLUSTER_RESOURCE_NAME = vertex_ray.create_ray_cluster(
  head_node_type=head_node_type,
  network=NETWORK,  # Optional
  worker_node_types=worker_node_types,
  python_version="3.10",  # Optional
  ray_version="2.33",  # Optional
  cluster_name=CLUSTER_NAME,  # Optional
  service_account=SERVICE_ACCOUNT,  # Optional
  enable_metrics_collection=True,  # Optional. Enable metrics collection for monitoring.
  labels=LABELS,  # Optional.
)

After the node dies, the cluster tries to relaunch it a couple of times until it fails for good. I can see multiple entries of the following error:

severity: ERROR
source: GCS
hostname: gke-vertex-persistent-xxxxxxxxx
pid: 15
eventId: xxxxxxx
The node with node id: XXXX and address: XXXX and node name: XXXXX has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
    (1) raylet crashes unexpectedly (OOM, etc.)
    (2) raylet has lagging heartbeats due to slow network or busy workload

I have already tried several GCP machine types and different GPU devices, without success.

Does anyone have a hint about what is going on?

Thanks,

Sam_Chan · October 16, 2024, 11:34pm

Does the same thing occur with the default ray version that Vertex ships with?

jfecunha · October 17, 2024, 9:20am

Not really. If I use the default Vertex AI images, it works.
I was also able to work around the error by using the same Docker image on both the head and worker nodes.
I thought that was not necessary, because I don’t need a GPU-based image on the head node, only on the worker nodes where I intend to run distributed training.

Do you have any hint as to why this happens?

In the CPU image, I start from python:3.10-slim and install ray 2.33 with poetry.
In the GPU image, I start from nvidia/cuda:11.8.0-runtime-ubuntu22.04 and install ray 2.33 with poetry as well.
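
Since the two images are built from different bases, one thing that may be worth checking is whether both end up with exactly the same Python and Ray versions. This is an assumption about where to look, not a confirmed cause of the dead-node error. Below is a minimal sketch using only the standard library; `env_fingerprint` and `diff_fingerprints` are hypothetical helper names, not part of Ray or the Vertex AI SDK. The idea is to run it inside each image and compare the output:

```python
import json
import platform
from importlib import metadata


def env_fingerprint(packages=("ray",)):
    """Collect the Python version and the installed versions of the given packages."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this image
    return info


def diff_fingerprints(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Print one JSON line per image so the two outputs are easy to compare.
    print(json.dumps(env_fingerprint(), sort_keys=True))
```

Loading the JSON printed by the CPU image and the GPU image and passing both dicts to `diff_fingerprints` shows any mismatch at a glance. An empty result means the listed versions agree.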
