Ray cluster Vertex AI: raylet has lagging heartbeats due to slow network or busy workload

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi,

I am using Ray on GCP Vertex AI. I created the Ray cluster using the Vertex AI Ray SDK, as described in the GCP docs. Since I have private repositories, I build a custom Docker image with:

  • Python 3.10
  • Ray 2.33
  • TensorFlow, etc.
  • private packages

The head and worker nodes are initialized using this docker image that is saved in GCP Artifact Registry.

When I launch a cluster with GPU nodes as shown below, the GPU node dies, and the error quoted after the code appears in events_event_GCS.log:

import ray
import vertex_ray
from google.cloud import aiplatform
from vertex_ray import Resources

# Define a default CPU cluster, machine_type is n1-standard-16, 1 head node and 1 worker node
head_node_type = Resources()
worker_node_types = [Resources()]

# Or define a GPU cluster.
head_node_type = Resources(
  machine_type="n1-standard-16",
  node_count=1,
  custom_image="MY_CUSTOM_IMAGE"
)

worker_node_types = [Resources(
  machine_type="n1-standard-16",
  node_count=2,  # Must be >= 1
  accelerator_type="NVIDIA_TESLA_T4",
  accelerator_count=1,
  custom_image="MY_CUSTOM_IMAGE", 
)]

# Initialize Vertex AI to retrieve projects for downstream operations.
aiplatform.init()

# Create the Ray cluster on Vertex AI.
# NETWORK, CLUSTER_NAME, SERVICE_ACCOUNT, and LABELS are placeholders to be defined by the user.
CLUSTER_RESOURCE_NAME = vertex_ray.create_ray_cluster(
  head_node_type=head_node_type,
  network=NETWORK,  # Optional
  worker_node_types=worker_node_types,
  python_version="3.10",  # Optional
  ray_version="2.33",  # Optional
  cluster_name=CLUSTER_NAME,  # Optional
  service_account=SERVICE_ACCOUNT,  # Optional
  enable_metrics_collection=True,  # Optional. Enable metrics collection for monitoring.
  labels=LABELS,  # Optional
)

After the node dies, the cluster tries to relaunch it a couple of times before giving up completely. I can see multiple entries of the following error:

severity: ERROR
source: GCS
hostname: gke-vertex-persistent-xxxxxxxxx
pid: 15
eventId: xxxxxxx
The node with node id: XXXX and address: XXXX and node name: XXXXX has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
    (1) raylet crashes unexpectedly (OOM, etc.)
    (2) raylet has lagging heartbeats due to slow network or busy workload

I have already tried several GCP machine types and different GPU devices, without success.

Does anyone have a hint about what is going on?

Thanks,

Does the same thing occur with the default ray version that Vertex ships with?

Not really. If I use the default Vertex AI images, it works.
I was also able to work around the error by using the same Docker image on both the head and worker nodes.
I thought that was not necessary because I don’t need a GPU-based image on the head node, only on the worker nodes where I intend to run distributed training.

Do you have any hint as to why this happens?

In the CPU image, I start from python:3.10-slim and install Ray 2.33 with Poetry.
In the GPU image, I start from nvidia/cuda:11.8.0-runtime-ubuntu22.04 and install Ray 2.33 with Poetry as well.
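
Since the two images start from different bases, one thing that may be worth checking (an assumption, not a confirmed cause) is whether the Python patch version or some dependency differs between the head and worker images, since Ray generally expects matching Ray and Python versions on every node. Below is a minimal sketch of a script that could be run inside each container to print a version fingerprint, plus a helper to diff two fingerprints. The function names are hypothetical, and the extra packages listed (grpcio, protobuf) are only illustrative choices, not a known list of culprits.

```python
import json
import platform
import sys
from importlib import metadata


def environment_fingerprint(packages=("ray",)):
    """Collect the interpreter and package versions for the current image."""
    info = {
        "python": platform.python_version(),
        "python_major_minor": f"{sys.version_info.major}.{sys.version_info.minor}",
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this image
    return info


def diff_fingerprints(head, worker):
    """Return {key: (head_value, worker_value)} for every key that differs."""
    keys = sorted(set(head) | set(worker))
    return {
        key: (head.get(key), worker.get(key))
        for key in keys
        if head.get(key) != worker.get(key)
    }


if __name__ == "__main__":
    # Run this inside each container, then compare the two JSON outputs.
    fingerprint = environment_fingerprint(("ray", "grpcio", "protobuf"))
    print(json.dumps(fingerprint, sort_keys=True))
```

Running it in the CPU image and in the GPU image and passing the two resulting dictionaries to diff_fingerprints would show any version that differs between them; an empty result would suggest looking elsewhere (for example, at memory usage on the worker).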