How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi,
I am using Ray on GCP Vertex AI. I created the Ray cluster using the Vertex Ray SDK, as described in the GCP docs. Since I have private repositories, I build a custom Docker image with:
- Python 3.10
- Ray 2.33
- TensorFlow, etc.
- private packages
Both the head and worker nodes are initialized from this Docker image, which is stored in GCP Artifact Registry.
When I launch a cluster with GPU nodes using the code below, the GPU node dies and an error is logged in events_event_GCS.log:
import ray
import vertex_ray
from google.cloud import aiplatform
from vertex_ray import Resources

# Define a default CPU cluster, machine_type is n1-standard-16, 1 head node and 1 worker node
head_node_type = Resources()
worker_node_types = [Resources()]

# Or define a GPU cluster.
head_node_type = Resources(
    machine_type="n1-standard-16",
    node_count=1,
    custom_image="MY_CUSTOM_IMAGE",
)
worker_node_types = [Resources(
    machine_type="n1-standard-16",
    node_count=2,  # Must be >= 1
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    custom_image="MY_CUSTOM_IMAGE",
)]

# Initialize Vertex AI to retrieve projects for downstream operations.
aiplatform.init()

# Create the Ray cluster on Vertex AI
CLUSTER_RESOURCE_NAME = vertex_ray.create_ray_cluster(
    head_node_type=head_node_type,
    network=NETWORK,  # Optional
    worker_node_types=worker_node_types,
    python_version="3.10",  # Optional
    ray_version="2.33",  # Optional
    cluster_name=CLUSTER_NAME,  # Optional
    service_account=SERVICE_ACCOUNT,  # Optional
    enable_metrics_collection=True,  # Optional. Enable metrics collection for monitoring.
    labels=LABELS,  # Optional.
)
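To see which nodes have registered with the cluster and whether the GPU workers report a GPU resource, something like the following could be run from a driver connected to the cluster. This is only a sketch: `summarize_nodes` is a hypothetical helper, and it assumes the list-of-dicts shape returned by `ray.nodes()` (keys `NodeID`, `Alive`, `NodeManagerAddress`, `Resources`).

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reduce ray.nodes()-style records to liveness and GPU count per node."""
    summary = []
    for node in nodes:
        # Dead nodes may report an empty or missing resource map.
        resources = node.get("Resources") or {}
        summary.append({
            "node_id": node.get("NodeID"),
            "address": node.get("NodeManagerAddress"),
            "alive": bool(node.get("Alive")),
            "gpus": resources.get("GPU", 0),
        })
    return summary


# Usage on the cluster (assumes Ray is installed and the cluster is reachable):
#   import ray
#   ray.init(address="auto")
#   for row in summarize_nodes(ray.nodes()):
#       print(row)
```

A worker that shows `alive: True` with `gpus: 0` would point to the GPU not being detected inside the image, while a worker that never appears or flips to `alive: False` matches the dead-node behaviour described here.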
After the node dies, the cluster tries to relaunch it a couple of times before giving up. I can see multiple occurrences of the following error:
severity: ERROR
source: GCS
hostname: gke-vertex-persistent-xxxxxxxxx
pid: 15
eventId: xxxxxxx
The node with node id: XXXX and address: XXXX and node name: XXXXX has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
(1) raylet crashes unexpectedly (OOM, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload
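To quantify how often each node is marked dead, the log could be scanned for this message. A minimal sketch, assuming each event keeps the "node id: ... and address: ..." wording shown above on a single line; `count_dead_node_events` is a hypothetical helper and the log path in the usage comment is a placeholder.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the GCS "marked dead" message and captures the node id and address.
DEAD_NODE_RE = re.compile(
    r"node id: (?P<node_id>\S+) and address: (?P<address>\S+).*?has been marked dead"
)


def count_dead_node_events(lines: Iterable[str]) -> Counter:
    """Count 'marked dead' events per (node_id, address) pair."""
    counts: Counter = Counter()
    for line in lines:
        match = DEAD_NODE_RE.search(line)
        if match:
            counts[(match.group("node_id"), match.group("address"))] += 1
    return counts


# Usage (the path is a placeholder for wherever the log file is stored):
#   with open("events_event_GCS.log") as f:
#       for (node_id, address), n in count_dead_node_events(f).items():
#           print(node_id, address, n)
```

If the same address shows up repeatedly with different node ids, the same machine is being restarted each time, which would suggest a problem on the node itself rather than a one-off network glitch.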
I have already tried several GCP machine types and different GPU types, without success.
Does anyone have a hint about what is going on?
Thanks,