Replicas can't connect to GPUs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

What happens

  • I start up a Ray cluster with a GPU attached to each node.
  • Following the Ray documentation, I install the NVIDIA Container Toolkit by adding the required commands to initialization_commands in my config.yaml.
  • I submit my FastAPI script with the Ray CLI, but it reports that no GPU is available.
config.yaml
-----------
cluster_name: minimal
max_workers: 4
upscaling_speed: 1.0
docker:
  image: "rayproject/ray-ml:latest-py38-gpu"
  container_name: "ray_container"
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536

idle_timeout_minutes: 5

provider:
    type: gcp
    region: europe-west1
    availability_zone: europe-west1-b
    project_id: bert-training-test

auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"GPU": 1, "CPU": 8}
        node_config:
            machineType: n1-highmem-8
            guestAccelerators:
              - acceleratorType: "nvidia-tesla-t4"
                acceleratorCount: 1
            scheduling:
              onHostMaintenance: TERMINATE
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/ml-images/global/images/c2-deeplearning-pytorch-1-12-cu113-v20220701-debian-10

    ray_worker_small:
        min_workers: 0
        max_workers: 2
        resources: {"GPU": 1, "CPU": 8}
        node_config:
            machineType: n1-highmem-8
            guestAccelerators:
              - acceleratorType: "nvidia-tesla-t4"
                acceleratorCount: 1
            scheduling:
              onHostMaintenance: TERMINATE
              preemptible: true
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/ml-images/global/images/c2-deeplearning-pytorch-1-12-cu113-v20220701-debian-10

head_node_type: ray_head_default

file_mounts: {
  "/entity-level-risk": "/Users/ljbails/Repositories/entity-level-risk"
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: [
  "sudo /opt/deeplearning/install-driver.sh",
  "distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list",
  'grep -l "nvidia.github.io" /etc/apt/sources.list.d/* | grep -vE "/nvidia-container-toolkit.list\$" | xargs sudo rm -rf',
  "sudo apt-get update",
  "sudo apt-get install -y nvidia-docker2",
  "sudo systemctl restart docker"
]

setup_commands: [
  "export CPPFLAGS='-std=c++98'",
  "cd /entity-level-risk && python -m pip install -e '.[deploy]'",
]


head_setup_commands:
  - pip install google-api-python-client==1.7.8

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
my-app.py
-----------

import ray
from ray import serve
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from transformers import AutoModelForTokenClassification, AutoTokenizer
from pydantic import BaseModel
import torch
from torch.nn import functional as F
import pandas as pd
import numpy as np


print("######## DEVICE ########")
print("cuda:0" if torch.cuda.is_available() else "cpu")

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

ray.init(address="auto", namespace="serve")
serve.start(detached=True, http_options={"host": "0.0.0.0"})

@serve.deployment(route_prefix="/nrer",
                  num_replicas=2,
                  ray_actor_options={"num_gpus": 1, "num_cpus": 6})
@serve.ingress(app)
class NRERDeployment:

    def __init__(self):
        ...


NRERDeployment.deploy()

bash
-------

>> ray up config.yaml
>> ray submit config.yaml my-app.py

2022-08-09 10:36:17,569 INFO util.py:335 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
/Users/ljbails/.pyenv/versions/3.9.11/envs/elr/lib/python3.9/site-packages/google/auth/_default.py:81: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
Fetched IP: 104.155.93.233
Shared connection to 104.155.93.233 closed.
Shared connection to 104.155.93.233 closed.
2022-08-09 10:36:26,079 INFO util.py:335 -- setting max workers for head node type to 0
Fetched IP: 104.155.93.233
Shared connection to 104.155.93.233 closed.
######## DEVICE ########
cuda:0     
(ServeController pid=775) INFO 2022-08-09 02:36:36,171 controller 775 checkpoint_path.py:17 - Using RayInternalKVStore for controller checkpoint and recovery.
(ServeController pid=775) INFO 2022-08-09 02:36:36,274 controller 775 http_state.py:112 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:10.132.0.18-0' on node 'node:10.132.0.18-0' listening on '0.0.0.0:8000'
(HTTPProxyActor pid=807) INFO:     Started server process [807]
(ServeController pid=775) INFO 2022-08-09 02:36:40,433 controller 775 deployment_state.py:1216 - Adding 2 replicas to deployment 'NRERDeployment'.
(scheduler +13s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +13s) Adding 1 nodes of type ray_worker_small.
(ServeController pid=775) WARNING 2022-08-09 02:37:10,467 controller 775 deployment_state.py:1453 - Deployment 'NRERDeployment' has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {'CPU': 6, 'GPU': 1}, resources available: {'CPU': 2.0}.
...

Note that torch.cuda.is_available() returned True, but the controller reports resources available: {'CPU': 2.0}.
When I change num_replicas from 2 to 1, it works fine.
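
For reference, this is roughly the only change I make when it works (everything else in my-app.py stays the same):

@serve.deployment(route_prefix="/nrer",
                  num_replicas=1,  # dropping from 2 to 1 lets the replica get scheduled
                  ray_actor_options={"num_gpus": 1, "num_cpus": 6})
@serve.ingress(app)
class NRERDeployment:

    def __init__(self):
        ...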

Any idea what I’m doing wrong?

It looks like your cluster probably only has 1 node with 8 CPUs and 1 GPU. Each replica requests 6 CPUs and 1 GPU, so after one replica is scheduled only 2 CPUs (and no GPU) remain for the second. Any chance that might be the case?
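
If it helps, a rough way to double-check what Ray thinks it has is to print the cluster resources from any driver connected to the cluster (just a sketch):

import ray

ray.init(address="auto")
# Total resources registered with the scheduler; a GPU worker that has
# joined the cluster should push the GPU count above 1.0.
print(ray.cluster_resources())
# Resources currently unclaimed by tasks and actors.
print(ray.available_resources())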

I had a look at the Compute Engine UI for my GCP project and it was showing two instances. One was called ray-minimal-head and the other ray-minimal-worker (or something to that effect).

I was able to SSH into both of those instances, so they are both functional. However, on the first SSH into the worker node, it asked if I wanted to install the NVIDIA driver (which it usually does when using one of the DL images on GCP). Do you think perhaps this is getting in the way of Ray correctly setting up the worker node(s)?

The head node did not prompt me to install the NVIDIA driver.

P.S. I assume initialization_commands is meant to run on each node?

Initialization commands do indeed run on each node.
Would you mind doing the following for additional debug info:

After spinning up the cluster, could you start a shell on the head node (ray attach config.yaml) and then execute ray status?
This will display the Ray scheduler’s view of each node’s resource capacity. I’d like to see whether that indicates the presence of GPUs.

======== Autoscaler status: 2022-08-09 12:32:34.463860 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
Pending:
 10.132.0.27: ray_worker_small, waiting-for-ssh
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 6.0/8.0 CPU
 1.0/1.0 GPU
 0.0/1.0 accelerator_type:T4
 0.00/30.053 GiB memory
 0.00/15.026 GiB object_store_memory

Demands:
 {'CPU': 6.0, 'GPU': 1.0}: 1+ pending tasks/actors

Looking at the Compute Engine UI, the worker node appears to be up, but the controller keeps logging:

(ServeController pid=924) WARNING 2022-08-09 12:38:46,251 controller 924 deployment_state.py:1453 - Deployment 'NRERDeployment' has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {'CPU': 6, 'GPU': 1}, resources available: {'CPU': 2.0}.

Thanks for that detail – what we are seeing is that the worker instance is being provisioned, but for some reason the head node is hanging while trying to establish an SSH connection to it.

The next step is to take a look at the contents of the files
/tmp/ray/session_latest/logs/monitor.*

You can either get at those logs by SSHing into the head node and opening the files, or by tailing them with ray monitor config.yaml.

>> ray monitor config.yaml

==> /tmp/ray/session_latest/logs/monitor.log <==
 0.0/8.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:T4
 0.00/30.021 GiB memory
 0.00/15.011 GiB object_store_memory

Demands:
 (no resource demands)
2022-08-10 00:33:40,252 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/bert-training-test/zones/europe-west1-b/instances?filter=%28%28status+%3D+PROVISIONING%29+OR+%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29%29+AND+%28labels.ray-cluster-name+%3D+minimal%29&alt=json
2022-08-10 00:33:40,333 INFO autoscaler.py:330 -- 
======== Autoscaler status: 2022-08-10 00:33:40.333000 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
Pending:
 10.132.0.31: ray_worker_small, waiting-for-ssh
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:T4
 0.00/30.021 GiB memory
 0.00/15.011 GiB object_store_memory

Demands:
 (no resource demands)
... (the same autoscaler status block repeats every ~5 seconds, with the worker node still stuck in waiting-for-ssh) ...

==> /tmp/ray/session_latest/logs/monitor.out <==
2022-08-10 00:30:14,745 INFO updater.py:325 -- New status: waiting-for-ssh
2022-08-10 00:30:14,746 INFO updater.py:262 -- [1/7] Waiting for SSH to become available
2022-08-10 00:30:14,746 INFO updater.py:267 -- Running `uptime` as a test.
2022-08-10 00:30:14,746 INFO command_runner.py:394 -- Fetched IP: 10.132.0.31
2022-08-10 00:30:14,746 INFO log_timer.py:25 -- NodeUpdater: ray-minimal-worker-071e5405-compute: Got IP  [LogTimer=0ms]
2022-08-10 00:30:14,747 VINFO command_runner.py:552 -- Running `uptime`
2022-08-10 00:30:14,747 VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/dc43e863c1/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@10.132.0.31 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2022-08-10 00:30:17,800 INFO updater.py:313 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
... (the uptime check keeps failing and the updater retries every 5 seconds) ...

This VM requires Nvidia drivers to function correctly.   Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n]

I managed to get a CPU-only example going with 3 worker nodes after changing the host image to one that doesn't have the NVIDIA driver install prompt on first SSH into each node. So perhaps this prompt is what causes the hanging.

I managed to get a GPU-enabled example going with 3 worker nodes. The NVIDIA drivers were installed in initialization_commands instead of with the automatic prompt on the first SSH.

Ok, I see – it was the NVIDIA installation prompt that caused the SSH command to hang.

Just to clarify, is it right that you were able to resolve the issue by specifying a command to install NVIDIA drivers in the initialization_commands?

Yes, I briefly got a multi-node GPU-enabled cluster going. It seems GCP's CUDA-enabled host images, like projects/ml-images/global/images/c2-deeplearning-pytorch-1-12-cu113-v20220701-debian-10, cause the node setup to hang because of that interactive driver-install prompt on first SSH.

I switched to a host image that does not have the prompt; instead, my initialization_commands section installs the required NVIDIA drivers and the NVIDIA Container Toolkit:

initialization_commands: [
  "sudo /tmp/ray_tmp_mount/<name_of_cluster>/<mounted_disk>/install-driver.sh",
  "distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --yes --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list",
  'grep -l "nvidia.github.io" /etc/apt/sources.list.d/* | grep -vE "/nvidia-container-toolkit.list\$" | xargs sudo rm -rf',
  "sudo apt-get update",
  "sudo apt-get install -y nvidia-docker2",
  "sudo systemctl restart docker",
]
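
Something like this quick sanity check (just a sketch, the file name is made up) can be submitted the same way as the app to confirm the workers actually expose their GPUs to Ray:

# check_gpus.py  (hypothetical helper, submitted with: ray submit config.yaml check_gpus.py)
import ray
import torch


@ray.remote(num_gpus=1)
def gpu_check():
    # Runs on any node that advertises a GPU and reports what the worker sees.
    return ray.get_gpu_ids(), torch.cuda.is_available()


ray.init(address="auto")
# One task per GPU node expected in the cluster (head + one worker here).
print(ray.get([gpu_check.remote() for _ in range(2)]))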

I have since started running into a recurring worker-node startup failure, but perhaps that's a GCP issue. It seems like it just does not like setting up worker nodes.

==> /tmp/ray/session_latest/logs/monitor.out <==
2022-08-11 01:23:42,997 ERR updater.py:159 -- New status: update-failed
2022-08-11 01:23:43,003 ERR updater.py:161 -- !!!
2022-08-11 01:23:43,008 VERR updater.py:169 -- {'message': 'SSH command failed.'}
2022-08-11 01:23:43,015 ERR updater.py:171 -- SSH command failed.
2022-08-11 01:23:43,016 ERR updater.py:173 -- !!!