No GPUs available when using slurm-template.sh to launch SLURM Ray cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am using the slurm-template.sh and launch.py scripts to run on nodes with GPUs, but Ray and PyTorch are unable to see all of the GPUs. I am not sure what I am doing wrong; if someone could point out the mistake I am making, I would be very grateful.
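
For context, this is roughly how I submit the job. The flag names below are the ones I recall from the standard slurm-launch.py, and the experiment name, script name, and partition are placeholders for my cluster, so please treat this as a sketch of my invocation rather than the exact command:

# Illustrative submission (flag names assumed from the standard slurm-launch.py;
# experiment name, script name, and partition are placeholders for my cluster).
python launch.py \
  --exp-name gpu_visibility_test \
  --num-nodes 2 \
  --partition <gpu_partition> \
  --load-env "conda activate <my_env>" \
  --command "python test_ray_gpus.py"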

This is the output of my script:

IP Head: 128.223.192.150:6379
STARTING HEAD at n0150
2024-06-08 07:09:49,200 INFO usage_lib.py:449 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable t$
2024-06-08 07:09:49,200 INFO scripts.py:744 -- Local node IP: 128.223.192.150
2024-06-08 07:09:51,160 SUCC scripts.py:781 -- --------------------
2024-06-08 07:09:51,161 SUCC scripts.py:782 -- Ray runtime started.
2024-06-08 07:09:51,161 SUCC scripts.py:783 -- --------------------
2024-06-08 07:09:51,161 INFO scripts.py:785 -- Next steps
2024-06-08 07:09:51,161 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-06-08 07:09:51,161 INFO scripts.py:791 --   ray start --address='128.223.192.150:6379'
2024-06-08 07:09:51,161 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-06-08 07:09:51,161 INFO scripts.py:802 -- import ray
2024-06-08 07:09:51,161 INFO scripts.py:803 -- ray.init(_node_ip_address='128.223.192.150')
2024-06-08 07:09:51,161 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-06-08 07:09:51,161 INFO scripts.py:835 --   ray stop
2024-06-08 07:09:51,161 INFO scripts.py:838 -- To view the status of the cluster, use
2024-06-08 07:09:51,161 INFO scripts.py:839 --   ray status
2024-06-08 07:09:51,161 INFO scripts.py:952 -- --block
2024-06-08 07:09:51,161 INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-06-08 07:09:51,161 INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be$
STARTING WORKER 1 at n0151
[2024-06-08 07:10:18,149 I 2890471 2890471] global_state_accessor.cc:432: This node has an IP address of 128.223.192.151, but we cannot find a local Raylet with the same address. This can h$
2024-06-08 07:10:18,045 INFO scripts.py:926 -- Local node IP: 128.223.192.151
2024-06-08 07:10:18,163 SUCC scripts.py:939 -- --------------------
2024-06-08 07:10:18,163 SUCC scripts.py:940 -- Ray runtime started.
2024-06-08 07:10:18,163 SUCC scripts.py:941 -- --------------------
2024-06-08 07:10:18,163 INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-06-08 07:10:18,163 INFO scripts.py:944 --   ray stop
2024-06-08 07:10:18,164 INFO scripts.py:952 -- --block
2024-06-08 07:10:18,164 INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-06-08 07:10:18,164 INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be$
2024-06-08 07:10:35,089 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 128.223.192.150:6379...
2024-06-08 07:10:35,098 INFO worker.py:1724 -- Connected to Ray cluster.
nvidia-smi output:
Sat Jun  8 07:10:28 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15	   CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:E8:00.0 Off |                   On |
| N/A   31C    P0             44W /  300W |	 87MiB /  81920MiB |     N/A	  Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  |   00000000:E9:00.0 Off |                   On |
| N/A   33C    P0             42W /  300W |	 87MiB /  81920MiB |     N/A	  Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          On  |   00000000:EA:00.0 Off |                   On |
| N/A   33C    P0             44W /  300W |	 87MiB /  81920MiB |     N/A	  Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|	 Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    7   0   0  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    8   0   1  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   2  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   10   0   3  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   11   0   4  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   12   0   5  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   6  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    7   0   0  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    8   0   1  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    9   0   2  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
.
.
.
.
.
+------------------+----------------------------------+-----------+-----------------------+
|  2   11   0   4  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2   12   0   5  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2   13   0   6  |              12MiB /  9728MiB    | 14	0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage	  |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

lscpu output:
('Architecture:        x86_64\nCPU op-mode(s):      32-bit, 64-bit\nByte Order:          Little Endian\nCPU(s):              48\nOn-line CPU(s) list: 0-47\nThread(s) per core:  1\nCore(s) p$
CUDA_HOME: None
CUDA_PATH: /packages/cuda/11.5.1
CUDA_INSTALL_PATH: /packages/cuda/11.5.1
SLURM_GPUS_ON_NODE: 21
CUDA_ROOT: /packages/cuda/11.5.1
SLURM_JOB_GPUS: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
CUDA_VISIBLE_DEVICES: MIG-2f983c36-8705-5f09-a836-184e183fad5f,MIG-282f5f18-f948-5905-891a-63c3dbf4a84a,MIG-70232a77-d181-5144-adba-c568a2c39079,MIG-1d8d9e93-f007-5dc7-b145-81a685f7e303,MIG$
CUDA_SDK: /packages/cuda/11.5.1/samples
Number of GPUs from torch: 1
cuda
Ray gpus: []
Ray cluster resources: {'CPU': 96.0, 'memory': 731023510529.0, 'node:128.223.192.151': 1.0, 'object_store_memory': 317581504511.0, 'GPU': 2.0, 'accelerator_type:A100': 2.0, 'node:__internal$
squares: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 136$
pids: {3310982, 3310863, 3310623, 3310628, 3310630, 3310632, 3310633, 3310634, 3310635, 3310636, 3310637, 3310638, 3310639, 3310640, 3310641, 3310642, 3310643, 3310644, 3310908, 3310933, 33$
hostnames: {'n0150.talapas.uoregon.edu'}, len: 1

This is the Python file I am running:

import torch
import numpy as np
import ray
import os
import socket
import subprocess

def run_nvidia_smi():
    # Command to run nvidia-smi
    command = ["nvidia-smi"]

    # Open a subprocess and run the command
    with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True) as proc:
        output, errors = proc.communicate()

    # Print the outputs
    if proc.returncode == 0:
        print("nvidia-smi output:")
        print(output)
    else:
        print("Error running nvidia-smi:")
        print(errors)

# Call the function
run_nvidia_smi()

# Run lscpu and print the raw (stdout, stderr) tuple returned by communicate()
with subprocess.Popen(["lscpu"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True) as proc:
    print(f'lscpu output:\n{proc.communicate()}')
    
# Get a specific environment variable
cuda_home = os.getenv('CUDA_HOME')
print(f"CUDA_HOME: {cuda_home}")

# Print all environment variables and filter for GPU-related
for key, value in os.environ.items():
    if 'CUDA' in key or 'GPU' in key or 'NVIDIA' in key:
        print(f"{key}: {value}")


print(f'Number of GPUs from torch: {torch.cuda.device_count()}')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)


ray.init(address='auto')
print(f'Ray gpus: {ray.get_gpu_ids()}')
# print(f"{ray.cluster_resources()['GPU']=} GPUs available in Ray cluster")
@ray.remote
def square(x):
    return x * x, os.getpid(), socket.gethostname()

print(f'Ray cluster resources: {ray.cluster_resources()}')
num_cpus_in_ray = int(ray.cluster_resources()['CPU'])

# Launch one square task per CPU in the Ray cluster.
futures = [square.remote(i) for i in range(num_cpus_in_ray)]

squares, pids, hostnames = [list(l) for l in zip(*ray.get(futures))]

print(f'squares: {squares}\npids: {set(pids)}, len: {len(set(pids))}\nhostnames: {set(hostnames)}, len: {len(set(hostnames))}')

This is the sbatch template I am using:

#!/bin/bash

# THIS FILE IS GENERATED BY AUTOMATION SCRIPT! PLEASE REFER TO ORIGINAL SCRIPT!
# THIS FILE IS A TEMPLATE AND IT SHOULD NOT BE DEPLOYED TO PRODUCTION!

#SBATCH --partition={{PARTITION_NAME}}
#SBATCH --job-name={{JOB_NAME}}
#SBATCH --output={{JOB_NAME}}.log
{{GIVEN_NODE}}

### This script works for any number of nodes; Ray will find and manage all resources
#SBATCH --nodes={{NUM_NODES}}
#SBATCH --exclusive

### Give all resources to a single Ray task; Ray can manage the resources internally
#SBATCH --ntasks-per-node=1
###### --gpus-per-task={{NUM_GPUS_PER_NODE}}
### See this stack for gpu info: https://stackoverflow.com/questions/67091056/gpu-allocation-in-slurm-gres-vs-gpus-per-task-and-mpirun-vs-srun
#SBATCH --gres=gpu:6

#SBATCH --time={{DAYS}}-00:00:00     ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --account=XXX  ### Account used for job submission

#SBATCH --mail-type=ALL


# Load modules or your own conda environment here
# module load pytorch/v1.4.0-gpu
# conda activate {{CONDA_ENV}}
module load cuda/11.5.1
{{LOAD_ENV}}

################# DO NOT CHANGE THINGS HERE UNLESS YOU KNOW WHAT YOU ARE DOING ###############
# This script is a modification of the implementation suggested by gregSchwartz18 here:
# https://github.com/ray-project/ray/issues/826#issuecomment-522116599
redis_password=$(uuidgen)
export redis_password

nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=($nodes)

node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w $node_1 hostname --ip-address) # making redis-address

if [[ $ip == *" "* ]]; then
  IFS=' ' read -ra ADDR <<<"$ip"
  if [[ ${#ADDR[0]} > 16 ]]; then
    ip=${ADDR[1]}
  else
    ip=${ADDR[0]}
  fi
  echo "We detect space in ip! You are using IPV6 address. We split the IPV4 address as $ip"
fi

port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
# srun --nodes=1 --ntasks=1 -w $node_1 start-head.sh $ip $redis_password &
srun --nodes=1 --ntasks=1 -w $node_1 \
  ray start --head --node-ip-address=$ip --port=6379 --redis-password=$redis_password --block &
sleep 30

worker_num=$(($SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((i = 1; i <= $worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w $node_i ray start --address $ip_head --redis-password=$redis_password --block &
  sleep 5
done

##############################################################################################

#### call your code below
{{COMMAND_PLACEHOLDER}} {{COMMAND_SUFFIX}}

I am using the standard launch.py available here.
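
I did not paste launch.py itself, but as far as I understand it just fills in the {{...}} placeholders in slurm-template.sh and submits the generated file with sbatch. A rough shell equivalent of what I believe it does (this is an assumption on my part, not the actual script; names and values match the illustrative submission above):

# Sketch of what I understand launch.py to do: substitute the {{...}} placeholders
# in the template and submit the generated job file with sbatch.
sed -e "s/{{JOB_NAME}}/gpu_visibility_test/g" \
    -e "s/{{NUM_NODES}}/2/g" \
    -e "s/{{PARTITION_NAME}}/<gpu_partition>/g" \
    -e "s/{{GIVEN_NODE}}//g" \
    -e "s/{{DAYS}}/1/g" \
    -e "s|{{LOAD_ENV}}|conda activate <my_env>|g" \
    -e "s|{{COMMAND_PLACEHOLDER}}|python test_ray_gpus.py|g" \
    -e "s/{{COMMAND_SUFFIX}}//g" \
    slurm-template.sh > gpu_visibility_test.sh
sbatch gpu_visibility_test.sh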