GPU configuration with Cluster Launcher + On-premise Cluster

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Does anyone know how to configure GPUs when using the Cluster Launcher with an on-prem cluster?

The goal is to use the Cluster Launcher with a few Lambdalabs cloud instances to run a DL training job.

These are the steps I’m following (based on this doc):

  • Launch (gpu_1x_a10) Lambdalabs instances
  • Use the following config file
  • Run ray up lambdalabs-launcher-config.yaml
  • Run ray dashboard lambdalabs-launcher-config.yaml
  • Run RAY_ADDRESS='http://localhost:8265' ray job submit --working-dir . -- python check_gpu_ray.py
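
The contents of the config file referenced in the steps above aren't shown; a purely hypothetical sketch of an on-prem (provider type `local`) launcher config is below. The IPs, SSH user, and key path are placeholders, not values from the post; only the docker section mirrors what appears later in the thread:

```yaml
# Hypothetical sketch of lambdalabs-launcher-config.yaml for an on-prem cluster.
cluster_name: lambdalabs
provider:
  type: local              # on-premise / manually provisioned nodes
  head_ip: <head-node-ip>  # placeholder
  worker_ips:
    - <worker-node-ip>     # placeholder
auth:
  ssh_user: ubuntu         # placeholder; depends on the instance image
  ssh_private_key: ~/.ssh/<key>  # placeholder
docker:
  container_name: ray_container
  image: rayproject/ray-ml:latest-gpu
  pull_before_run: true
```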

The python script is simple:

import torch
import ray

ray.init()
print(torch.cuda.is_available())

The output is False.

Also, when accessing the dashboard, the GPU column in the Cluster tab shows N/A.

I tried adding this to the config file:

available_node_types:
    ray.head.default:
        resources: {"CPU":1, "GPU":1}

But I got this error “The field available_node_types is not supported for on-premise clusters.”
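
For on-prem clusters, a hedged alternative to available_node_types is to pass resource counts directly to `ray start` in the standard start-command sections of the cluster config. The fields and flags below come from the general cluster config spec and `ray start` CLI; verify them against your Ray version. Note that this only annotates Ray's resource accounting; it does not by itself expose the GPU devices to a Docker container:

```yaml
# Sketch: annotate GPU resources via the ray start commands instead of
# available_node_types (which is rejected for on-premise clusters).
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --num-gpus=1 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --num-gpus=1
```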

Manually installing and running the Ray scripts on the host works fine and the GPUs are detected; the issue only appears when using the Docker containers launched via the cluster launcher.

I’ve looked at the cluster configuration spec, but the relevant options seem to be supported only for cloud environments, not for on-prem clusters.

Any help will be appreciated, thanks.

Quick update. I solved the issue by adding --gpus all to the run_options section of the docker config. For example:

docker:
  container_name: ray_container
  image: rayproject/ray-ml:latest-gpu
  pull_before_run: true
  run_options:
  - --ulimit nofile=65536:65536
  - --gpus all
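
To sanity-check whether the driver made it into the container before involving Ray or PyTorch, a small stdlib-only heuristic can help. This is a hypothetical helper, not part of Ray: it just checks whether `nvidia-smi` is on the PATH or the NVIDIA `/proc` entries are mounted, which is typically the case only when the container was started with `--gpus all` (or an equivalent NVIDIA runtime setting):

```python
import os
import shutil


def gpu_visible_in_container() -> bool:
    """Heuristic check (assumes an NVIDIA driver/runtime) that GPU devices
    are exposed inside the container: either `nvidia-smi` is on PATH or
    the NVIDIA driver's /proc entries are mounted."""
    return (
        shutil.which("nvidia-smi") is not None
        or os.path.exists("/proc/driver/nvidia")
    )


if __name__ == "__main__":
    # Prints True when the container can see the GPU driver, False otherwise.
    print(gpu_visible_in_container())
```

If this prints False inside the Ray container, `torch.cuda.is_available()` will also be False regardless of the Ray resource configuration.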
