GPU configuration with Cluster Launcher + On-premise Cluster

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Does anyone know how to configure GPUs when using the Cluster Launcher with an on-prem cluster?

The goal is to use the Cluster Launcher with a few Lambdalabs cloud instances to run a DL training job.

These are the steps I’m following (based on this doc):

  • Launch (gpu_1x_a10) Lambdalabs instances
  • Use the following config file
  • Run ray up lambdalabs-launcher-config.yaml
  • Run ray dashboard lambdalabs-launcher-config.yaml
  • Run RAY_ADDRESS='http://localhost:8265' ray job submit --working-dir . -- python check_gpu_ray.py
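
The contents of the config file referenced in the steps above aren't shown; a purely hypothetical sketch of an on-prem (provider type `local`) launcher config is below. The IPs, SSH user, and key path are placeholders, not values from the post; only the docker section mirrors what appears later in the thread:

```yaml
# Hypothetical sketch of lambdalabs-launcher-config.yaml for an on-prem cluster.
cluster_name: lambdalabs
provider:
  type: local              # on-premise / manually provisioned nodes
  head_ip: <head-node-ip>  # placeholder
  worker_ips:
    - <worker-node-ip>     # placeholder
auth:
  ssh_user: ubuntu         # placeholder; depends on the instance image
  ssh_private_key: ~/.ssh/<key>  # placeholder
docker:
  container_name: ray_container
  image: rayproject/ray-ml:latest-gpu
  pull_before_run: true
```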

The python script is simple:

import torch
import ray

ray.init()
print(torch.cuda.is_available())

The output is False.

Also, when accessing the dashboard, the GPU column in the Cluster tab shows N/A.

I tried adding this to the config file:

available_node_types:
    ray.head.default:
        resources: {"CPU":1, "GPU":1}

But I got this error “The field available_node_types is not supported for on-premise clusters.”
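
For on-prem clusters, a hedged alternative to available_node_types is to pass resource counts directly to `ray start` in the standard start-command sections of the cluster config. The fields and flags below come from the general cluster config spec and `ray start` CLI; verify them against your Ray version. Note that this only annotates Ray's resource accounting; it does not by itself expose the GPU devices to a Docker container:

```yaml
# Sketch: annotate GPU resources via the ray start commands instead of
# available_node_types (which is rejected for on-premise clusters).
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --num-gpus=1 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --num-gpus=1
```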

Manually installing and running the Ray scripts on the host works fine and the GPUs are detected; the issue only appears when using the Docker containers launched via the cluster launcher.

I’ve looked at the cluster configuration spec, but the relevant options seem to be supported only for cloud environments, not for on-prem clusters.

Any help will be appreciated, thanks.

Quick update. I solved the issue by adding --gpus all to the run_options section of the docker config. For example:

docker:
  container_name: ray_container
  image: rayproject/ray-ml:latest-gpu
  pull_before_run: true
  run_options:
  - --ulimit nofile=65536:65536
  - --gpus all
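
To sanity-check whether the driver made it into the container before involving Ray or PyTorch, a small stdlib-only heuristic can help. This is a hypothetical helper, not part of Ray: it just checks whether `nvidia-smi` is on the PATH or the NVIDIA `/proc` entries are mounted, which is typically the case only when the container was started with `--gpus all` (or an equivalent NVIDIA runtime setting):

```python
import os
import shutil


def gpu_visible_in_container() -> bool:
    """Heuristic check (assumes an NVIDIA driver/runtime) that GPU devices
    are exposed inside the container: either `nvidia-smi` is on PATH or
    the NVIDIA driver's /proc entries are mounted."""
    return (
        shutil.which("nvidia-smi") is not None
        or os.path.exists("/proc/driver/nvidia")
    )


if __name__ == "__main__":
    # Prints True when the container can see the GPU driver, False otherwise.
    print(gpu_visible_in_container())
```

If this prints False inside the Ray container, `torch.cuda.is_available()` will also be False regardless of the Ray resource configuration.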
