How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Does anyone know how to configure GPUs when using the Cluster Autolauncher with an on-prem cluster?
The goal is to use the Cluster Autolauncher with a few Lambdalabs cloud instances to run a DL training job.
These are the steps I’m following (based on this doc):
- Launch (gpu_1x_a10) Lambdalabs instances
- Use a config file along the lines of the sketch just after this list
- Run `ray up lambdalabs-launcher-config.yaml`
- Run `ray dashboard lambdalabs-launcher-config.yaml`
- Run `RAY_ADDRESS='http://localhost:8265' ray job submit --working-dir . -- python check_gpu_ray.py`
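For reference, the config file is roughly like this (a sketch only; the IPs, SSH key path, and image tag are placeholders, not my exact values):

```yaml
cluster_name: lambdalabs-cluster

provider:
  type: local
  head_ip: <head-instance-ip>   # placeholder
  worker_ips:
    - <worker-instance-ip>      # placeholder

auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/<key>.pem  # placeholder

docker:
  image: rayproject/ray-ml:latest-gpu
  container_name: ray_container
```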
The Python script is simple:
```python
import torch
import ray

ray.init()
print(torch.cuda.is_available())
```
The output is False.
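To double-check what Ray itself registers, I can also run this minimal snippet (`ray.cluster_resources()` is the standard API for listing the resources the cluster reports):

```python
import ray

# Connect to the already-running cluster rather than starting a new one.
ray.init(address="auto")

# If "GPU" is missing from this dict, Ray never registered the GPUs,
# regardless of what torch sees inside the container.
print(ray.cluster_resources())
```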
Also, when accessing the dashboard, the GPU column in the Cluster tab shows N/A.
I tried adding this to the config file:
```yaml
available_node_types:
  ray.head.default:
    resources: {"CPU": 1, "GPU": 1}
```
But I got this error: “The field available_node_types is not supported for on-premise clusters.”
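From reading the spec, the only other place where resource counts seem forceable for on-prem clusters is the `ray start` commands. A sketch of what I mean (untested; `--num-gpus` is a documented `ray start` flag, and the rest mirrors the default commands):

```yaml
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --num-gpus=1 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --num-gpus=1
```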
Manually installing and running the Ray scripts on the hosts works fine and the GPUs are detected; the issue only appears with the Docker containers launched via the autolauncher.
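My assumption is that the containers need explicit GPU passthrough, and the spec’s `docker.run_options` field looks like the place for it (untested sketch; `--gpus all` is the standard Docker flag for exposing host GPUs):

```yaml
docker:
  image: rayproject/ray-ml:latest-gpu
  container_name: ray_container
  run_options:
    - --gpus all   # pass host GPUs through to the Ray container
```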
I’ve looked at the rest of the cluster configuration spec, but the other options seem to be supported only for the cloud providers, not for on-prem clusters.
Any help will be appreciated, thanks.