Resource allocation for Ray Cluster running on Kubernetes

Hi,

I am using PyTorch Lightning for training and Ray Tune for hyperparameter tuning (not using ray_lightning). I have a Kubernetes operator and a head node (m6a.2xlarge) that can spin up at most 5 GPU workers (p2.8xlarge, each with 8 GPUs and 32 vCPUs).

The documentation says “if connected to an existing cluster, you don’t specify resources.”
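For context, my driver connects to that existing cluster roughly like this (the head service address below is a placeholder for my actual one):

```python
import ray

# Connect to the existing Ray cluster on Kubernetes via Ray Client.
# "example-cluster-ray-head" is a placeholder for my head node service.
ray.init(address="ray://example-cluster-ray-head:10001")
```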

Does that mean I don’t need to specify resources on the PyTorch Lightning Trainer object either? And is there also no need to set resources_per_trial?

Currently, if I set gpus=1 in pl.Trainer(gpus=1, ...) and resources_per_trial = {"gpu": 1} and run 16 trials, I see 8 trials on each worker. What I expected was each trial running on its own worker and using all of that worker's GPUs to run faster.
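To make that concrete, here is a simplified sketch of my current setup; the MLP class and the dummy data below stand in for my real model and dataloaders:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray import tune


class MLP(pl.LightningModule):
    """Stand-in for my real multi-layer perceptron."""

    def __init__(self, hidden_size, lr):
        super().__init__()
        self.lr = lr
        self.net = nn.Sequential(
            nn.Linear(32, hidden_size), nn.ReLU(), nn.Linear(hidden_size, 1)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def train_mlp(config):
    model = MLP(config["hidden_size"], config["lr"])
    # Dummy data; my real dataloaders go here.
    data = DataLoader(
        TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
    )
    trainer = pl.Trainer(gpus=1, max_epochs=5)  # one GPU per trial
    trainer.fit(model, data)


analysis = tune.run(
    train_mlp,
    config={
        "hidden_size": tune.choice([64, 128, 256]),
        "lr": tune.loguniform(1e-4, 1e-1),
    },
    num_samples=16,                   # 16 trials
    resources_per_trial={"gpu": 1},   # each trial requests 1 GPU from Ray
)
```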

What is the best way to allocate resources? I am training a multi-layer perceptron (MLP) model with PyTorch Lightning.
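Is the right fix simply to give each trial a whole node and let Lightning run DDP across its 8 GPUs inside the trial, roughly as in the sketch below (reusing the MLP and dummy data from above)? I am not sure whether this is the recommended pattern:

```python
def train_mlp_whole_node(config):
    model = MLP(config["hidden_size"], config["lr"])
    data = DataLoader(
        TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
    )
    # My guess: use all 8 GPUs of one p2.8xlarge via DDP inside the trial.
    # (accelerator="ddp" on the Lightning version I use; newer ones call it strategy.)
    trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=5)
    trainer.fit(model, data)


analysis = tune.run(
    train_mlp_whole_node,
    config={
        "hidden_size": tune.choice([64, 128, 256]),
        "lr": tune.loguniform(1e-4, 1e-1),
    },
    num_samples=16,
    resources_per_trial={"cpu": 32, "gpu": 8},  # reserve one whole worker per trial
)
```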

Also, I have started using Ray Lightning to see whether it helps allocate resources more efficiently.

The documentation about setting gpus on the Trainer is unclear.

The docstring says to set num_gpus: ray_lightning/ray_ddp.py at 3adb809aee8d1c6154e044902d359a456f1859ff · ray-project/ray_lightning · GitHub

While the README says:
Don’t set gpus in the Trainer.
The actual number of GPUs is determined by num_workers.
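Given that, my current Ray Lightning attempt inside the Tune trainable looks roughly like this (reusing the MLP and dummy data from the sketches above; num_workers=8 is my guess for using all 8 GPUs of a node):

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin


def train_mlp_ray_lightning(config):
    model = MLP(config["hidden_size"], config["lr"])
    data = DataLoader(
        TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
    )
    # Per the README: no gpus= on the Trainer; the GPU count comes from num_workers.
    plugin = RayPlugin(num_workers=8, use_gpu=True)
    trainer = pl.Trainer(max_epochs=5, plugins=[plugin])
    trainer.fit(model, data)
```

Is that correct, or do I also need to set gpus / num_gpus somewhere as the docstring suggests? And what should resources_per_trial (or its equivalent) be for such a trial?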