Resource allocation for Ray Cluster running on Kubernetes


I am using PyTorch Lightning for training and Ray for hyperparameter tuning (not using ray_lightning). I have a Kubernetes operator and a head node (m6a.2xlarge) that can spin up a maximum of 5 GPU workers (p2.8xlarge, each with 8 GPUs and 32 vCPUs).

The documentation says “if connected to existing cluster, you don’t specify resources.”

Does that mean I don’t need to specify resources on the PyTorch Lightning Trainer object either? And is there no need to set resource_per_trial?

Currently, if I set gpus=1 in plt.Trainer(gpus=1…) and resource_per_trial = {"gpus": 1} and run 16 experiments, I see 8 experiments on each worker. What I expect is for each experiment to run on its own worker and use all of that worker's GPUs so that it runs faster.
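The placement described above is consistent with simple packing arithmetic: if each trial reserves 1 GPU, an 8-GPU worker has room for 8 trials at once. The following is a minimal sketch of that arithmetic only, not of Ray's actual scheduler. The names Node, trials_per_node and nodes_needed are hypothetical helpers made up for illustration, and the 1 CPU per trial is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Resources of one worker node (hypothetical helper, not a Ray class)."""
    cpus: int
    gpus: int


def trials_per_node(node: Node, cpu_per_trial: int, gpu_per_trial: int) -> int:
    """Number of trials that fit on one node if each reserves these resources."""
    limits = []
    if cpu_per_trial > 0:
        limits.append(node.cpus // cpu_per_trial)
    if gpu_per_trial > 0:
        limits.append(node.gpus // gpu_per_trial)
    if not limits:
        raise ValueError("a trial must reserve at least one CPU or GPU")
    return min(limits)


def nodes_needed(num_trials: int, per_node: int) -> int:
    """Nodes required to run all trials at the same time."""
    if per_node <= 0:
        raise ValueError("a single trial does not fit on one node")
    return math.ceil(num_trials / per_node)


# One p2.8xlarge worker as described in the post: 32 vCPUs, 8 GPUs.
worker = Node(cpus=32, gpus=8)
max_workers = 5
num_trials = 16

# Assumed: each trial reserves 1 CPU. With 1 GPU per trial, trials get packed.
packed = trials_per_node(worker, cpu_per_trial=1, gpu_per_trial=1)
packed_nodes = nodes_needed(num_trials, packed)

# With 8 GPUs per trial, one trial fills a whole worker.
exclusive = trials_per_node(worker, cpu_per_trial=1, gpu_per_trial=8)
concurrent_exclusive = min(num_trials, max_workers * exclusive)

print(packed, packed_nodes, exclusive, concurrent_exclusive)
```

Under these assumptions, 1 GPU per trial gives 8 trials per worker spread over 2 workers, which matches what I am seeing. Reserving 8 GPUs per trial would give one trial per worker, so at most 5 of the 16 trials could run at once and the rest would wait. Reserving the GPUs only controls placement; the Trainer itself would still have to be configured to train on all 8 GPUs for a trial to run faster.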

What is the best way to allocate resources? I am training a multilayer perceptron model using PyTorch Lightning.

Also, I started using Ray Lightning to see if it helps allocate resources more efficiently.

The documentation on setting gpus on the Trainer is unclear.

The docstring says to set num_gpus: ray_lightning/ at 3adb809aee8d1c6154e044902d359a456f1859ff · ray-project/ray_lightning · GitHub

While the README says:
Don’t set gpus in the Trainer.
The actual number of GPUs is determined by num_workers.