Hi,
I am using PyTorch Lightning for training and Ray for hyperparameter tuning (not using ray_lightning). I have a Kubernetes operator and a head node (m6a.2xlarge) that can spin up at most 5 GPU workers (p2.8xlarge, each with 8 GPUs and 32 vCPUs).
The documentation says "if connected to existing cluster, you don't specify resources."
Does that mean I don't need to specify resources on the PyTorch Lightning Trainer object either, and that there is no need to set resources_per_trial as well?
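For reference, the driver connects to the existing cluster roughly like this (a sketch; `address="auto"` is the usual placeholder, my actual connection string may differ):

```python
import ray

# Connect to the cluster the Kubernetes operator already brought up;
# no CPU/GPU counts are passed here, per the documentation note above.
ray.init(address="auto")
```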
Currently, if I set gpus=1 on the Trainer (pl.Trainer(gpus=1, …)) and resources_per_trial={"gpu": 1}, and then run 16 experiments, I see 8 experiments on each worker. What I am expecting is each experiment running on its own worker and using all of that worker's GPUs so it finishes faster.
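Roughly what my current setup looks like (a minimal, self-contained sketch rather than my exact code: the toy MLPModel, random data, search space, and CPU count are placeholders):

```python
import torch
from torch import nn
import pytorch_lightning as pl
from ray import tune


class MLPModel(pl.LightningModule):
    """Toy stand-in for my actual multi-layer perceptron."""

    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def train_mlp(config):
    # Random tensors in place of my real data, just to keep the sketch runnable
    dataset = torch.utils.data.TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = torch.utils.data.DataLoader(dataset, batch_size=64)

    trainer = pl.Trainer(gpus=1, max_epochs=5)  # gpus=1, as in my current setup
    trainer.fit(MLPModel(lr=config["lr"]), loader)


analysis = tune.run(
    train_mlp,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # example search space
    num_samples=16,                              # the 16 experiments mentioned above
    resources_per_trial={"cpu": 4, "gpu": 1},    # 1 GPU per trial -> Tune packs 8 trials onto each 8-GPU worker
)
```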
What is the best way to allocate resources here? I am running a multi-layer perceptron model with PyTorch Lightning.