I'm using Ray to train a PyTorch model. With num_gpus=1.0, training one model takes about 15 minutes, but with num_gpus=0.5 it takes about 45 minutes, even though GPU memory is sufficient. Why does this happen?
@xwjiang2010 do you have context for this question?
Thanks for posting your question!
Even though N trials can fit on one GPU without running into OOM, setting num_gpus=1/N is not necessarily optimal: the co-located trials still compete for the GPU's compute and bandwidth.
For example, CPU-GPU transfer bandwidth can become the limiting factor when multiple trials share one GPU.
Maybe you could try some GPU profiling (e.g. utilization and host-to-device transfer timings) to confirm where the bottleneck is?
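To illustrate the setup being discussed, here is a minimal sketch of packing two trials onto one GPU with Ray Tune's fractional resource request. This assumes Ray 2.x and PyTorch are installed; the training loop is a hypothetical stand-in for the real model, and the exact reporting API may vary across Ray versions.

```python
# Sketch (assumes Ray Tune 2.x + PyTorch): time a toy training loop
# while requesting a fractional GPU per trial.
import time

import torch
from ray import tune


def train_fn(config):
    # Hypothetical toy workload standing in for the real model.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(512, 512).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    start = time.perf_counter()
    for _ in range(1000):
        # Data created on CPU then moved, so CPU-GPU transfer
        # bandwidth is part of what the trials contend for.
        x = torch.randn(256, 512).to(device)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    tune.report({"seconds": time.perf_counter() - start})


# With gpu=0.5, Ray schedules two trials onto each GPU; they share
# SMs and PCIe bandwidth, so per-trial wall time can more than double
# even when memory is plentiful.
tuner = tune.Tuner(
    tune.with_resources(train_fn, resources={"cpu": 2, "gpu": 0.5}),
    tune_config=tune.TuneConfig(num_samples=2),
)
results = tuner.fit()
```

Comparing the reported per-trial seconds between `"gpu": 1.0` and `"gpu": 0.5` runs (or watching `nvidia-smi` while both trials run) should show whether contention, rather than memory pressure, explains the slowdown.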