I'm using Ray to train a PyTorch model. When num_gpus=1.0 is set, it takes about 15 minutes to train a model, but when num_gpus=0.5 is set, it takes about 45 minutes. Note that GPU memory is sufficient in both cases. What causes this slowdown?
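To illustrate the kind of setup I mean, here is a minimal sketch assuming a Ray Tune trainable with fractional GPU resources (`train_fn` and the tiny model are placeholders, not my actual training code):

```python
import torch
from ray import tune

def train_fn(config):
    # Placeholder training loop; the real model and data pipeline are omitted.
    device = torch.device("cuda")
    model = torch.nn.Linear(128, 10).to(device)
    # ... train for a fixed number of epochs ...

# Each trial requests half a GPU, so two trials can be packed onto one physical GPU.
tune.run(
    train_fn,
    resources_per_trial={"cpu": 2, "gpu": 0.5},  # vs. {"cpu": 2, "gpu": 1.0}
    num_samples=4,
)
```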
@xwjiang2010 do you have context for this question?
Hi @JUstForFUN,
Thanks for posting your question!
Even though N trials can fit onto a single GPU without running into OOM, setting num_gpus to 1/N is not necessarily optimal.
For example, CPU-GPU bandwidth could be a limiting factor when multiple trials are sharing one GPU.
Maybe you could try some GPU profiling?
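For example, something along these lines inside a single trial (a sketch using torch.profiler; `loader` and `train_step` are placeholders for your own data loader and step function) can show whether time is going to CUDA kernels or to host-to-device copies:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile ~20 training steps and break down CPU vs. CUDA time per op.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        if step >= 20:
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Watching nvidia-smi while one vs. two trials are running (e.g. `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1`) can also tell you whether the GPU is actually being kept busy.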