What is the best configuration for 1 GPU and 1 CPU?


Hi,

I just tested Ray Tune with the following example code (https://docs.ray.io/en/latest/tune/examples/includes/mnist_ptl_mini.html). It takes 181 seconds to run the 10 trials.

Now, my computer has an Nvidia 2080 Ti GPU, so I want to use it to speed up the process. On the last line, I changed gpus_per_trial from 0 to 1:

tune_mnist(num_samples=10, num_epochs=10, gpus_per_trial=1)

Now, it takes 867 seconds!!! Why does it take more time with a GPU?

What is the best configuration for my setup (a 16-core CPU and one 2080 Ti GPU)?

Regards,

Philippe

Hi @Philippe_JUHEL,

Your cluster only has one GPU, so the 10 trials run sequentially, since each one needs a full GPU. Previously, all 10 trials ran in parallel because each only required 1 CPU. That’s the reason for the runtime increase.

It’s possible to allocate fractional GPU amounts to each trial (e.g. gpus_per_trial=0.25 to run 4 trials concurrently). However, you must make sure each trial stays within its share of GPU memory: Tune only assigns trials to the correct GPU IDs and won’t limit their GPU memory allocation.
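As a configuration sketch (assuming the `tune.with_resources` / `Tuner` API from the resources guide; `train_fn` is a placeholder for your actual MNIST training function), sharing the single GPU across four trials could look like:

```python
from ray import tune


def train_fn(config):
    pass  # placeholder for the PyTorch Lightning MNIST training loop


# Each trial requests a quarter of the GPU, so up to 4 trials can run
# concurrently on the single 2080 Ti -- provided their combined memory
# usage fits, since Tune does not enforce the memory split itself.
trainable = tune.with_resources(train_fn, {"cpu": 4, "gpu": 0.25})
tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=10))
results = tuner.fit()
```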

One question for you: did you come across A Guide To Parallelism and Resources — Ray 2.2.0 when troubleshooting, and does it answer your question? That guide gives a few examples of resource allocation, including one with fractional CPUs, but doesn’t mention specifying fractional GPUs per trial.


Hi Justin,

Thank you for your answer.
Yes, I’ve read the A Guide To Parallelism and Resources — Ray 2.2.0 document, but I didn’t understand some points.

For example, if I have a 4-core CPU and I want to run 8 trials, which configuration is best (shortest total computation time):

 tune.with_resources(trainable, {"cpu": 1})   # 4 concurrent trials at a time

or

tune.with_resources(trainable, {"cpu": 0.5})  # 8 concurrent trials at a time

If I suppose that each trial takes the same amount of time to be processed (duration = T, the time needed for 1 trial to be processed by 1 core), I expect this:

For solution 1:

duration = T (first, the first 4 trials are processed on the 4 cores at the same time)
duration = T (then, the last 4 trials are processed on the 4 cores)
total duration = T + T = 2T

For solution 2:

Each core has to process 2 concurrent trials at the same time, so the duration to finish those 2 trials is 2T.
total duration = 2T
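My reasoning as a quick back-of-the-envelope script (`makespan` is just my own helper name; it assumes perfectly CPU-bound trials, fair time-slicing, and no switching overhead):

```python
def makespan(num_trials, num_cores, trials_per_core, T=1.0):
    """Time to finish all trials when each core runs `trials_per_core`
    trials concurrently, assuming fair time-slicing on each core."""
    concurrent = num_cores * trials_per_core   # trials running at once
    waves = -(-num_trials // concurrent)       # ceiling division
    return waves * trials_per_core * T         # each wave lasts trials_per_core * T

# Solution 1: {"cpu": 1}   -> 4 concurrent trials, two waves of T each
print(makespan(8, 4, 1))   # 2.0
# Solution 2: {"cpu": 0.5} -> 8 concurrent trials, one wave of 2T
print(makespan(8, 4, 2))   # 2.0
```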

Is it right to think that each solution will take the same amount of time?

And in the "How to leverage GPUs?" chapter, for the second example

tune.with_resources(trainable, {"cpu": 2, "gpu": 1})

it says:

If you have 4 CPUs and 1 GPU on your machine, this will run 1 trial at a time.

Does it mean that one trial will use the GPU AND all the CPUs?

Philippe

Hey @Philippe_JUHEL,

Is it right to think that each solution will take the same amount of time?

This is a tough question to answer definitively, since it really depends on your code, the operating system, and your hardware.

Generally, if you have 4 cores, then running 4 concurrent trials would be optimal to avoid overhead from OS context switching.

That being said, if your trainable is not CPU-bound (and is instead I/O-bound, for example), then running more than 4 concurrent trials might be faster, since it allows the operating system to run another trial while one is waiting on I/O operations.

However, ultimately the best way to determine which one is faster is just to run both and time them.
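For example (a hedged sketch; `run_experiment` is a stand-in for whatever kicks off one full tuning run, e.g. a `tuner.fit()` call under each resource setting):

```python
import time


def time_config(run_experiment, label):
    # Run one complete tuning experiment and report its wall-clock time.
    start = time.perf_counter()
    run_experiment()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return elapsed

# Hypothetical usage, timing both resource settings back to back:
# time_config(lambda: tuner_one_cpu.fit(), "cpu=1")
# time_config(lambda: tuner_half_cpu.fit(), "cpu=0.5")
```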

Does it mean that one trial will use the GPU AND all the CPUs?

This means 1 trial will be allocated 2 CPU cores and 1 GPU device. Whether the trial fully utilizes these depends on the code that the trial is running.
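To make "allocated" concrete: a minimal sketch (assuming, as the Ray docs describe, that a trial's assigned devices are exposed via the `CUDA_VISIBLE_DEVICES` environment variable, which frameworks like PyTorch respect) of how a trial could check which GPU it was handed:

```python
import os


def assigned_gpu_ids(env_value):
    # Parse the comma-separated device list, e.g. "0" or "0,1".
    return [v for v in env_value.split(",") if v]

# Inside a trial, Ray sets CUDA_VISIBLE_DEVICES to the allocated device(s),
# so the process only sees that GPU; actually saturating it is up to the
# training code itself.
ids = assigned_gpu_ids(os.environ.get("CUDA_VISIBLE_DEVICES", ""))
```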