GPUs Causing Slower Training

Hello. Thank you in advance for your response.

I am working on a project that uses Ray Tune and I’ve noticed slower performance when GPU drivers are recognized by Ray.

I set up two GCP VMs, both machine type n1-standard-32 (32 vCPUs, 120 GB memory), with one of the VMs having 4 x NVIDIA Tesla K80 GPUs. Both VMs use an Ubuntu 18.04 base image.

The NVIDIA software I installed is CUDA compilation tools, release 10.1, V10.1.243, and cuDNN (libcudnn.so.7.6.5).

I used this script to test each VM. Keeping everything else the same, I changed some of the parameters in tune.run().

analysis = tune.run(
    train_mnist,
    name="exp",
    scheduler=sched,
    metric="mean_accuracy",
    mode="max",
    stop={
        "mean_accuracy": 0.99,
        "training_iteration": num_training_iterations
    },
    num_samples=64,
    resources_per_trial={
        "cpu": 1,
        "gpu": 0
    },
    config={
        "threads": 2,
        "lr": tune.uniform(0.09, 0.1),
        "momentum": tune.uniform(0.8, 0.9),
        "hidden": tune.randint(32, 64),
    })
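
For context, train_mnist is essentially the Keras MNIST trainable from the Tune examples; a rough sketch of it (not my exact script) looks like this:

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import SGD
from ray.tune.integration.keras import TuneReportCallback


def train_mnist(config):
    # Load and normalize MNIST.
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Small fully connected model; the hidden width comes from the search space.
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(config["hidden"], activation="relu"),
        Dense(10, activation="softmax"),
    ])
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=SGD(learning_rate=config["lr"], momentum=config["momentum"]),
        metrics=["accuracy"],
    )
    model.fit(
        x_train, y_train,
        batch_size=128,
        epochs=12,  # one epoch per training_iteration; stopped early by tune.run()'s stop criteria
        validation_data=(x_test, y_test),
        verbose=0,
        # Reports the Keras "accuracy" metric back to Tune as "mean_accuracy" after each epoch.
        callbacks=[TuneReportCallback({"mean_accuracy": "accuracy"})],
    )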

And I also added a timer.

if __name__ == "__main__":
    import time
    start = time.time()
    tune_mnist(num_training_iterations=50)
    print(time.time() - start)

Using conda, I set up my environment with:

conda create -n ray-env-1_2 tensorflow=2.2 pandas
pip install ray[tune]==1.2
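
For what it’s worth, a quick check along these lines shows what each environment and Ray actually see (a minimal sketch):

# Run inside the activated conda env on each VM.
import tensorflow as tf
import ray

print("TF version:", tf.__version__)
print("GPUs visible to TF:", tf.config.list_physical_devices("GPU"))

ray.init()
print("Resources seen by Ray:", ray.cluster_resources())
ray.shutdown()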

Running the same script in the same environment on each VM resulted in the following runtimes:

  • CPU VM: 83.4s
  • GPU VM: 105.7s

I do notice that, despite the conda environment not having a tensorflow build that is “GPU compatible”, Ray recognizes the GPUs, since it prints out Resources requested: 32/32 CPUs, 0/4 GPUs.

I exported each environment’s yaml file and confirmed the environments were exactly the same. I also tried Ray version 1.4, but the runtime was the same.

I then created an environment that could utilize the GPUs. I created a new conda environment with:

conda create -n ray-env-1_2-gpu tensorflow-gpu=2.2 pandas
pip install ray[tune]==1.2

With the setting resources_per_trial={"cpu": 1, "gpu": 0}, the new runtime was:

  • GPU VM: 125.8s

Changing the setting to resources_per_trial={"cpu": 1, "gpu": 1}, the new runtime was:

  • GPU VM: 344.8s

Noteworthy

  • In the conda environment with tensorflow-gpu, I was able to see the GPUs with tf.config.list_physical_devices('GPU').
  • With resources_per_trial={"cpu": 1, "gpu": 0}, I see the error message E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected. Setting os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3" makes no difference to the runtime or the error message.
  • With resources_per_trial={"cpu": 1, "gpu": 1}, the CUDA_ERROR_NO_DEVICE error goes away, but the runtime is much slower.
  • I tried ray.init(num_gpus=0) (sketched below) and the GPUs are no longer reported (Resources requested: 32/32 CPUs, 0/0 GPUs); however, the runtime is still around 125s.
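
For reference, those last two bullets amount to something like this at the top of the script (a minimal sketch; the tune.run() call stays exactly as shown earlier):

import os
import ray

# (These two changes were tried in separate runs, not together.)

# Tried exposing the GPUs explicitly; no effect on the cuInit error or the runtime.
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3"

# Tried hiding the GPUs from Ray; Tune then reports 0/0 GPUs, but the runtime stays ~125s.
ray.init(num_gpus=0)

# ... the same tune.run(...) call as before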

I am new to Ray and have been tasked with making performance comparisons between CPUs and GPUs. However, I cannot have confidence in any comparisons I make if there is such a wide range of runtimes. Can someone please explain what is going on?

I saw here that Richard said:

setup for the distributed job actually takes quite a long time (25 seconds).

My theory is that because Ray recognizes the GPUs, it spends additional time distributing jobs. Perhaps the CPU VM has better performance because there are no GPUs to slow down Ray’s job distribution?

He goes on to say:

Obviously, if your training run takes 1 hour, this will not be an issue.

However, I am running experiments for work, and the differences I am seeing are significant: an experiment that takes 80.5 min on the VM with GPUs and resources_per_trial={"cpu": 1, "gpu": 0} takes 54.26 min on the VM without any GPUs.

It seems to me that Ray Tune is taking additional steps because it recognizes that GPUs exist. Is there a setting to hide the GPUs so that the GPU VM can match the performance of a VM without any GPUs? Is there another way to use Ray Tune to compare a VM with and without GPUs? Any help is appreciated!

Hey @hart thanks for posting this!

Just to make sure I’m understanding this correctly, you timed 4 different experiment setups

  1. CPU VM, 1 CPU per trial, tensorflow v2.2: 83.4s
  2. GPU VM, 1 CPU per trial, tensorflow v2.2: 105.7s
  3. GPU VM, 1 CPU per trial, tensorflow-gpu v2.2: 125.8s
  4. GPU VM, 1 CPU & 1 GPU per trial, tensorflow-gpu v2.2: 344.8s

and the expected behavior is that setup 4 should be the fastest.

For TF v2.0 and up, the base tensorflow package already supports GPUs (GPU support | TensorFlow). You shouldn’t need to install a separate tensorflow-gpu package, so I don’t think there should actually be any difference between setups 2 and 3. The difference in time could just be due to variance.
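
One quick way to sanity check this (a small sketch using standard TF calls) is to see whether the installed build was compiled with CUDA and whether it can see the GPUs at runtime:

import tensorflow as tf

# Check whether this TF build was compiled with CUDA support,
# and whether it can see any GPUs at runtime.
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))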

Another thing to check: for setup 4, can you confirm that your Keras model is actually using the GPUs? You can follow the steps in “Can I run Keras model on GPU?” on Stack Overflow to confirm this.

Also, a reason why the total time for setup 4 is much higher could be that it actually has less concurrency. Since you only have 4 GPUs, if each trial requires 1 GPU, then you can only run 4 trials in parallel. But if each trial only requires 1 CPU, then 32 trials can run in parallel. For a fair comparison, I would measure the time for an individual trial (which should already be reported by Tune), or set num_samples=4.
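
For the per-trial numbers, something like this should work after the run finishes (a sketch, assuming the default columns Tune reports):

# Per-trial wall-clock time is reported by Tune automatically; one way to look
# at it is the results dataframe returned by tune.run().
df = analysis.dataframe()
print(df[["trial_id", "time_total_s", "mean_accuracy"]].sort_values("time_total_s"))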

Lastly, the difference between setup 1 and setup 2 might be due to Tensorflow. Do you see the same behavior if you try out a simple Tune example that doesn’t include Tensorflow?
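
Something along these lines would work as a TensorFlow-free baseline (a sketch modeled on tune_basic_example; the exact objective doesn’t matter):

import time
from ray import tune


def easy_objective(config):
    # Dummy objective with no TensorFlow: sleep briefly and report a score
    # derived from the sampled hyperparameters.
    for step in range(100):
        time.sleep(0.01)
        score = (0.1 + config["width"] * step / 100) ** (-1) + config["height"] * 0.01
        tune.report(mean_loss=score)


if __name__ == "__main__":
    analysis = tune.run(
        easy_objective,
        num_samples=64,
        resources_per_trial={"cpu": 1},
        config={
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
        },
    )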

Hi @amogkam thanks for your response. Please see my responses below.

Just to make sure I’m understanding this correctly, you timed 4 different experiment setups

  1. CPU VM, 1 CPU per trial, tensorflow v2.2: 83.4s
  2. GPU VM, 1 CPU per trial, tensorflow v2.2: 105.7s
  3. GPU VM, 1 CPU per trial, tensorflow-gpu v2.2: 125.8s
  4. GPU VM, 1 CPU & 1 GPU per trial, tensorflow-gpu v2.2: 344.8s

and the expected behavior is that setup 4 should be the fastest.

That is correct. I would expect setups 1-3 to have the same runtime. As I found in the tests below, it makes sense for experiment 4 to have the slowest total runtime due to concurrency limits, since the VM only has four GPUs.

For TF v2.0 and up, the base tensorflow package already supports GPUs (GPU support | TensorFlow). You shouldn’t need to install a separate tensorflow-gpu package, so I don’t think there should actually be any difference between setups 2 and 3. The difference in time could just be due to variance.

In my experience, TF v2.0 and up has GPU support if installed with pip. Installing TF v2.0 using conda does not pull in the cudnn or cudatoolkit packages. That is why I went with tensorflow-gpu, which of course includes tensorflow. I agree that there shouldn’t be any time difference between setups 2 and 3. I’ve run these tests multiple times and they are consistently around 105s and 125s, so it’s hard for me to imagine variance is causing the time differences.

Another thing to check: for setup 4, can you confirm that your Keras model is actually using the GPUs? You can follow the steps in “Can I run Keras model on GPU?” on Stack Overflow to confirm this.

I am able to see the GPU devices using tf.config.list_physical_devices('GPU') and device_lib.list_local_devices(). When running experiment 4 (1 CPU & 1 GPU), I can watch nvidia-smi and see that the GPUs are in fact being used.

Also, a reason why the total time for setup 4 is much higher could be that it actually has less concurrency. Since you only have 4 GPUs, if each trial requires 1 GPU, then you can only run 4 trials in parallel. But if each trial only requires 1 CPU, then 32 trials can run in parallel. For a fair comparison, I would measure the time for an individual trial (which should already be reported by Tune), or set num_samples=4.

I ran a new series of tests with num_samples = 4, 32, and 64. I also narrowed the hyperparameter range and increased the number of epochs.

batch_size = 128
num_classes = 10
epochs = 20
num_samples = 4  # also ran with 32 and 64
num_training_iterations = 50
config = {
    "threads": 2,
    "lr": tune.uniform(0.09, 0.1),
    "momentum": tune.uniform(0.8, 0.9),
    "hidden": tune.randint(128, 140),
}

The results are as follows:

As you mentioned, I took a look at the times for individual trials that have close to the same hidden, lr, and momentum hyperparameters and recorded the following (individual trial times are for num_samples=64):

  1. CPU VM, 1 CPU per trial, tensorflow v2.2: 92.9s
  2. GPU VM, 1 CPU per trial, tensorflow v2.2: 102s
  3. GPU VM, 1 CPU per trial, tensorflow-gpu v2.2: 120.2s
  4. GPU VM, 1 CPU & 1 GPU per trial, tensorflow-gpu v2.2: 31.7s

As you mentioned, the limited number of GPUs limits the number of trials run concurrently, but it is clear that individual trials are significantly faster using GPUs.

I still do not understand how there could be such wide variation across the experiments (excluding GPU VM, 1 CPU & 1 GPU per trial, tensorflow-gpu v2.2). I could see the tensorflow installed using conda install tensorflow=2.2 and conda install tensorflow-gpu=2.2 being slightly different and resulting in different runtimes; however, that does not explain the time difference between CPU VM, 1 CPU per trial, tensorflow v2.2 and GPU VM, 1 CPU per trial, tensorflow v2.2.

Lastly, the difference between setup 1 and setup 2 might be due to Tensorflow. Do you see the same behavior if you try out a simple Tune example that doesn’t include Tensorflow?

I tried tune_basic_example on each VM and the runtimes were the same. So it does appear (on the surface) to be some issue with how Ray interacts with tensorflow. Again, I’m new to Ray, but I can’t imagine this is only a tensorflow issue, since Ray Tune is responsible for recognizing and assigning GPU usage? It appears that because Ray recognizes the GPUs, additional steps are being taken to either assign GPU resources or not assign them.