Trials placed on the same GPU on a 2 GPU machine despite "num_gpus": 1

vakker00 · April 1, 2021, 9:12am

If I run the following experiment, Ray places both trials' models on the same GPU. Is this expected?

import gym
import numpy as np
import ray
from gym.spaces import Box, Discrete
from ray import tune


class SimpleCorridor(gym.Env):
    def __init__(self, config):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)
        self.observation_space = Box(0.0,
                                     self.end_pos,
                                     shape=(1, ),
                                     dtype=np.float32)

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action):
        assert action in [0, 1], action
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        return [self.cur_pos], 1.0 if done else -0.1, done, {}


if __name__ == "__main__":
    ray.init()

    config = {
        "env": SimpleCorridor,
        "env_config": {
            "corridor_length": 5,
        },
        "num_gpus": 1,
        "lr": tune.grid_search([1e-4, 1e-4]),
        "num_workers": 2,
        "num_envs_per_worker": 1,
        "framework": "torch",
    }

    stop = {"training_iteration": 10}
    results = tune.run('PPO', config=config, stop=stop, verbose=1)

The reported resource requests:

Resources requested: 8.0/24 CPUs, 2.0/2 GPUs, 0.0/17.1 GiB heap, 0.0/8.55 GiB objects (0.0/1.0 accelerator_type:GTX)

Nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   55C    P2    99W / 250W |   1571MiB / 11178MiB |     63%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   46C    P8    13W / 250W |     14MiB / 11176MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                                
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1581      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1528882      C   ray::PPO.train_buffered()         781MiB |
|    0   N/A  N/A   1528905      C   ray::PPO.train_buffered()         781MiB |
|    1   N/A  N/A      1581      G   /usr/lib/xorg/Xorg                  8MiB |
|    1   N/A  N/A      1898      G   /usr/bin/gnome-shell                3MiB |
+-----------------------------------------------------------------------------+

Python: 3.8
Ray: 87c79553e94
OS: Ubuntu 20.04
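
For anyone wanting to check placement without reading the table by eye, here is a minimal sketch that counts `ray::` processes per GPU from `nvidia-smi` text. The helper name `ray_processes_per_gpu` and the regular expression are illustrative assumptions, matched only against the process-table layout shown above; other driver versions may format the table differently.

```python
import re
from collections import Counter

# Matches one row of the "Processes" table in the layout shown in this post:
# | GPU  GI  CI  PID  Type  Process name  <N>MiB |
ROW = re.compile(
    r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+([CG])\s+(.+?)\s+(\d+)MiB\s+\|$"
)

# Process rows copied from the nvidia-smi output above.
SAMPLE = """\
|    0   N/A  N/A      1581      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1528882      C   ray::PPO.train_buffered()         781MiB |
|    0   N/A  N/A   1528905      C   ray::PPO.train_buffered()         781MiB |
|    1   N/A  N/A      1581      G   /usr/lib/xorg/Xorg                  8MiB |
|    1   N/A  N/A      1898      G   /usr/bin/gnome-shell                3MiB |
"""


def ray_processes_per_gpu(smi_text):
    """Return {gpu_index: number of processes whose name starts with 'ray::'}."""
    counts = Counter()
    for line in smi_text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(4).startswith("ray::"):
            counts[int(match.group(1))] += 1
    return dict(counts)


print(ray_processes_per_gpu(SAMPLE))  # {0: 2}: both trials sit on GPU 0
```

On a live machine the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of `SAMPLE`.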

eoakes · April 1, 2021, 2:56pm

Hmm this doesn’t look right to me. They should be placed on different GPUs. @sven1977 could you please confirm that there isn’t a configuration issue here?

smorad · April 12, 2021, 12:24am

You are using grid_search, so I suspect Tune is running both trials at once. Remove the grid_search and you should see only one train process.
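
To illustrate why two train processes show up, here is a rough sketch of how a grid over "lr" expands into trials and how the GPU request adds up. It is plain Python, not Tune's actual implementation; `expand_grid` and `gpus_per_trial` are hypothetical names used only for this illustration.

```python
from itertools import product


def expand_grid(grid):
    """Return one config dict per combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# The original config lists the same learning rate twice, which still
# yields two separate trials.
trials = expand_grid({"lr": [1e-4, 1e-4]})

gpus_per_trial = 1  # from "num_gpus": 1 in the original config
total_gpus = len(trials) * gpus_per_trial

print(len(trials), total_gpus)  # 2 2
```

Two trials at one GPU each is consistent with the "2.0/2 GPUs" in the reported resource line, so both trials running concurrently is expected. The open question in this thread is only why they share GPU 0.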

mannyv · April 12, 2021, 2:20am

Hi @vakker00,

Does it run on separate GPUs if you run it like this?

results = tune.run('PPO', config=config, stop=stop, verbose=1, resources_per_trial={"gpu": 1})

Also, instead of grid_searching the lr like that, if you want two independent training runs, you can pass in num_samples=2.
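
A minimal sketch of the num_samples variant, assuming the same config as the original post but with a fixed "lr". `build_run_kwargs` and `launch` are hypothetical helper names, and "CartPole-v0" is swapped in for SimpleCorridor only to keep the example short.

```python
def build_run_kwargs(config, stop):
    """Build tune.run keyword arguments for two identical, independent trials."""
    fixed = dict(config, lr=1e-4)  # plain value instead of tune.grid_search
    return {"config": fixed, "stop": stop, "num_samples": 2, "verbose": 1}


def launch():
    """Start the experiment; requires Ray with RLlib installed."""
    import ray
    from ray import tune

    ray.init()
    config = {
        "env": "CartPole-v0",  # assumption: stand-in for SimpleCorridor
        "num_gpus": 1,
        "num_workers": 2,
        "num_envs_per_worker": 1,
        "framework": "torch",
    }
    stop = {"training_iteration": 10}
    return tune.run("PPO", **build_run_kwargs(config, stop))
```

Calling `launch()` on the 2 GPU machine should start two trials, each requesting one GPU, the same as the grid_search version but without the duplicated learning rate.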

vakker00 · April 13, 2021, 11:35am

I installed Ray from the latest build and I couldn’t reproduce the issue any more. I’m not sure what caused it.
