Using specific GPUs in a shared machine

At work we share machines with GPUs. A colleague needs 2 of the 4 GPUs and I need the other 2: she uses 0 and 1, and I use 2 and 3.

I read here (using Ray with GPUs) that “Ray will automatically set the CUDA_VISIBLE_DEVICES environment variable.”

I don’t like this behavior. We usually run scripts that set the devices, so she has 0,1 and I use 2,3.

Is there a way to solve this?

Does this work for your case?

Thank you. Yes ideally this should work. But:

  1. The Ray Tune docs say that the variable is reset and changed.
  2. If I write gpus=2, it always uses 0 and 1, never 2 and 3, for example.

So I was wondering if Ray Tune has the ability to let you say something like use_only=[2,3].

I’ll try to make it work with the environment variable in the meantime.

Hey @lesolorzanov
I took another look at our documentation and code. I believe a user can set CUDA_VISIBLE_DEVICES through an environment variable. If it is set, Ray will respect it, in the sense that only the GPUs in that list will be returned.
See worker.py - ray-project/ray - Sourcegraph
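
For example, something like this at the very top of your driver script (a rough sketch; the device IDs and the num_gpus value are just for your 2-GPU case):

import os

# Make only GPUs 2 and 3 visible before Ray (or anything CUDA-related) starts.
# Workers launched from this driver will only see these two devices,
# and inside the process they are re-indexed as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

import ray
ray.init(num_gpus=2)  # Ray will only hand out the visible devices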


Please give it a try and let me know if you run into any issues.

Thank you, yes, indeed I can set it at the beginning. I use:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"

in my Python code.

Edit: never mind, it seems that if I don’t specify cpu in resources_per_trial it works now; it is placing one trial per GPU.

Thank you.

The problem I have now is that it is trying to put all the trials on the same GPU at the same time, and I am getting a CUDA out-of-memory error.

I have 16 CPU cores and 4 GPUs in the machine, but I am only allowed to use GPUs 0, 1, and 3.

My ideal scenario: say I have 3 trials. My hope would be that when I call tune.run, each of those 3 trials starts at the same time on the 3 GPUs. But it’s just not working.

My tune.run looks like this:


tune.run(
    trainer,
    name=name,
    config=config,
    stop={'stop_flag': 0.95,
          'training_iteration': 5},
    local_dir=path_out_base,
    checkpoint_at_end=False,
    resources_per_trial={'cpu': 5, 'gpu': 1},
    num_samples=1)

Hi @lesolorzanov, that should work, though (we use it all the time in our end-to-end testing). Just out of curiosity, are you using a grid search? Because otherwise num_samples=1 means you’re only going to start 1 sample.
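
If you’re not grid searching, something along these lines should start three concurrent trials, one per visible GPU (a sketch based on your call above, keeping only the relevant arguments; 3 trials at 5 CPUs each also fits within your 16 cores):

tune.run(
    trainer,
    config=config,
    resources_per_trial={'cpu': 5, 'gpu': 1},  # each trial claims one GPU
    num_samples=3)  # 3 samples -> 3 trials running in parallel on GPUs 0, 1, 3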

If that’s the case, it seems that it’s still scheduling on the wrong GPU. Can you confirm with nvidia-smi exactly where the trials are scheduled?