Using specific GPUs in a shared machine

At work we share machines with GPUs. A colleague needs 2 of the 4 GPUs and I need the other 2: she uses 0 and 1, and I use 2 and 3.

I read here (using Ray with GPUs) that “Ray will automatically set the CUDA_VISIBLE_DEVICES environment variable.”

I don’t like this behavior. We usually run scripts that set the devices, so she has 0,1 and I use 2,3.

Is there a way to solve this?

Does this work for your case?

Thank you. Yes ideally this should work. But:

  1. The Ray Tune docs say that the variable is reset and changed.
  2. If I write gpus=2, it always uses 0 and 1, never 2 and 3, for example.

So I was wondering if Ray Tune has the ability to let you say something like use_only=[2,3].

I’ll try to make it work with the environment variable in the meantime.

Hey @lesolorzanov
I took another look at our documentation and code. I believe a user can set CUDA_VISIBLE_DEVICES through an environment variable. If it is set, Ray will respect it, in the sense that only the GPUs in that list will be returned.
See worker.py - ray-project/ray - Sourcegraph
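
For example, something like this at the very top of your driver script (a rough sketch; the device IDs and the num_gpus value are just for your 2-GPU case):

import os

# Make only GPUs 2 and 3 visible before Ray (or anything CUDA-related) starts.
# Workers launched from this driver will only see these two devices,
# and inside the process they are re-indexed as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

import ray
ray.init(num_gpus=2)  # Ray will only hand out the visible devices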


Please give it a try and let me know if you run into any issues.

Thank you, yes, indeed I can set it at the beginning. I use:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"

in my Python code.

Edit: never mind, it seems that if I don’t specify cpu in resources_per_trial it works now; it is placing one trial per GPU.

Thank you.

The problem I have now is that it is trying to put all the trials on the same GPU at the same time, and I am getting a CUDA out-of-memory error.

I have 16 CPU cores and 4 GPUs in the machine, but I am only allowed to use GPUs 0, 1, and 3.

My ideal scenario: say I have 3 trials. My hope would be that when I call tune.run, each of those 3 trials starts at the same time on the 3 GPUs. But it’s just not working.

My tune.run looks like this:


tune.run(
    trainer,
    name=name,
    config=config,
    stop={'stop_flag': 0.95,
          'training_iteration': 5},
    local_dir=path_out_base,
    checkpoint_at_end=False,
    resources_per_trial={'cpu': 5, 'gpu': 1},
    num_samples=1)

Hi @lesolorzanov, that should work, though (we use it all the time in our end-to-end testing). Just out of curiosity, are you using a grid search? Because otherwise num_samples=1 means you’re only going to start 1 sample.
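
If you’re not grid searching, something along these lines should start three concurrent trials, one per visible GPU (a sketch based on your call above, keeping only the relevant arguments; 3 trials at 5 CPUs each also fits within your 16 cores):

tune.run(
    trainer,
    config=config,
    resources_per_trial={'cpu': 5, 'gpu': 1},  # each trial claims one GPU
    num_samples=3)  # 3 samples -> 3 trials running in parallel on GPUs 0, 1, 3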

If that’s the case, it seems that it’s still scheduling on the wrong GPU. Can you confirm with nvidia-smi exactly where the trials are scheduled?