At work we share machines that have GPUs. A colleague needs to use 2 of the 4 GPUs and I need the other 2: she uses 0 and 1, and I use 2 and 3.
I read here (using ray with GPU) that “Ray will automatically set the CUDA_VISIBLE_DEVICES environment variable.”
I don’t like this behavior. We usually run scripts that set the devices, so she gets 0 and 1 and I use 2 and 3.
Hey @lesolorzanov
I took another look at our documentation and code. I believe the user can set CUDA_VISIBLE_DEVICES through an environment variable, and if that is set, Ray will respect it in the sense that only the GPUs in that list will be returned.
See worker.py - ray-project/ray - Sourcegraph
Thank you, yes, indeed I can set it at the beginning. I use:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"
in my Python code.
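For completeness, a minimal sketch of the full pattern (the GPU IDs are just the ones from above; the key point is that the variable has to be set before Ray is initialized):

import os

# Must be set before ray.init()/tune.run() so Ray only discovers these devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"

import ray

ray.init()
print(ray.cluster_resources())  # should now report 3 GPUs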
Edit: never mind, it seems that if I don’t specify a CPU count in the resources per trial it works now; it is placing one trial per GPU.
Thank you.
The problem I have now is that it is trying to put all the trials on the same GPU at the same time, and I am getting a CUDA out-of-memory error.
I have 16 CPU cores and 4 GPUs in the computer, but I am only allowed to use GPUs 0, 1 and 3.
My ideal scenario: say I have 3 trials. My hope would be that when I call tune.run, each of those 3 trials would start at the same time, one on each of the 3 GPUs. But it’s just not working.
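Roughly what I am trying, as a sketch (train_fn stands in for my real training function):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"  # the three GPUs I am allowed to use

import ray
from ray import tune

def train_fn(config):
    # placeholder: the real training code goes here
    pass

ray.init()
tune.run(
    train_fn,
    num_samples=3,                   # three trials
    resources_per_trial={"gpu": 1},  # one GPU per trial, so all 3 should run in parallel
)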
Hi @lesolorzanov, that should work, though (we use it all the time in our end-to-end testing). Just out of curiosity, are you using a grid search? Because otherwise num_samples=1 means you’re only going to start 1 sample.
If that’s the case, it seems that it’s still scheduling on the wrong GPU. Can you confirm with nvidia-smi where exactly the trials are scheduled?
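To illustrate the grid-search point, a minimal sketch (the trainable and the search values are placeholders): with a plain search space and num_samples=1 only a single trial is created, while a tune.grid_search over three values yields three trials:

from ray import tune

def train_fn(config):
    pass  # placeholder trainable

tune.run(
    train_fn,
    config={"lr": tune.grid_search([1e-2, 1e-3, 1e-4])},  # 3 grid points -> 3 trials
    num_samples=1,
    resources_per_trial={"gpu": 1},
)
# While the trials run, nvidia-smi should show one process per visible GPU.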