How to obtain GPU Isolation with TorchTrainer on a multi-GPU node?

1. Severity of the issue: (select one)
  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.48
  • Python version: 3.12
  • OS: Linux
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: With a DDP strategy and TorchTrainer on a multi-GPU node, each worker sees all GPUs on the node, and CUDA_VISIBLE_DEVICES is set to all of the node’s GPUs for every worker.

  • Actual: I want better isolation: each worker’s CUDA_VISIBLE_DEVICES should be set to only the specific GPU that worker is intended to use.
    The reason: this is the only reliable way I have found to make vLLM use a specific GPU (I use vLLM in RL training), and setting CUDA_VISIBLE_DEVICES later in the training code has no effect, since the CUDA context has already been initialized at that point. The sketch below illustrates the timing problem.
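
    A minimal sketch of the situation, assuming a 4-GPU node; the worker count and the GPU indices are only examples:

```python
import os

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # By the time this runs, Ray Train has already set CUDA_VISIBLE_DEVICES
    # for this worker to all GPUs reserved on the node, e.g. "0,1,2,3".
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

    # Overriding it here does not help vLLM, because the CUDA context
    # (torch / NCCL) may already have been initialized at this point.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
```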

    I tried various approaches, such as setting CUDA_VISIBLE_DEVICES from a worker_process_setup_hook during worker startup (sketched below) and using specific placement strategies, but all of them failed. I feel like there should be an easy solution to this and that I am missing some detail.
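
    For reference, the worker_process_setup_hook attempt looked roughly like this (a simplified sketch; the hook body is illustrative, and as mentioned it did not give the isolation I am after):

```python
import os

import ray


def _pin_single_gpu():
    # Illustrative attempt: restrict this worker process to one GPU before
    # any CUDA initialization happens. In practice each Train worker still
    # ended up seeing all GPUs on the node, so this did not work for me.
    gpu_ids = ray.get_gpu_ids()
    if gpu_ids:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[0])


ray.init(runtime_env={"worker_process_setup_hook": _pin_single_gpu})
```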

    Setting resources_per_worker in ScalingConfig also does not produce the intended behaviour (see the sketch below): resources_per_worker = {"GPU": 1, "CPU": self.num_cpus_per_worker}
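
    Concretely, the ScalingConfig looked roughly like this (num_workers is illustrative, and num_cpus_per_worker stands in for self.num_cpus_per_worker from my own config object):

```python
from ray.train import ScalingConfig

num_cpus_per_worker = 4  # placeholder for self.num_cpus_per_worker

scaling_config = ScalingConfig(
    num_workers=2,  # one Train worker per GPU I want to use
    use_gpu=True,
    resources_per_worker={"GPU": 1, "CPU": num_cpus_per_worker},
)
```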

Thanks in advance.

@RunLLM

Your message appears to be a template for reporting Ray issues, not a specific question or problem. To help you, please provide details for each section: Ray version, Python version, OS, cloud/infrastructure, and a description of what happened versus what you expected.

If you want to report a bug or get help, it’s best to fill out the template as shown in the official Ray bug report guidelines. Would you like more detail on how to structure your report or where to submit it?

