How to obtain GPU Isolation with TorchTrainer on a multi-GPU node?

1. Severity of the issue: (select one)
  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.48
  • Python version: 3.12
  • OS: Linux
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: With a DDP strategy and TorchTrainer on a multi-GPU node, each worker sees all GPUs on the node, and CUDA_VISIBLE_DEVICES is set to all of the node’s GPUs for every worker.

  • Actual: I want better isolation: each worker’s CUDA_VISIBLE_DEVICES should be set to only the specific GPU that worker is intended to use.
    The reason: this is the only reliable way I have found to make vLLM use a specific GPU (I use vLLM in RL training), and setting CUDA_VISIBLE_DEVICES later in the training code has no effect, since the CUDA context has already been initialized at that point. The sketch below illustrates the timing problem.
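
    A minimal sketch of the situation, assuming a 4-GPU node; the worker count and the GPU indices are only examples:

```python
import os

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # By the time this runs, Ray Train has already set CUDA_VISIBLE_DEVICES
    # for this worker to all GPUs reserved on the node, e.g. "0,1,2,3".
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

    # Overriding it here does not help vLLM, because the CUDA context
    # (torch / NCCL) may already have been initialized at this point.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
```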

    I tried various approaches, such as setting CUDA_VISIBLE_DEVICES from a worker_process_setup_hook during worker startup (sketched below) and using specific placement strategies, but all of them failed. I feel like there should be an easy solution to this and that I am missing some detail.
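
    For reference, the worker_process_setup_hook attempt looked roughly like this (a simplified sketch; the hook body is illustrative, and as mentioned it did not give the isolation I am after):

```python
import os

import ray


def _pin_single_gpu():
    # Illustrative attempt: restrict this worker process to one GPU before
    # any CUDA initialization happens. In practice each Train worker still
    # ended up seeing all GPUs on the node, so this did not work for me.
    gpu_ids = ray.get_gpu_ids()
    if gpu_ids:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[0])


ray.init(runtime_env={"worker_process_setup_hook": _pin_single_gpu})
```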

    Setting resources_per_worker in ScalingConfig also does not produce the intended behaviour (see the sketch below): resources_per_worker = {"GPU": 1, "CPU": self.num_cpus_per_worker}
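
    Concretely, the ScalingConfig looked roughly like this (num_workers is illustrative, and num_cpus_per_worker stands in for self.num_cpus_per_worker from my own config object):

```python
from ray.train import ScalingConfig

num_cpus_per_worker = 4  # placeholder for self.num_cpus_per_worker

scaling_config = ScalingConfig(
    num_workers=2,  # one Train worker per GPU I want to use
    use_gpu=True,
    resources_per_worker={"GPU": 1, "CPU": num_cpus_per_worker},
)
```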

Thanks in advance.

@RunLLM

Your message appears to be a template for reporting Ray issues, not a specific question or problem. To help you, please provide details for each section: Ray version, Python version, OS, cloud/infrastructure, and a description of what happened versus what you expected.

If you want to report a bug or get help, it’s best to fill out the template as shown in the official Ray bug report guidelines. Would you like more detail on how to structure your report or where to submit it?

