1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.48
- Python version: 3.12
- OS: Linux
- Cloud/Infrastructure:
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Current behavior: With a DDP strategy and `TorchTrainer` on a multi-GPU setup, each worker sees all GPUs on the node, and `CUDA_VISIBLE_DEVICES` is set to all the GPUs on that node.
- Desired behavior: better isolation, i.e. `CUDA_VISIBLE_DEVICES` for each worker is set to only the specific GPU that worker is intended to use.

The reason: this is the only reliable way I have found to make vLLM use a specific GPU (I am using vLLM in RL training), and setting `CUDA_VISIBLE_DEVICES` later in the training code has no effect because the CUDA context has already been initialized by that point.

I tried various approaches, such as using `worker_process_setup_hook` to set `CUDA_VISIBLE_DEVICES` during worker setup (see the first sketch below) and using specific placement strategies, but all of them failed. I feel like there should be an easy solution to this and that I may be missing some detail.

Setting `resources_per_worker` in `ScalingConfig` to `{"GPU": 1, "CPU": self.num_cpus_per_worker}` also does not result in the intended behaviour (see the second sketch below).
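
First sketch: roughly what the `worker_process_setup_hook` attempt looked like (simplified; the hook name, `num_workers` value, and the print-only training function are placeholders for my actual setup, and this did not achieve the isolation I want):

```python
import os

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def pin_gpu_setup_hook():
    # Attempt: restrict this worker process to the GPU(s) Ray reserved for it
    # before any CUDA context is created.
    gpu_ids = ray.get_gpu_ids()
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)


ray.init(runtime_env={"worker_process_setup_hook": pin_gpu_setup_hook})


def train_loop_per_worker(config):
    # Still shows all GPUs of the node, not just the one this worker should use.
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```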
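
Second sketch: the `resources_per_worker` attempt (again simplified; `num_workers` and the CPU count are placeholders for the values from my setup):

```python
import os

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Wanted: only the single GPU reserved for this worker.
    # Observed: all GPUs on the node are still listed.
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))


num_cpus_per_worker = 8  # placeholder for self.num_cpus_per_worker

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=4,  # e.g. one worker per GPU on the node
        use_gpu=True,
        resources_per_worker={"GPU": 1, "CPU": num_cpus_per_worker},
    ),
)
trainer.fit()
```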
Thanks in advance.