Running RLlib on a single GPU on Windows: Is torch.distributed required?

I have code using RLlib’s PPO algorithm that runs fine on the CPU. I have a Windows computer with a single CUDA-capable card. I have installed PyTorch with CUDA support, torch.cuda.is_available() yields True, and my graphics card seems to be detected by PyTorch. I set num_gpus_per_learner=1 for the PPO algorithm, which makes the algorithm run on the GPU.

The problem now is that I get the error “RuntimeError: Distributed torch is not available”, and indeed torch.distributed.is_available() yields False. If I understand the error correctly, I would need a distribution backend. As NCCL is not available on Windows, I should probably go with Gloo. Now here are my questions:

  1. Is there a way to use GPU support for a single GPU in RLlib without needing any distribution backend? (If I understand things correctly, distributed torch should not be needed for a single GPU, right?)
  2. If I need a distribution backend, how can I get Gloo running on Windows? I have not found any information on how to install PyTorch with Gloo support other than building it myself. Are there any pre-built binaries, or do I have to build it on my own? I installed the version supporting CUDA 11.8 and it does not seem to include Gloo.
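For reference, the checks described above can be reproduced with a few lines of plain PyTorch (nothing here is RLlib-specific):

```python
import torch

# Is a CUDA device visible to PyTorch?
print(torch.cuda.is_available())

# Was PyTorch built with torch.distributed support?
# (This is what yields False here and triggers the RuntimeError.)
print(torch.distributed.is_available())
```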

Thanks!

Hello Wizard! Welcome to the Ray community :slight_smile:

To answer your questions: yes, you can use GPU support for a single GPU in RLlib without needing a distribution backend. The error you’re encountering usually arises when distributed data parallelism is being set up, which isn’t necessary for single-GPU usage. Make sure your configuration doesn’t accidentally enable distributed features. For single-GPU usage, setting num_gpus_per_learner=1 should suffice, and you should not need to configure torch.distributed.
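A minimal single-GPU config might look like this. This is a sketch assuming a recent Ray version with the new API stack; the environment name is just an illustration, and the `.learners()` call is the relevant part:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch of a single-GPU PPO setup on the new API stack.
# num_gpus_per_learner=1 requests one GPU for the Learner;
# no torch.distributed backend should be needed for this.
config = (
    PPOConfig()
    .environment("CartPole-v1")  # illustrative env
    .learners(num_gpus_per_learner=1)
)
algo = config.build()
```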

More reading about this in the docs:

As for Gloo, I am not entirely sure, as I’m not familiar with it myself. There is a pygloo extension in the Ray project, though, that you might be able to check out: GitHub - ray-project/pygloo: Pygloo provides Python bindings for Gloo.

Thanks for the reply. It’s good to hear that I am on the right track by not using any distribution backend. I had set num_gpus_per_learner=1 but also num_learners=1.

num_learners=1 seems to trigger the distribution backend initialization, which I do not want. So according to the documentation I should set num_learners=0. Unfortunately, when doing so, nothing is executed on the GPU and everything runs on the CPU.

I can verify that by just looking at the GPU load, which is non-existent during training. But also, none of my tensors in my custom RLModule subclass are on the GPU; they are all on the CPU. I would have expected RLlib to transfer the module and the tensors to the GPU automatically. Could someone point me in the right direction on how to get RLlib to use the GPU instead of the CPU for training?

Note that I am using the new API stack. I read somewhere that the new API stack might not transfer the model and tensors to the GPU, but there should be a way to make the new API stack work with a single GPU without a distribution backend, right?

It appears you’re experiencing a problem with distributed training in RLlib and GPU support under Windows. In a single-GPU environment, you shouldn’t need a distributed backend such as NCCL or Gloo. The error you’re observing could be due to RLlib anticipating a distributed setup when dealing with GPUs, even though you’re actually only using a single GPU. Normally, you can use RLlib’s GPU capabilities without distributed backends by making sure your environment is properly configured for single-GPU usage. If RLlib is demanding a distributed configuration, you may have to set it up explicitly for execution on a single GPU without invoking the distributed requirements. Regarding Gloo support: since it’s not included in the pre-compiled PyTorch binaries on Windows, you would probably need to build PyTorch from source to get Gloo support for distributed configurations. That said, in the single-GPU case you can likely avoid distributed dependencies altogether and concentrate on getting the PPO algorithm properly set up to execute on your GPU without multi-node or multi-GPU support.

Hi @Wizard , I’m assuming you’ve already checked out the migration guide here: New API stack migration guide — Ray 2.44.1

I did some research and I’m beginning to think this might be more of a PyTorch <> RLlib interaction that’s making this error pop up, although I probably have to do more reading to fully understand how PyTorch device management works with the new RLlib API.

As for using the GPU, maybe you can try this: manually ensure that your models and tensors are moved to the GPU using PyTorch’s .to(device) method, with device set to torch.device("cuda") or similar; maybe that will force the GPU. I think you will generally need to handle this yourself in your custom RLModule subclass, since the new API might not know how to transfer it automatically (i.e. model.to(device) and data = data.to(device)). Have you tried that yet?
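As a sketch of that pattern in plain PyTorch (the model and tensor names here are illustrative, not RLlib API):

```python
import torch
import torch.nn as nn

# Pick the GPU when available, fall back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)    # move the parameters
batch = torch.randn(8, 4).to(device)  # move the input data
out = model(batch)
print(out.device.type)  # "cuda" on a GPU machine, "cpu" otherwise
```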

Also, just to verify: did your code work fine on the old API stack?

Stuff I read / Docs:

Thanks for getting back to me. It took me a while to find time for the project and check out an older version I had running on the old API stack.

Long story short: the old API stack version seems to work by just setting num_gpus=1. In my model I can verify that the data is then indeed processed on the GPU.

For the new API stack: the data never gets transferred to the GPU, so you have a point that the new API stack does not do that, for some reason. I suspected that before. But shouldn’t there be a way to get the new API stack to do that automatically, without me adjusting my model to transfer the data to the GPU manually? So here are my questions:

  1. Is there a procedure to make the new API stack transfer my data to the GPU automatically, if I set the config correctly? (So far I thought I was missing some setting in the config.)
  2. If not, this is a bug then, right?
  3. If there is no way to do this automatically, where would be the right place to transfer my data to the GPU? (Probably the _forward method in RLModule, right?)
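If the transfer does have to be done manually, a small helper like the following could be called at the top of _forward. This is a sketch: batch_to_device is a made-up name, and it assumes the batch is a (possibly nested) dict of tensors:

```python
import torch

def batch_to_device(batch, device):
    """Recursively move all tensors in a (possibly nested) dict to `device`."""
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, dict):
        return {k: batch_to_device(v, device) for k, v in batch.items()}
    return batch  # leave non-tensor leaves (e.g. strings, ints) untouched

# Hypothetical use at the top of a custom RLModule's _forward():
#     device = next(self.parameters()).device
#     batch = batch_to_device(batch, device)
```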

Thanks!