GPU Detected but Not Utilized in Ray RLlib with PPO

High: It blocks me from completing my task.

Hello, I’ve been working with Ray RLlib for my latest project, specifically with the PPO, DQN, and SAC algorithms. My setup includes a CUDA-capable GPU, which is correctly recognized by both PyTorch and Ray. However, I’ve run into an issue where the GPU is detected but not utilized during training: nvidia-smi shows essentially no GPU memory usage and 0% compute utilization.

Relevant code part:

import ray
import torch

from ray.rllib.algorithms.ppo import PPOConfig

use_gpu = torch.cuda.is_available()
if use_gpu:
    print(f"CUDA is available. Number of GPUs: {torch.cuda.device_count()}")
    print("GPU Name:", torch.cuda.get_device_name(0))
    ray.init(num_gpus=1)
else:
    print("CUDA is not available. Running on CPU.")
    ray.init()

if algorithm_name == "PPO":
    algo = (
        PPOConfig()
        .training(train_batch_size=train_batch_size_input, sgd_minibatch_size=sgd_minibatch_size,
                  num_sgd_iter=num_sgd_iter, clip_param=0.2)
        .rollouts(num_rollout_workers=1)
        # num_gpus is the GPU share requested for the Algorithm (learner) process.
        .resources(num_gpus=1 if use_gpu else 0)
        .framework("torch")  # or .framework("tf") for TensorFlow
        .callbacks(callback_factory)
        .environment(env="bertrand", env_config=environment_config)
        .multi_agent(
            policies=["agent0", "agent1"],
            policy_mapping_fn=(lambda agent_id, *args, **kwargs: agent_id))
        .build()
    )

And the console output:

CUDA is available. Number of GPUs: 1
GPU Name: NVIDIA GeForce RTX 4090

GPU usage during the entire run, as reported by nvidia-smi:

+--------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08    Driver Version: 545.23.08    CUDA Version: 12.3               |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                  Off |
|  0%   42C    P8              11W / 450W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+--------------------------------------------------------------------------------------+
| Processes:                                                                          |
|  GPU   GI   CI        PID   Type   Process name                          GPU Memory |
|        ID   ID                                                           Usage      |
|=====================================================================================|
|  No running processes found                                                         |
+--------------------------------------------------------------------------------------+

Packages:

  • Python 3.10
  • ray 2.6.1
  • CUDA 12.3
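
For completeness, Ray's own view of the resources can be confirmed right after initialization. The following is a minimal standalone sketch (ray.cluster_resources() and ray.available_resources() are standard Ray APIs; both should list a GPU entry on this machine):

import ray

# Mirrors the ray.init(num_gpus=1) call from the script above.
ray.init(num_gpus=1)

# Both dicts should contain a "GPU" key; the available count drops as Ray actors
# (e.g. rollout workers with num_gpus_per_worker > 0) claim fractions of it.
print(ray.cluster_resources())
print(ray.available_resources())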

Try experimenting with the parameters shown in the examples below. Recall that RolloutWorkers (old API stack) and Learners (new API stack) need to be configured with the resources they consume, in terms of (fractional) CPUs and (fractional) GPUs.

Old API stack

.resources(
    num_gpus=args.num_gpus, num_cpus_per_worker=4, num_gpus_per_worker=0.3
)
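
Applied to the PPO configuration from the question (old API stack on ray 2.6.1), that could look roughly like the sketch below. It reuses the variables from the question's snippet (use_gpu, train_batch_size_input, callback_factory, environment_config, etc.); the 0.7/0.3 split between the learner process and the single rollout worker is only an assumption, chosen so that the fractions fit on the one physical GPU:

from ray.rllib.algorithms.ppo import PPOConfig

algo = (
    PPOConfig()
    .training(train_batch_size=train_batch_size_input,
              sgd_minibatch_size=sgd_minibatch_size,
              num_sgd_iter=num_sgd_iter, clip_param=0.2)
    .rollouts(num_rollout_workers=1)
    # The learner process gets most of the GPU, the rollout worker a fraction;
    # set num_gpus_per_worker=0 to keep sampling on the CPU instead.
    .resources(num_gpus=0.7 if use_gpu else 0,
               num_cpus_per_worker=4,
               num_gpus_per_worker=0.3 if use_gpu else 0)
    .framework("torch")
    .callbacks(callback_factory)
    .environment(env="bertrand", env_config=environment_config)
    .multi_agent(
        policies=["agent0", "agent1"],
        policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id)
    .build()
)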

New API stack

.resources(
    num_gpus=args.num_gpus,
)
.learners(
    num_gpus_per_learner=1
    # Cannot set both `num_cpus_per_learner` > 1 and `num_gpus_per_learner` > 0!
    # Either set `num_cpus_per_learner` > 1 (and `num_gpus_per_learner`=0) OR
    #   set `num_gpus_per_learner` > 0 (and leave `num_cpus_per_learner` at its default value of 1).
    # This is due to issues with placement group fragmentation.
    # See https://github.com/ray-project/ray/issues/35409 for more details.
)
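
Whichever stack is used, it helps to verify afterwards where the policy's model actually lives. Below is a minimal sketch for the old-stack setup from the question; it assumes the algo object built above, the torch framework, and the "agent0" policy ID. get_policy() and the model attribute are standard on RLlib torch policies, though exact attributes can differ between versions:

# The device should read "cuda:0" once the GPU settings have taken effect.
policy = algo.get_policy("agent0")
print(next(policy.model.parameters()).device)

# Run a few iterations and watch nvidia-smi in a second terminal: GPU memory and
# utilization should now be non-zero during the learning phase of each iteration.
for _ in range(3):
    result = algo.train()
    print(result["episode_reward_mean"])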