Surprisingly low GPU usage rate in RLlib

Dear community,

I am wondering about surprisingly low GPU usage on my Windows machine for DQN training with RLlib. My interest goes in both directions: on the one hand, training should run fast; on the other hand, RAM consumption must be watched closely.

I am running a config similar to the one below, just swapping in CartPole for my custom environment.

Two strange observations:

  1. While the runtime increases iteration by iteration (e.g., iteration #34 took 23 min while iteration #8 took 5 min), the CPU utilization drops to 34-40% on average in the later iterations.
  2. The GPU utilization stays very low (1%) and only shows occasional peaks.

My first idea to explain the sporadic peaks is that they correspond to the policy update at the end of a rollout batch.
Is this correct?

Code:

from ray.rllib.algorithms.dqn import DQNConfig
from tqdm import tqdm

config = (
    DQNConfig()
    .environment(CustomEnv)  # pass the env class (or a registered env name), not an instance
    .rollouts(num_rollout_workers=3, num_envs_per_worker=1, batch_mode="complete_episodes")
    .framework("torch")
    .experimental(_enable_new_api_stack=False)
    .evaluation(evaluation_num_workers=1, evaluation_interval=1000)
    .resources(num_gpus=1, num_cpus_per_worker=3, num_gpus_per_worker=0.2)
    .debugging(log_level="ERROR")
    .reporting(
        min_sample_timesteps_per_iteration=500
    )  # Ensures that in "progress.csv" the timesteps are not listed separately for each iteration.
    .training(
        hiddens=[],
        dueling=False,
        train_batch_size=train_batch_size,  # defined earlier in the script
        training_intensity=None,  # expects a float or None rather than False
    )
)

algo = config.build()

iteration_num = 50
for iteration in tqdm(range(iteration_num)):
    result = algo.train()
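One way to check whether the peaks line up with the learner updates is to look at the per-iteration timing stats RLlib reports. Below is a minimal sketch, assuming the old API stack exposes "sample_time_ms" and "learn_time_ms" under result["timers"] (the exact key names can vary between Ray versions):

# Hypothetical diagnostic loop: show how much of each iteration is spent
# sampling vs. learning. Timer key names are assumptions and may differ by Ray version.
for iteration in range(iteration_num):
    result = algo.train()
    timers = result.get("timers", {})
    print(
        f"iter {iteration}: "
        f"sample_time_ms={timers.get('sample_time_ms')}, "
        f"learn_time_ms={timers.get('learn_time_ms')}"
    )

If learn_time_ms is only a small fraction of the total iteration time, the GPU sits idle between updates, which would match the 1% average with short peaks.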

Hey @PhilippWillms,

I'll make a few comments and maybe some will stick. I am by no means an expert in hardware utilization, but from my experience on large clusters, GPUs are very efficient and built for exactly the kind of matrix multiplication neural networks need.

With that in mind, if you have a relatively small network (sub-1024 hidden layer size), a low batch size, and a limited number of workers (3), there just isn't a lot for the GPU to do.
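As a rough illustration (plain PyTorch, not RLlib), a small MLP barely keeps a GPU busy compared to a wide one with a big batch; the layer sizes, batch sizes, and step count below are arbitrary placeholders:

import time
import torch
import torch.nn as nn

def time_mlp(hidden, batch, steps=100, in_dim=64, device="cuda"):
    """Time forward+backward passes for a 2-hidden-layer MLP of the given width."""
    model = nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(batch, in_dim, device=device)
    y = torch.randn(batch, 1, device=device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    torch.cuda.synchronize()
    return time.perf_counter() - start

if torch.cuda.is_available():
    print("small network, small batch:", time_mlp(hidden=256, batch=32))
    print("large network, large batch:", time_mlp(hidden=4096, batch=32_000))

Watching nvidia-smi while the two runs execute typically shows the small case at single-digit utilization and the large case close to saturated.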

For instance, I will train with a 256,000 batch size, a 32,000 minibatch size, 30 SGD iterations per minibatch, and [4096, 4096] networks with 2-3 layers, and still have some room on two A100 GPUs. Even with this amount of data, the GPU utilization (viewed in the Ray dashboard) only shows up during the PPO trainer's update phase, and only for a few seconds at a time.
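For reference, an old-API-stack PPO config along the lines of those numbers might look like the sketch below; the environment, worker count, and GPU count are placeholders, not my actual setup:

from ray.rllib.algorithms.ppo import PPOConfig

# Rough sketch of a large-batch PPO run of the kind described above.
# Environment, worker count, and resource numbers are placeholders.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=32)
    .resources(num_gpus=2)
    .training(
        train_batch_size=256_000,
        sgd_minibatch_size=32_000,
        num_sgd_iter=30,
        model={"fcnet_hiddens": [4096, 4096]},
    )
)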

It’s not scientific by any means, just my anecdotal experience. Let me know if you figure it out!

Tyler

Hi @tlaurie99,

Have you tried tuning the SGD iterations per minibatch? I have found that, for the environments and problems I work with, the ideal value tends to be between 5 and 10, and if I go over 10 I get significantly reduced returns.

Also, large values for the SGD iteration count increase the training time per iteration, because the passes over the minibatches must be done serially.
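If it helps, a quick way to compare settings is a small Tune sweep over num_sgd_iter. A sketch is below; the search values, environment, and stop criterion are just examples, and the exact Tuner/RunConfig API surface varies between Ray versions:

from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

# Example sweep over the number of SGD iterations per training batch.
# Environment and stopping criteria are placeholders.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .training(num_sgd_iter=tune.grid_search([5, 10, 15, 30]))
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 50}),
)
results = tuner.fit()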