Surprisingly low GPU usage rate in RLlib

Dear community,

I am wondering about surprisingly low GPU usage on my Windows machine for DQN training with RLlib. My interest goes in both directions: on the one hand, training should run fast; on the other hand, RAM consumption must be watched closely.

I am running a config similar to the one below, just swapping in CartPole for my custom environment.

Two strange observations:

  1. While the runtime increases iteration by iteration (e.g., iteration #34 took 23 min while iteration #8 took 5 min), the CPU utilization drops to 34-40% on average in the later iterations.
  2. The GPU utilization stays very low (1%) and only shows occasional peaks.

My first idea to explain the sporadic peaks is that they correspond to the policy update at the end of a rollout batch.
Is this correct?

Code:

from ray.rllib.algorithms.dqn import DQNConfig
from tqdm import tqdm

config = (
    DQNConfig()
    .environment(CustomEnv)  # pass the env class (or a registered env name), not an instance
    .rollouts(num_rollout_workers=3, num_envs_per_worker=1, batch_mode="complete_episodes")
    .framework("torch")
    .experimental(_enable_new_api_stack=False)
    .evaluation(evaluation_num_workers=1, evaluation_interval=1000)
    .resources(num_gpus=1, num_cpus_per_worker=3, num_gpus_per_worker=0.2)
    .debugging(log_level="ERROR")
    .reporting(
        min_sample_timesteps_per_iteration=500
    )  # Ensures that in "progress.csv" the timesteps are not listed separately for each iteration.
    .training(
        hiddens=[],
        dueling=False,
        train_batch_size=train_batch_size,  # defined earlier in the script
        training_intensity=None,  # expects a float or None rather than False
    )
)

algo = config.build()

iteration_num = 50
for iteration in tqdm(range(iteration_num)):
    result = algo.train()
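One way to check whether the peaks line up with the learner updates is to look at the per-iteration timing stats RLlib reports. Below is a minimal sketch, assuming the old API stack exposes "sample_time_ms" and "learn_time_ms" under result["timers"] (the exact key names can vary between Ray versions):

# Hypothetical diagnostic loop: show how much of each iteration is spent
# sampling vs. learning. Timer key names are assumptions and may differ by Ray version.
for iteration in range(iteration_num):
    result = algo.train()
    timers = result.get("timers", {})
    print(
        f"iter {iteration}: "
        f"sample_time_ms={timers.get('sample_time_ms')}, "
        f"learn_time_ms={timers.get('learn_time_ms')}"
    )

If learn_time_ms is only a small fraction of the total iteration time, the GPU sits idle between updates, which would match the 1% average with short peaks.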

Hey @PhilippWillms,

I'll make a few comments and maybe some will stick. I am by no means an expert in hardware utilization, but from my experience on large clusters, GPUs are very efficient and built for exactly the kind of matrix multiplication neural networks need.

With that in mind, if you have a relatively small network (sub-1024 hidden layer size), a low batch size, and a limited number of workers (3), there just isn't a lot for the GPU to do.
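As a rough illustration (plain PyTorch, not RLlib), a small MLP barely keeps a GPU busy compared to a wide one with a big batch; the layer sizes, batch sizes, and step count below are arbitrary placeholders:

import time
import torch
import torch.nn as nn

def time_mlp(hidden, batch, steps=100, in_dim=64, device="cuda"):
    """Time forward+backward passes for a 2-hidden-layer MLP of the given width."""
    model = nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(batch, in_dim, device=device)
    y = torch.randn(batch, 1, device=device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    torch.cuda.synchronize()
    return time.perf_counter() - start

if torch.cuda.is_available():
    print("small network, small batch:", time_mlp(hidden=256, batch=32))
    print("large network, large batch:", time_mlp(hidden=4096, batch=32_000))

Watching nvidia-smi while the two runs execute typically shows the small case at single-digit utilization and the large case close to saturated.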

For instance, I will train with a 256,000 batch size, a 32,000 minibatch size, 30 SGD iterations per minibatch, and [4096, 4096] networks with 2-3 layers, and still have some room on two A100 GPUs. Even with this amount of data, the GPU utilization (viewed in the Ray dashboard) only shows up during the PPO trainer's update phase, and only for a few seconds at a time.
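For reference, an old-API-stack PPO config along the lines of those numbers might look like the sketch below; the environment, worker count, and GPU count are placeholders, not my actual setup:

from ray.rllib.algorithms.ppo import PPOConfig

# Rough sketch of a large-batch PPO run of the kind described above.
# Environment, worker count, and resource numbers are placeholders.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=32)
    .resources(num_gpus=2)
    .training(
        train_batch_size=256_000,
        sgd_minibatch_size=32_000,
        num_sgd_iter=30,
        model={"fcnet_hiddens": [4096, 4096]},
    )
)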

It’s not scientific by any means, just my anecdotal experience. Let me know if you figure it out!

Tyler

Hi @tlaurie99,

Have you tried tuning the SGD iterations per minibatch? I have found that, for the environments and problems I work with, the ideal value tends to be between 5 and 10, and if I go over 10 I get significantly reduced returns.

Also, large values for the SGD iteration count increase the training time per iteration, because the passes over the minibatches must be done serially.
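If it helps, a quick way to compare settings is a small Tune sweep over num_sgd_iter. A sketch is below; the search values, environment, and stop criterion are just examples, and the exact Tuner/RunConfig API surface varies between Ray versions:

from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

# Example sweep over the number of SGD iterations per training batch.
# Environment and stopping criteria are placeholders.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .training(num_sgd_iter=tune.grid_search([5, 10, 15, 30]))
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 50}),
)
results = tuner.fit()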