Hi everyone,
I am currently working on training an autonomous decision-making satellite constellation with PPO, where an episode runs for up to 86400 simulation steps (1 step per second). Despite using powerful hardware, including 4 NVIDIA A100 GPUs and 11 CPUs, the training process is still very slow. Here is a summary of my setup and configurations:
Software and Library Versions:
- Container base: NVIDIA PyTorch 22.09-py3
- Libraries:
  - numpy==1.23.5
  - gymnasium==0.28.1
  - matplotlib==3.7.1
  - pandas==1.5.3
  - ray==2.10.0
  - ray[tune]==2.10.0
  - typer==0.7.0
  - dm_tree
  - tree
  - scikit-image
  - lz4
  - gputil==1.4.0
  - pyarrow
Environment Configuration:
- num_targets: 10
- num_observers: 10
- time_step: 1 second
- duration: 86400 seconds (24 hours)
Training Configuration:
- batch_mode: "complete_episodes"
- rollout_fragment_length: "auto"
- num_rollout_workers: 10
- num_envs_per_worker: 1
- num_cpus_per_worker: 1
- num_gpus_per_worker: 0
- num_learner_workers: 4
- num_cpus_per_learner_worker: 1
- num_gpus_per_learner_worker: 1
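For context, this is roughly how those settings look as an RLlib PPOConfig. It is a simplified sketch: the env id "FSS_env-v0" is my custom environment (registration code omitted), and the env_config keys are specific to my project.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Simplified sketch of the PPO configuration described above.
# "FSS_env-v0" is my custom gymnasium environment (registration omitted);
# the env_config keys are specific to my project.
config = (
    PPOConfig()
    .environment(
        env="FSS_env-v0",
        env_config={
            "num_targets": 10,
            "num_observers": 10,
            "time_step": 1,     # seconds per simulation step
            "duration": 86400,  # 24 hours -> up to 86400 steps per episode
        },
    )
    .rollouts(
        batch_mode="complete_episodes",
        rollout_fragment_length="auto",
        num_rollout_workers=10,
        num_envs_per_worker=1,
    )
    .resources(
        num_cpus_per_worker=1,
        num_gpus_per_worker=0,
        num_learner_workers=4,
        num_cpus_per_learner_worker=1,
        num_gpus_per_learner_worker=1,
    )
)
```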
Search Space Configuration:
- fcnet_hiddens: [[64, 64], [128, 128], [256, 256], [64, 64, 64]]
- num_sgd_iter: [10, 30, 50]
- lr: [1e-5, 1e-3]
- gamma: [0.9, 0.99]
- lambda: [0.9, 1.0]
- train_batch_size: [512, 1024, 2048, 4096]
- sgd_minibatch_size: [32, 64, 128, 512]
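Expressed as a Ray Tune search space, this looks roughly like the sketch below (I am treating lr, gamma, and lambda as continuous ranges, which matches the sampled values in the trial table further down, and the remaining parameters as discrete choices):

```python
from ray import tune

# Sketch of the search space. lr, gamma, and lambda are sampled from
# continuous ranges; the remaining parameters are discrete choices.
param_space = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "gamma": tune.uniform(0.9, 0.99),
    "lambda": tune.uniform(0.9, 1.0),
    "num_sgd_iter": tune.choice([10, 30, 50]),
    "train_batch_size": tune.choice([512, 1024, 2048, 4096]),
    "sgd_minibatch_size": tune.choice([32, 64, 128, 512]),
    "model": {
        "fcnet_hiddens": tune.choice(
            [[64, 64], [128, 128], [256, 256], [64, 64, 64]]
        ),
    },
}
```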
The training process is extremely slow: a single iteration does not complete within 1 hour, so a hyperparameter search with 30 samples and 20 iterations each will take over a month. I expected faster training with 4 A100 GPUs. Here are a few things I tried and observed:
- Most of the time is spent on training, not on environment simulation. With 10 rollout workers, 1 env per worker, and 1 CPU per worker, simulating the environments takes around 5 minutes.
- I tried truncated episodes, but if the simulation does not run to completion, the episode does not receive any reward (it would in reality; that is just how my configuration works), so picking the best sample that way is not feasible.
- Increasing the number of CPUs per worker to 7 (70 CPUs in total) actually slowed down the process, taking around 15 minutes for the simulations.
- I was able to train on my MacBook Pro M2 (around 20 iterations) and also on an NVIDIA Jetson AGX Orin, although each iteration took around 90 minutes and used smaller batch sizes and adapted parameters. I would therefore expect faster training on the new hardware.
- I do not fully understand this part of the trial status output below: 0.0/1.0 accelerator_type:A100
Here is an example of the trial status output:
Trial status: 1 RUNNING | 3 PENDING
Current time: 2024-06-25 10:33:42. Total running time: 1hr 1min 6s
Logical resource usage: 11.0/80 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:A100)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status gamma lr train_batch_size sgd_minibatch_size num_sgd_iter lambda model/fcnet_hiddens │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_FSS_env-v0_17a5c_00000 RUNNING 0.928133 1.38519e-05 2048 128 30 0.905579 [128, 128] │
│ PPO_FSS_env-v0_17a5c_00001 PENDING 0.972168 8.7419e-05 1024 128 10 0.957066 [256, 256] │
│ PPO_FSS_env-v0_17a5c_00002 PENDING 0.906353 2.12536e-05 2048 512 50 0.968513 [64, 64] │
│ PPO_FSS_env-v0_17a5c_00003 PENDING 0.948159 0.000202029 512 64 30 0.997584 [64, 64] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Any suggestions or insights on further optimizing the training process would be greatly appreciated!