Hi everyone,
I am currently working on training an autonomous decision-making satellite constellation with PPO, where an episode runs for up to 86400 simulation steps (1 step per second). Despite using powerful hardware, including 4 NVIDIA A100 GPUs and 11 CPUs, the training process is still very slow. Here is a summary of my setup and configurations:
Software and Library Versions:
- Container base: NVIDIA PyTorch 22.09-py3
- Libraries:
  - numpy==1.23.5
  - gymnasium==0.28.1
  - matplotlib==3.7.1
  - pandas==1.5.3
  - ray==2.10.0
  - ray[tune]==2.10.0
  - typer==0.7.0
  - dm_tree
  - tree
  - scikit-image
  - lz4
  - gputil==1.4.0
  - pyarrow
Environment Configuration:
- num_targets: 10
- num_observers: 10
- time_step: 1 second
- duration: 86400 seconds (24 hours)
Training Configuration:
- batch_mode: "complete_episodes"
- rollout_fragment_length: "auto"
- num_rollout_workers: 10
- num_envs_per_worker: 1
- num_cpus_per_worker: 1
- num_gpus_per_worker: 0
- num_learner_workers: 4
- num_cpus_per_learner_worker: 1
- num_gpus_per_learner_worker: 1
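For context, this is roughly how those settings look as an RLlib PPOConfig. It is a simplified sketch: the env id "FSS_env-v0" is my custom environment (registration code omitted), and the env_config keys are specific to my project.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Simplified sketch of the PPO configuration described above.
# "FSS_env-v0" is my custom gymnasium environment (registration omitted);
# the env_config keys are specific to my project.
config = (
    PPOConfig()
    .environment(
        env="FSS_env-v0",
        env_config={
            "num_targets": 10,
            "num_observers": 10,
            "time_step": 1,     # seconds per simulation step
            "duration": 86400,  # 24 hours -> up to 86400 steps per episode
        },
    )
    .rollouts(
        batch_mode="complete_episodes",
        rollout_fragment_length="auto",
        num_rollout_workers=10,
        num_envs_per_worker=1,
    )
    .resources(
        num_cpus_per_worker=1,
        num_gpus_per_worker=0,
        num_learner_workers=4,
        num_cpus_per_learner_worker=1,
        num_gpus_per_learner_worker=1,
    )
)
```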
Search Space Configuration:
- fcnet_hiddens: [[64, 64], [128, 128], [256, 256], [64, 64, 64]]
- num_sgd_iter: [10, 30, 50]
- lr: [1e-5, 1e-3]
- gamma: [0.9, 0.99]
- lambda: [0.9, 1.0]
- train_batch_size: [512, 1024, 2048, 4096]
- sgd_minibatch_size: [32, 64, 128, 512]
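Expressed as a Ray Tune search space, this looks roughly like the sketch below (I am treating lr, gamma, and lambda as continuous ranges, which matches the sampled values in the trial table further down, and the remaining parameters as discrete choices):

```python
from ray import tune

# Sketch of the search space. lr, gamma, and lambda are sampled from
# continuous ranges; the remaining parameters are discrete choices.
param_space = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "gamma": tune.uniform(0.9, 0.99),
    "lambda": tune.uniform(0.9, 1.0),
    "num_sgd_iter": tune.choice([10, 30, 50]),
    "train_batch_size": tune.choice([512, 1024, 2048, 4096]),
    "sgd_minibatch_size": tune.choice([32, 64, 128, 512]),
    "model": {
        "fcnet_hiddens": tune.choice(
            [[64, 64], [128, 128], [256, 256], [64, 64, 64]]
        ),
    },
}
```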
The training process is extremely slow: a single iteration does not complete within 1 hour, so a hyperparameter search with 30 samples and 20 iterations each will take over a month. I expected faster training with 4 A100 GPUs. Here are a few things I tried and observed:
- Most of the time is spent on training, not on environment simulation. With 10 rollout workers, 1 env per worker, and 1 CPU per worker, simulating the environments takes around 5 minutes.
- I tried truncated episodes, but if the simulation does not run to completion, the episode does not receive any reward (it would in reality; that is just how my configuration works), so picking the best sample that way is not feasible.
- Increasing the number of CPUs per worker to 7 (70 CPUs in total) actually slowed down the process, taking around 15 minutes for the simulations.
- I was able to train on my MacBook Pro M2 (around 20 iterations) and also on an NVIDIA Jetson AGX Orin, although each iteration took around 90 minutes and used smaller batch sizes and adapted parameters. I would therefore expect faster training on the new hardware.
- I do not fully understand this part of the trial status output below: 0.0/1.0 accelerator_type:A100
Here is an example of the trial status output:
Trial status: 1 RUNNING | 3 PENDING
Current time: 2024-06-25 10:33:42. Total running time: 1hr 1min 6s
Logical resource usage: 11.0/80 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:A100)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status gamma lr train_batch_size sgd_minibatch_size num_sgd_iter lambda model/fcnet_hiddens │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_FSS_env-v0_17a5c_00000 RUNNING 0.928133 1.38519e-05 2048 128 30 0.905579 [128, 128] │
│ PPO_FSS_env-v0_17a5c_00001 PENDING 0.972168 8.7419e-05 1024 128 10 0.957066 [256, 256] │
│ PPO_FSS_env-v0_17a5c_00002 PENDING 0.906353 2.12536e-05 2048 512 50 0.968513 [64, 64] │
│ PPO_FSS_env-v0_17a5c_00003 PENDING 0.948159 0.000202029 512 64 30 0.997584 [64, 64] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Any suggestions or insights on further optimizing the training process would be greatly appreciated!