[RLlib] Need insights on profiling

Hello RLlib community,

While profiling some of my experiments, I saw a few entries in TensorBoard concerning inference times, env_step_mean_ms, etc. I wanted to know whether it is possible to measure:

  • the time spent in the physical environment during rollouts, compared with everything else (model inference, backprop, …) → seems to be "tune/sampler_perf/mean_env_wait_ms"
  • the time spent on inference, and how to tell whether it is large or not (again, something to compare it against the other things going on) → seems to be tune/sampler_perf/mean_inference_ms
  • the time spent on backpropagation through the models → seems to be roughly tune/timers/learn_time_ms
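One rough way to compare the two per-step sampler timers is to express each as a share of their sum. The sketch below assumes that the TensorBoard tags map onto a nested result dict (so tune/sampler_perf/mean_env_wait_ms would be result["sampler_perf"]["mean_env_wait_ms"]); the helper name and the numbers are made up for illustration. As far as I can tell, learn_time_ms is measured per training iteration rather than per env step, so it is left out of the shares.

```python
def sampler_time_shares(result):
    """Return each *_ms entry under result["sampler_perf"] as a fraction of their sum.

    Hypothetical helper: `result` is assumed to be a nested dict whose layout
    mirrors the TensorBoard tags from the post.
    """
    perf = result.get("sampler_perf", {})
    timers = {k: float(v) for k, v in perf.items() if k.endswith("_ms")}
    total = sum(timers.values())
    if total <= 0:
        return {}
    return {k: v / total for k, v in timers.items()}


# Made-up numbers, only to show the shape of the input and output.
example_result = {
    "sampler_perf": {
        "mean_env_wait_ms": 6.0,
        "mean_inference_ms": 2.0,
    },
    # Assumed to be per training iteration, so not comparable per step.
    "timers": {"learn_time_ms": 850.0},
}

shares = sampler_time_shares(example_result)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.0%}")
```

If the env wait share dominates, the environment itself is the likely bottleneck; if the inference share dominates, batching more action computations or moving inference to a GPU is worth trying first.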

The solution to 1) seems to be to use more workers and optimize the physical env
2) seems to call for doing rollouts on GPU / using more envs_per_worker
3) seems to call for using a GPU instead of a CPU

I’d love some more information. I tried the PyCharm profiler, but it didn’t help because the diagram was full of abstract classes.

Thanks in advance

Hey @Clement_Collgon, your assumptions about the meaning of these stats are all correct :slight_smile:

  1. All correct.
  2. Correct, envs_per_worker will batch more action computations together, yielding some speedup here.
  3. Also try increasing num_gpus in your config (if the algo supports that, which all PyTorch algos do, as do tf PPO/IMPALA/DQN/SAC/A2C/PG).
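As a minimal sketch, the three knobs discussed above could be set in a config dict like the one below. The values are placeholders, not recommendations, and I am assuming the dict-style config in which the keys are spelled num_workers, num_envs_per_worker and num_gpus; the env name and the choice of PPO in the commented-out call are only examples.

```python
# Placeholder values; tune them against the sampler_perf and timers stats.
config = {
    # 1) More rollout workers to parallelize time spent in the env.
    "num_workers": 4,
    # 2) Several envs per worker, so action computations are batched
    #    into fewer, larger inference calls.
    "num_envs_per_worker": 8,
    # 3) Run the learner (backprop) on a GPU, if the algo supports it.
    "num_gpus": 1,
}

# Assumed usage (requires ray[rllib]; env and algorithm are examples only):
# from ray import tune
# tune.run("PPO", config={**config, "env": "CartPole-v0"})
```

Changing one setting at a time and re-checking mean_env_wait_ms, mean_inference_ms and learn_time_ms makes it easier to see which change actually helped.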