Linear slowdown when running multiple trials with PPO

When running trials with Ray Tune and the RLlib library, we notice that the wait call of the CoreWorkers is 3 times slower when running 3 trials, 2 times slower when running 2 trials, and so on. This does not seem logical, given that Ray should scale properly. We have 32 threads and use 9 (+1) workers per trial, so we utilize 30/32 threads. On a Threadripper with 128 threads we see the same behavior: two runs with 15 (+1) workers each are 2x slower than one run with 15 (+1) workers, even though plenty of resources are still available.
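
For reference, a minimal sketch of the kind of setup we are describing (the environment name and exact resource fractions below are placeholders, not our actual experiment):

```python
# Rough sketch of the setup described above (placeholder env and resource
# values; our real experiment uses a custom environment and config).
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")        # placeholder environment
    .framework("torch")
    .rollouts(num_rollout_workers=9)   # 9 rollout workers (+1 driver) per trial
    .resources(num_gpus=1 / 3)         # GPU split across the concurrent trials
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(num_samples=3),  # 1, 2, or 3 concurrent trials
)
tuner.fit()
```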

Output of cProfile:
1 Trial:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
(PPO pid=4081844)         8   12.933    1.617   12.933    1.617 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4081844)        60    0.382    0.006    0.382    0.006 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4081844)         1    0.159    0.159    0.169    0.169 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)
(PPO pid=4081844)       600    0.124    0.000    0.135    0.000 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4081844)  2534/761    0.103    0.000    0.171    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}

2 Trials:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
(PPO pid=4088430)         8   23.784    2.973   23.784    2.973 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4088430)       600    0.455    0.001    0.467    0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4088430)        60    0.443    0.007    0.443    0.007 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4088430)         1    0.198    0.198    0.198    0.198 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)
(PPO pid=4088430)  2534/761    0.121    0.000    0.191    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
(PPO pid=4088430)      2600    0.111    0.000    0.111    0.000 {method 'to' of 'torch._C._TensorBase' objects}

3 Trials:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
(PPO pid=4098362)         8   35.518    4.440   35.518    4.440 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4098362)        60    0.631    0.011    0.631    0.011 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4098362)       600    0.517    0.001    0.529    0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4098362)       300    0.234    0.001    0.245    0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:262(_validate_sample)
(PPO pid=4098362)         1    0.222    0.222    0.222    0.222 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)

Notice the cumtime of the wait call: it grows roughly linearly with the number of trials.
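
A profile like the ones above can be reproduced along these lines (a minimal sketch with a placeholder environment, not our exact measurement code):

```python
# Minimal sketch of collecting a cProfile snapshot of a single PPO training
# iteration (placeholder environment; not our exact setup).
import cProfile
import pstats

from ray.rllib.algorithms.ppo import PPOConfig

algo = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=9)
    .build()
)

profiler = cProfile.Profile()
profiler.enable()
algo.train()                     # one training iteration
profiler.disable()

pstats.Stats(profiler).sort_stats("tottime").print_stats(5)
```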

To narrow down what causes this linear decrease in performance, we run the same experiment with 3 different seeds using the following tune_config = tune.TuneConfig(reuse_actors=True, num_samples=1, max_concurrent_trials=3). Increasing or decreasing the number of seeds decreases or increases performance linearly. The GPU is also split proportionally to the number of seeds (i.e., the number of concurrent runs), and it still has plenty of spare resources (about 50% memory utilization on an RTX 3080 Ti).
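
Concretely, the seed experiment looks roughly like this (the TuneConfig values are the ones quoted above; the grid_search over "seed" and the placeholder environment are illustrative only):

```python
# Sketch of the seed experiment: same TuneConfig as quoted above; the
# grid_search over "seed" and the placeholder env are illustrative only.
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

param_space = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=9)
    .to_dict()
)
param_space["seed"] = tune.grid_search([0, 1, 2])  # 3 seeds -> 3 concurrent trials

tuner = tune.Tuner(
    "PPO",
    param_space=param_space,
    tune_config=tune.TuneConfig(
        reuse_actors=True,
        num_samples=1,
        max_concurrent_trials=3,
    ),
)
tuner.fit()
```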

We are currently using Ray 2.3, but upgrading to 2.5.1 does not change this behavior. We use PyTorch 2.0.1 with CUDA 11.7. The standard implementation of PPO is used, with a train batch size of 18000 and minibatches of 1500.
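
In RLlib config terms, those PPO settings correspond to something like the following sketch (everything else left at the defaults of the standard PPO implementation):

```python
# The PPO batch settings mentioned above, expressed as an RLlib config
# (all other settings left at the defaults of the standard PPO implementation).
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .framework("torch")
    .training(train_batch_size=18000, sgd_minibatch_size=1500)
)
```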

We would love some help in finding out what causes this unexpected scaling behaviour.

Thank you!

@JonathanvWestendorp,

Does this have a major impact on the overall runtime of your workload? It might be expected, since when you have more trials there are more objects to be waited on, and the wait happens in the centralized controller process.
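
Schematically, the pattern is something like the toy sketch below (a simplified stand-in, not the actual Tune controller code): a single driver process holds one pending object ref per running trial and blocks in ray.wait() to collect results, so the profiled wait time accrues in that one process.

```python
# Toy illustration of the pattern described above (a simplified stand-in,
# not the actual Tune controller code).
import time

import ray

ray.init()

@ray.remote
def run_trial_step(trial_id):
    time.sleep(1.0)          # stand-in for one training iteration
    return trial_id

# One pending ref per concurrent trial (1, 2, 3, ...).
pending = [run_trial_step.remote(i) for i in range(3)]
while pending:
    done, pending = ray.wait(pending, num_returns=1)   # controller blocks here
    print("trial step finished:", ray.get(done[0]))

ray.shutdown()
```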

Well, in theory, our systems should be able to handle many (around 8) parallel runs: there are more than enough CPU and GPU cores and plenty of RAM. It would definitely speed up the overall runtime if we could parallelize our trials. We did not expect entirely separate runs to affect each other’s speed.

there are more objects to be waited on, and the wait happens in the centralized controller process.

Does this mean that it could be better to start multiple different trials without using Tune, i.e., without a centralized controller (something like the sketch below)?

Do you have any tips on how to make parallelizing our runs advantageous?
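
For context, by "without Tune" we mean something along these lines, with each trial simply being its own driver script/process (the script name train_seed.py, the environment, and the iteration count are placeholders):

```python
# Hypothetical "no Tune" setup: each trial is its own driver process that
# builds and trains an RLlib Algorithm directly, e.g. launched as:
#
#   for s in 0 1 2; do python train_seed.py $s & done
#
# Script name, environment, and iteration count are placeholders.
import sys

from ray.rllib.algorithms.ppo import PPOConfig

seed = int(sys.argv[1])
algo = (
    PPOConfig()
    .environment("CartPole-v1")        # placeholder environment
    .framework("torch")
    .rollouts(num_rollout_workers=9)
    .debugging(seed=seed)
    .build()
)

for _ in range(100):
    result = algo.train()
    print(seed, result["episode_reward_mean"])
```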

Same problem here.

A linear decrease in performance should not happen if there are more than enough resources available to run everything in parallel.

@JonathanvWestendorp

When you say linear decrease in perf, do you mean the decrease in the overall runtime, or just the time spent in {method 'wait' of 'ray._raylet.CoreWorker' objects}?

cc @kai to get Tune’s perspective as well.

The overall runtime (including the wait method).