Linear slowdown when running multiple trials with PPO

When running trials with Ray Tune and the RLlib library, we notice that the wait call of the CoreWorkers is 3 times slower when running 3 trials, 2 times slower when running 2 trials, and so on. This does not seem logical, given that Ray should scale properly. We have 32 threads and use 9 (+1) workers per trial, so we utilize 30/32 threads. On a Threadripper with 128 threads we see the same behavior: two runs with 15 (+1) workers each are 2x slower than one run with 15 (+1) workers, even though plenty of resources are still available.
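
For reference, a minimal sketch of the kind of setup we are describing (the environment name and exact resource fractions below are placeholders, not our actual experiment):

```python
# Rough sketch of the setup described above (placeholder env and resource
# values; our real experiment uses a custom environment and config).
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")        # placeholder environment
    .framework("torch")
    .rollouts(num_rollout_workers=9)   # 9 rollout workers (+1 driver) per trial
    .resources(num_gpus=1 / 3)         # GPU split across the concurrent trials
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(num_samples=3),  # 1, 2, or 3 concurrent trials
)
tuner.fit()
```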

Output of cProfile:
1 Trial:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
(PPO pid=4081844)         8   12.933    1.617   12.933    1.617 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4081844)        60    0.382    0.006    0.382    0.006 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4081844)         1    0.159    0.159    0.169    0.169 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)
(PPO pid=4081844)       600    0.124    0.000    0.135    0.000 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4081844)  2534/761    0.103    0.000    0.171    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}

2 Trials:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
(PPO pid=4088430)         8   23.784    2.973   23.784    2.973 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4088430)       600    0.455    0.001    0.467    0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4088430)        60    0.443    0.007    0.443    0.007 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4088430)         1    0.198    0.198    0.198    0.198 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)
(PPO pid=4088430)  2534/761    0.121    0.000    0.191    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
(PPO pid=4088430)      2600    0.111    0.000    0.111    0.000 {method 'to' of 'torch._C._TensorBase' objects}

3 Trials:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
(PPO pid=4098362)         8   35.518    4.440   35.518    4.440 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4098362)        60    0.631    0.011    0.631    0.011 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4098362)       600    0.517    0.001    0.529    0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4098362)       300    0.234    0.001    0.245    0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:262(_validate_sample)
(PPO pid=4098362)         1    0.222    0.222    0.222    0.222 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)

Notice the cumtime of the wait call: it grows roughly linearly with the number of trials.
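
A profile like the ones above can be reproduced along these lines (a minimal sketch with a placeholder environment, not our exact measurement code):

```python
# Minimal sketch of collecting a cProfile snapshot of a single PPO training
# iteration (placeholder environment; not our exact setup).
import cProfile
import pstats

from ray.rllib.algorithms.ppo import PPOConfig

algo = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=9)
    .build()
)

profiler = cProfile.Profile()
profiler.enable()
algo.train()                     # one training iteration
profiler.disable()

pstats.Stats(profiler).sort_stats("tottime").print_stats(5)
```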

To narrow down what causes this linear decrease in performance, we run the same experiment with 3 different seeds using the following tune_config = tune.TuneConfig(reuse_actors=True, num_samples=1, max_concurrent_trials=3). Increasing or decreasing the number of seeds decreases or increases performance linearly. The GPU is also split proportionally to the number of seeds (i.e., the number of concurrent runs), and it still has plenty of spare resources (about 50% memory utilization on an RTX 3080 Ti).
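
Concretely, the seed experiment looks roughly like this (the TuneConfig values are the ones quoted above; the grid_search over "seed" and the placeholder environment are illustrative only):

```python
# Sketch of the seed experiment: same TuneConfig as quoted above; the
# grid_search over "seed" and the placeholder env are illustrative only.
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

param_space = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=9)
    .to_dict()
)
param_space["seed"] = tune.grid_search([0, 1, 2])  # 3 seeds -> 3 concurrent trials

tuner = tune.Tuner(
    "PPO",
    param_space=param_space,
    tune_config=tune.TuneConfig(
        reuse_actors=True,
        num_samples=1,
        max_concurrent_trials=3,
    ),
)
tuner.fit()
```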

We are currently using Ray 2.3, but upgrading to 2.5.1 does not change this behavior. We use PyTorch 2.0.1 with CUDA 11.7. The standard implementation of PPO is used, with a train batch size of 18000 and minibatches of 1500.
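
In RLlib config terms, those PPO settings correspond to something like the following sketch (everything else left at the defaults of the standard PPO implementation):

```python
# The PPO batch settings mentioned above, expressed as an RLlib config
# (all other settings left at the defaults of the standard PPO implementation).
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .framework("torch")
    .training(train_batch_size=18000, sgd_minibatch_size=1500)
)
```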

We would love some help in finding out what causes this unexpected scaling behaviour.

Thank you!

@JonathanvWestendorp,

Does this have a major impact on the overall runtime of your workload? It might be expected, since when you have more trials there are more objects to be waited on, and the wait happens in the centralized controller process.
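
Schematically, the pattern is something like the toy sketch below (a simplified stand-in, not the actual Tune controller code): a single driver process holds one pending object ref per running trial and blocks in ray.wait() to collect results, so the profiled wait time accrues in that one process.

```python
# Toy illustration of the pattern described above (a simplified stand-in,
# not the actual Tune controller code).
import time

import ray

ray.init()

@ray.remote
def run_trial_step(trial_id):
    time.sleep(1.0)          # stand-in for one training iteration
    return trial_id

# One pending ref per concurrent trial (1, 2, 3, ...).
pending = [run_trial_step.remote(i) for i in range(3)]
while pending:
    done, pending = ray.wait(pending, num_returns=1)   # controller blocks here
    print("trial step finished:", ray.get(done[0]))

ray.shutdown()
```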

Well, in theory, our systems should be able to handle many (around 8) parallel runs: there are more than enough CPU and GPU cores and plenty of RAM. It would definitely speed up the overall runtime if we could parallelize our trials. We did not expect entirely separate runs to affect each other’s speed.

there are more objects to be waited on, and the wait happens in the centralized controller process.

Does this mean that it could be better to start multiple different trials without using Tune, i.e., without a centralized controller (something like the sketch below)?

Do you have any tips on how to make parallelizing our runs advantageous?
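
For context, by "without Tune" we mean something along these lines, with each trial simply being its own driver script/process (the script name train_seed.py, the environment, and the iteration count are placeholders):

```python
# Hypothetical "no Tune" setup: each trial is its own driver process that
# builds and trains an RLlib Algorithm directly, e.g. launched as:
#
#   for s in 0 1 2; do python train_seed.py $s & done
#
# Script name, environment, and iteration count are placeholders.
import sys

from ray.rllib.algorithms.ppo import PPOConfig

seed = int(sys.argv[1])
algo = (
    PPOConfig()
    .environment("CartPole-v1")        # placeholder environment
    .framework("torch")
    .rollouts(num_rollout_workers=9)
    .debugging(seed=seed)
    .build()
)

for _ in range(100):
    result = algo.train()
    print(seed, result["episode_reward_mean"])
```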

Same problem here.

A linear decrease in performance should not happen if there are more than enough resources available to run everything in parallel.

@JonathanvWestendorp

When you say linear decrease in perf, do you mean the decrease in the overall runtime, or just the time spent in {method 'wait' of 'ray._raylet.CoreWorker' objects}?

cc @kai to get Tune’s perspective as well.

The overall runtime (including the wait method).