When running trials with Ray Tune and RLlib, we notice that the wait call of the CoreWorkers is 3 times slower when running 3 trials, 2 times slower when running 2 trials, and so on. This does not seem logical, since Ray should scale across concurrent trials. We have 32 threads and use 9 (+1) workers per trial, so we utilize 30 of the 32 threads. On a Threadripper with 128 threads we see the same behavior: two concurrent runs with 15 (+1) workers each are 2x slower than a single run with 15 (+1) workers, even though plenty of resources are still available.
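For context, here is a minimal sketch of how the per-trial resources are configured in our setup (the environment name is a placeholder, and the GPU fraction shown assumes 3 concurrent trials; our actual config has more settings):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Placeholder config illustrating the worker layout: 9 rollout workers plus the
# driver/learner process per trial, so 3 concurrent trials use ~30 of 32 threads.
config = (
    PPOConfig()
    .environment("CartPole-v1")        # placeholder environment
    .framework("torch")
    .rollouts(num_rollout_workers=9)   # 9 (+1 driver) workers per trial
    .resources(num_gpus=1 / 3)         # GPU split across 3 concurrent trials
)
```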
Output of cProfile:
1 Trial:
ncalls tottime percall cumtime percall filename:lineno(function)
(PPO pid=4081844) 8 12.933 1.617 12.933 1.617 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4081844) 60 0.382 0.006 0.382 0.006 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4081844) 1 0.159 0.159 0.169 0.169 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)
(PPO pid=4081844) 600 0.124 0.000 0.135 0.000 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4081844) 2534/761 0.103 0.000 0.171 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
2 Trials:
ncalls tottime percall cumtime percall filename:lineno(function)
(PPO pid=4088430) 8 23.784 2.973 23.784 2.973 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4088430) 600 0.455 0.001 0.467 0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4088430) 60 0.443 0.007 0.443 0.007 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4088430) 1 0.198 0.198 0.198 0.198 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)
(PPO pid=4088430) 2534/761 0.121 0.000 0.191 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
(PPO pid=4088430) 2600 0.111 0.000 0.111 0.000 {method 'to' of 'torch._C._TensorBase' objects}
3 Trials:
ncalls tottime percall cumtime percall filename:lineno(function)
(PPO pid=4098362) 8 35.518 4.440 35.518 4.440 {method 'wait' of 'ray._raylet.CoreWorker' objects}
(PPO pid=4098362) 60 0.631 0.011 0.631 0.011 {method 'run_backward' of 'torch._C._EngineBase' objects}
(PPO pid=4098362) 600 0.517 0.001 0.529 0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:36(__init__)
(PPO pid=4098362) 300 0.234 0.001 0.245 0.001 /miniconda3/envs/justin/lib/python3.8/site-packages/torch/distributions/distribution.py:262(_validate_sample)
(PPO pid=4098362) 1 0.222 0.222 0.222 0.222 /miniconda3/envs/justin/lib/python3.8/site-packages/ray/rllib/policy/rnn_sequencing.py:218(chop_into_sequences)
Notice the cumtime of the CoreWorker wait call: it grows roughly linearly with the number of trials (about 12.9 s with 1 trial, 23.8 s with 2 trials, 35.5 s with 3 trials).
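For reference, a rough sketch of how a profile like the above can be collected around a single training iteration (the exact hook point in our code differs; the config below is a placeholder):

```python
import cProfile
import pstats

from ray.rllib.algorithms.ppo import PPOConfig

# Build a placeholder PPO algorithm; in practice this is the trainable
# that Tune runs for each trial.
algo = PPOConfig().environment("CartPole-v1").framework("torch").build()

profiler = cProfile.Profile()
profiler.enable()
algo.train()            # profile one PPO training iteration
profiler.disable()

# Sort by cumulative time; in our runs CoreWorker.wait dominates this list.
pstats.Stats(profiler).sort_stats("cumtime").print_stats(10)
```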
To narrow down what causes this linear decrease in performance, we run the same experiment with 3 different seeds using the following tune config: tune.TuneConfig(reuse_actors=True, num_samples=1, max_concurrent_trials=3). Adding seeds decreases performance and removing seeds increases it, again linearly. The GPU is also split proportionally to the number of seeds (i.e., the number of concurrent runs), and it still has spare capacity (about 50% memory utilization on an RTX 3080 Ti).
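Concretely, the trial setup looks roughly like this (the seed values and the config contents are simplified placeholders for our actual setup):

```python
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

# Same placeholder config as sketched above.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=9)
    .resources(num_gpus=1 / 3)
)

# The same config is expanded into 3 trials via a seed grid search, so
# num_samples stays at 1 while max_concurrent_trials=3 runs them in parallel.
param_space = config.to_dict()
param_space["seed"] = tune.grid_search([0, 1, 2])

tuner = tune.Tuner(
    "PPO",
    param_space=param_space,
    tune_config=tune.TuneConfig(
        reuse_actors=True,
        num_samples=1,
        max_concurrent_trials=3,
    ),
)
results = tuner.fit()
```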
We are currently using Ray 2.3, and upgrading to 2.5.1 does not change this behavior. We use PyTorch 2.0.1 with CUDA 11.7 and the standard RLlib implementation of PPO, with a train batch size of 18000 and minibatches of 1500.
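In PPOConfig terms, the batch settings correspond roughly to the following (everything else stays at RLlib defaults):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .framework("torch")
    .training(
        train_batch_size=18000,    # samples collected per training iteration
        sgd_minibatch_size=1500,   # minibatch size used during SGD epochs
    )
)
```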
We would love some help in finding out what causes this unexpected scaling behavior.
Thank you!