Magnitude slower performance on GCP Cluster vs. macbook pro

ndalton12 · December 2, 2020, 6:25am

Hi all, I am trying to train my deep RL algorithm implemented with RLlib. It runs fine on my MacBook Pro (13-inch, 2018, 4 cores/8 threads) but is, as expected, pretty slow. To speed it up, I have been using a Ray cluster on GCP.

However, this cluster is an order of magnitude slower than my MacBook! The cluster is set to have ~18 cores, and the core speeds listed by GCP are roughly the same as (or better than) my MacBook's. Despite this, reaching the first postprocessing phase of training takes about 5 minutes with all 18 cores at full load. On my MacBook, reaching the same phase with only 1 worker takes about 1 minute; all training parameters are identical in the two cases. Using a GPU does not help.

According to the Ray dashboard, the bottleneck is in RolloutWorker.par_iter_next(), which is the same bottleneck as on my MacBook. However, this step should be much faster on the cloud cluster. Any advice?

ndalton12 · December 3, 2020, 6:16am

So this was a really frustrating problem, but it seems a large part of it is that the default pip installation of numpy does not include BLAS/LAPACK, and the images provided for GCP do not include them either, so numpy runs much slower there. The conda install of numpy does include them, so it runs faster. See here for some more discussion.

In summary, this isn’t really a Ray/RLlib bug, but for future reference it is important to note that RLlib (at least) is heavily bottlenecked by numpy when numpy is not running as efficiently as expected.
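One way to check whether numpy's linear algebra is the slow part on a given machine is to run the same small benchmark on the laptop and on a cluster node. The sketch below is an illustration, not part of numpy or RLlib: `matmul_gflops` is a hypothetical helper, and it assumes only numpy and the standard library. It times a dense float64 matrix multiply, which numpy hands to whatever BLAS it was built against, so a result far below what the other machine reports suggests numpy is not using an optimized BLAS.

```python
import time

import numpy as np


def matmul_gflops(n: int = 1000, repeats: int = 3, seed: int = 0) -> float:
    """Time an n x n float64 matrix multiply and return the best observed GFLOP/s.

    Hypothetical helper for comparing machines: numpy dispatches `a @ b` to its
    BLAS library, so a slow or unoptimized BLAS shows up as a low number here.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    a @ b  # warm-up run so one-time setup cost is not included in the timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-12)  # guard against a zero reading from the timer
    # Multiplying two n x n matrices takes about 2 * n**3 floating-point operations.
    return 2.0 * n**3 / best / 1e9


if __name__ == "__main__":
    print(f"{matmul_gflops():.1f} GFLOP/s")
```

Run it with the same `n` on each machine. Absolute numbers depend on the CPU and on how many threads the BLAS uses, so only a large gap between machines with similar cores is meaningful.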

sangcho · December 8, 2020, 6:16am

@sven1977 It’ll be nice to explicitly mention this in the doc if you didn’t yet!

sven1977 · December 8, 2020, 11:33am

Thanks for sharing this important finding. We are aware that some bottlenecks in RLlib are caused by “out-of-graph” (i.e. mostly numpy) code. The fact that it’s 5 times slower is scary, though. I didn’t know that. I’m reposting the solution for Linux here:

Use np.show_config() to find out whether numpy was built against a BLAS library.
If not, fix this via:

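# remove the currently installed numpy
sudo pip3 uninstall numpy
# build tools and Python 3 headers (python3-dev rather than python-dev, to match pip3)
sudo apt-get install build-essential python3-dev python3-setuptools libboost-python-dev libboost-thread-dev -y
# reinstall numpy
sudo pip3 install numpy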
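To compare the build configuration across machines (for example, the MacBook and each GCP node) without reading terminal output by hand, the report printed by np.show_config() can be captured as a string and then logged or diffed. This is a minimal sketch: `numpy_build_config` is a hypothetical helper name, and because the report format differs between numpy versions, the BLAS/LAPACK sections should be inspected by eye rather than parsed.

```python
import contextlib
import io

import numpy as np


def numpy_build_config() -> str:
    """Return the text that np.show_config() prints, so it can be logged or compared."""
    buf = io.StringIO()
    # np.show_config() writes its report to stdout; redirect it into the buffer.
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


if __name__ == "__main__":
    # Save this output from each machine and compare the BLAS/LAPACK entries.
    print(numpy_build_config())
```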
