Hi all, I am trying to train my deep RL algorithm, implemented with RLlib. It runs fine on my MacBook Pro (13-inch, 2018, 4 cores/8 threads) but is, as expected, pretty slow. To speed it up, I have been using a Ray cluster on GCP.
However, the cluster is an order of magnitude slower than my MacBook! The cluster has ~18 cores, and for reference, the core speeds listed by GCP are roughly the same as my MacBook's (or better). Despite this, reaching the first postprocessing phase of training takes about 5 minutes with all 18 cores at full load. On my MacBook, reaching the same phase with only 1 worker takes about 1 minute, and all training parameters are identical in both cases. Using a GPU does not help.
From the Ray dashboard, the bottleneck is in RolloutWorker.par_iter_next(), which is the same bottleneck I see on my MacBook. However, this step should be much faster on the cloud cluster. Any advice?