Our team has been using RLlib since this summer. We had hoped that using Ray as a backend would speed up our training, but compared to one-off GitHub repos from other researchers, an iteration of our training takes roughly 10x longer to finish on RLlib. We did our due diligence to make sure our timesteps_per_iteration, batch_size, num_envs, etc. are as "equivalent" as possible…
While troubleshooting, we realized that most of our workers (each using the default 1 CPU to collect samples) spend most of their time in set_weights(). This seemed a bit odd to me, since as far as I understand that code just fetches the weight dictionaries via ray.get() and updates the local models, which shouldn't take this long.
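For context, my mental model of that sync step is roughly the pattern below. This is a simplified sketch of driver-to-worker weight broadcasting in Ray, not RLlib's actual implementation; the class and variable names are made up for illustration:

```python
import numpy as np
import ray

ray.init()

# Hypothetical stand-in for a rollout worker (not an actual RLlib class).
@ray.remote(num_cpus=1)
class RolloutWorkerStub:
    def __init__(self):
        self.weights = None

    def set_weights(self, weights):
        # When the driver passes an ObjectRef as the argument, Ray fetches
        # and deserializes the value before this method runs, so the cost
        # of moving a large weight dict shows up around this call.
        self.weights = weights
        return True

# Driver side: put the weight dict (here four 2048x2048 float32 layers)
# into the object store once, then broadcast the same reference to all
# workers instead of serializing the dict once per worker.
weights = {f"fc_{i}": np.random.randn(2048, 2048).astype(np.float32)
           for i in range(4)}
weights_ref = ray.put(weights)

workers = [RolloutWorkerStub.remote() for _ in range(4)]
ray.get([w.set_weights.remote(weights_ref) for w in workers])
```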
Our setup: num_workers: 60 and num_gpus: 1 on a 64-core machine with a single GPU, and num_envs_per_worker: 4
(anything higher seems to seriously risk crashing the machine). Our Q and policy models are fully connected, `[2048, 2048, 2048, 2048]`. Training intensity was set to 1000. I wish I could provide a reproducible example, but we're currently using a custom algorithm (inheriting from Ray's SAC) and a custom environment (based on OpenAI Gym), so I was hoping to get some guidance based on this description. A rough sketch of the relevant config is below for reference. Thank you!
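To make the setup concrete, here is roughly the relevant slice of our config. This is a sketch using RLlib's SAC config keys as I understand them, layered on top of our custom SAC subclass; anything not listed is left at its default:

```python
config = {
    # Resources: 60 sampling workers (1 CPU each) + 1 GPU for the learner.
    "num_workers": 60,
    "num_gpus": 1,
    "num_envs_per_worker": 4,

    # Q and policy networks: four fully connected layers of 2048 units each.
    "Q_model": {"fcnet_hiddens": [2048, 2048, 2048, 2048]},
    "policy_model": {"fcnet_hiddens": [2048, 2048, 2048, 2048]},

    # Ratio of training steps to env steps sampled.
    "training_intensity": 1000,
}
```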