SAC trainer slows down drastically

Stale_neutrino · May 24, 2022, 2:07pm

Hey guys,

I’m training an SAC agent on a custom env with a mostly default config (see config below). I noticed that after 80K steps my trainer slows down drastically, only producing a new episode every hour. Is this normal? Or am I screwing up a config setting?

config:

config['timesteps_per_iteration'] = 1000
config['learning_starts'] = 1000
config["num_gpus"] = 1

These are just the output files for each episode, but notice the time difference between them.

EDIT:

Should I try setting the replay buffer capacity to something smaller?

arturn · May 27, 2022, 4:22pm

Hi @Stale_neutrino ,

This is definitely not normal. The three config settings you posted look just fine.
You can have a look at your buffer’s estimated size with buffer._est_size_bytes, which is also part of the buffer’s stats under the key “est_size_bytes”.
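
For a rough cross-check on that number, here is a back-of-the-envelope sketch of how much memory a full buffer might need. It assumes flat float32 observations and actions and ignores Python object overhead and any extra columns RLlib stores; `estimate_buffer_bytes` and the dimensions used are hypothetical, not part of RLlib:

```python
def estimate_buffer_bytes(capacity, obs_dim, act_dim, float_bytes=4):
    """Rough lower bound on replay buffer memory for (s, a, r, s', done) tuples.

    Assumption: observations and actions are flat float arrays and the
    done flag takes one byte. Real buffers carry additional overhead.
    """
    per_transition = (
        2 * obs_dim * float_bytes  # obs and next obs
        + act_dim * float_bytes    # action
        + float_bytes              # reward
        + 1                        # done flag
    )
    return capacity * per_transition


# Hypothetical env: 10-dim observations, 2-dim actions, 1M-transition buffer.
size_bytes = estimate_buffer_bytes(capacity=1_000_000, obs_dim=10, act_dim=2)
print(f"~{size_bytes / 1e6:.0f} MB")
```

If the estimate (or est_size_bytes) approaches the RAM available on the machine, a smaller capacity is worth trying.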

Another thing to look at to figure out if something is unhealthy is the ray dashboard.

Lastly: What version of ray are you using?

Stale_neutrino · May 27, 2022, 5:28pm

Hey @arturn, thanks for the feedback. I’ll keep an eye on my ray dash while training. My Ray version is 1.12.0

Stale_neutrino · May 27, 2022, 6:13pm

@arturn one more thing, I’m running Tune on my SAC agent and currently going through 10 tuning trials. Here’s the current memory usage. Does this seem normal? Sadly I didn’t enable the ray dash for this run. I’ll start a new one, and once it’s done I’ll post the dash output.

Outputs from my terminal

Here’s the dash output, not sure why it’s not showing the PID of my current SAC

arturn · May 28, 2022, 2:38pm

Hey @Stale_neutrino ,

Is your environment taking super long to initialize? Because if not, 27 minutes of running the algorithm but having only 1 CPU at work means that you have no rollout workers interacting with environments.

Were these screenshots taken after your training drastically slowed down? If so:
Have a look at your tensorboard and look at sampling or training times. If sampling times and training times stay low, then you can be almost entirely sure it’s your buffer filling up that is the issue here.

Cheers

Stale_neutrino · May 28, 2022, 3:45pm

Hey @arturn I have num workers = 0, since any time I have more than 0 I get a creation error. Regarding agent performance, here’s the sampling rate, the action processing and mean env wait time

Dumb question, how can I clear my buffer?

Thanks!

arturn · May 29, 2022, 8:09pm

Thanks!

For the current nightly, you can set "capacity=<xy>" in your replay_buffer_config.
Training can also simply slow down because learning_starts timesteps/agent steps have been sampled and the training iteration function starts to include gradient computation etc.
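
A minimal sketch of what the capacity setting mentioned above could look like in a config dict. The value 100_000 is only an illustrative assumption, and whether this key is honored depends on the Ray version (older releases may name the setting differently), so check the SAC default config of your install:

```python
# Plain-dict sketch; in practice these entries would be merged into
# the SAC trainer's default config.
config = {
    "timesteps_per_iteration": 1000,
    "learning_starts": 1000,
    "num_gpus": 1,
}

# Assumed value for illustration: cap the replay buffer at 100k transitions.
config["replay_buffer_config"] = {"capacity": 100_000}
```

A smaller capacity bounds memory use, at the cost of the agent replaying a shorter history of experience.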

If you want to share more data, you can use tensorboard dev upload --logdir <dir>
