Checkpoint frequency is not clear

It’s not entirely clear to me how checkpoint_freq interacts with some of the RLlib parameters, such as num_workers (and the other parallelism settings), rollout_fragment_length, and train_batch_size. I’ve run some long experiments believing I was creating regular checkpoints, but after they finished I realized nothing had been checkpointed. This was particularly irksome for the runs that did not end “cleanly”, because those left no checkpoint at all, even though I had set checkpoint_at_end to True.

My experiments run for 1 million episodes with a hard horizon of 200. I’m on a compute node with 72 processors, so I have set num_workers to 71, rollout_fragment_length to 200, and train_batch_size to 1000. I set checkpoint_freq to 100_000, expecting 11 checkpoints (one every 100,000 along the way plus the final one at the end). However, I don’t get a single checkpoint except for the one at the end (and only if my experiment finishes before the compute node dies).
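For concreteness, here is roughly what my launch looks like (heavily simplified; the environment name and most of the A2C config are placeholders rather than my real setup):

```python
from ray import tune

tune.run(
    "A2C",
    stop={"episodes_total": 1_000_000},
    checkpoint_freq=100_000,    # what exactly does this count?
    checkpoint_at_end=True,
    config={
        "env": "MyEnv",                  # placeholder for my actual environment
        "num_workers": 71,
        "rollout_fragment_length": 200,
        "train_batch_size": 1000,
        "horizon": 200,
    },
)
```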

This is super strange, and the API docs don’t help much because it’s not obvious to me what constitutes a “training iteration”. So I’ve been playing around with the parallelism and batch config, and here are some numbers:

+ -------------- + --------------- + ------- + ------- + ----------------------- + ------------------- + ----------------- + ------------------------------- + 
| episodes total | checkpoint freq | horizon | workers | rollout fragment length | training batch size | checkpoint at end | resulting number of checkpoints | 
+ -------------- + --------------- + ------- + ------- + ----------------------- + ------------------- + ----------------- + ------------------------------- + 
| 1e6            | 100             | 1       | 71      | 200                     | 1000                | True              | 1                               |
| 1e6            | 10              | 1       | 71      | 200                     | 1000                | True              | 2                               |
| 1e6            | 5               | 1       | 71      | 200                     | 1000                | True              | 3                               |
| 1e6            | 1               | 1       | 71      | 200                     | 1000                | True              | 13                              |
| 10             | 1               | 1       | 1       | 200                     | 1000                | True              | 1                               |
| 100            | 1               | 1       | 1       | 200                     | 1000                | True              | 1                               |
| 10             | 1               | 200     | 1       | 200                     | 200                 | False             | 2                               |
| 100            | 1               | 200     | 1       | 200                     | 200                 | False             | 5                               |
| 100            | 1               | 200     | 1       | 200                     | 1000                | False             | 4                               |
| 100            | 1               | 200     | 71      | 200                     | 1000                | False             | 2                               |
+ -------------- + --------------- + ------- + ------- + ----------------------- + ------------------- + ----------------- + ------------------------------- +

I haven’t been able to turn these numbers into a usable rule for setting checkpoint_freq, but I’m including them here in case the pattern is obvious to someone else.

Anyway, does anyone have any insight into how checkpoint_freq is counted? For my experiments I’m just going to set it to 1, but it would be nice to have a better understanding.

Hi @rusu24edward,

Interesting results. What are your Tune stopping criteria? When does your environment return done for an episode, after a fixed or a variable number of steps? Which RL algorithm are you using to train?

Hi @mannyv, the Tune stopping criterion is the total number of episodes. The env never returns done on its own, which is why I have a fixed horizon. These results are all using A2C.

Hey @rusu24edward , indeed strange. This could also be a Tune bug.
An iteration for A2C is essentially: gather train_batch_size timesteps of data (concatenated from the n workers, each producing rollout fragments of length 200), do the update, then broadcast the new weights back to all workers.
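Ignoring min_iter_time_s for a moment, a naive back-of-the-envelope count under that picture (the simplest possible model, not the actual RLlib internals) would be:

```python
# Assume one iteration consumes exactly train_batch_size timesteps.
train_batch_size = 1000
horizon = 200

episodes_per_iteration = train_batch_size / horizon          # ~5 episodes
iterations_for_run = 1_000_000 / episodes_per_iteration      # ~200,000 iterations
print(episodes_per_iteration, iterations_for_run)
```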

Also, what’s your min_iter_time_s value, and how long do your trials (1M episodes) take on average?

Hi @sven1977, min_iter_time_s is the default, which I believe is 30 seconds. My trials take around 2.5 hours to fully run.
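If it helps, here’s a quick sanity check on those numbers, assuming checkpoint_freq counts training iterations and each iteration is stretched to at least min_iter_time_s (I’m not sure that’s the right mental model):

```python
trial_seconds = 2.5 * 3600           # ~9,000 s per trial
min_iter_time_s = 30                 # assumed default
max_iterations = trial_seconds / min_iter_time_s   # at most ~300 iterations
print(max_iterations)                # far below my checkpoint_freq of 100_000
```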

Could you provide a self-sufficient reproduction script? From this distance it’s hard to tell exactly what is preventing the checkpoints from being created.
Thanks @rusu24edward !

As a software developer, I appreciate the value of reproducible scripts. However, this is difficult for many users, especially with software like RLlib, which boasts integration with so many other packages. It is made even more challenging when running in a distributed mode, because computer architecture and hardware matter so much more. I’m sure you know all this already; I’m saying it here to show that I appreciate the struggle. I am not able to provide a self-sufficient script because it’s not valuable enough for me to go through all the work of reproducing this in a simpler environment. This is a small bug for me, an annoyance, and not critical to my workflow.

Perhaps if Ray provided debug logs, I could run with debugging on and send you the logs, but that’s the best I can do for this problem.