Checkpoint frequency is not clear

It’s not entirely clear to me how checkpoint_freq interacts with some of the RLlib parameters, such as num_workers (and the other parallelism settings), rollout_fragment_length, and train_batch_size. I’ve run some long experiments believing I was creating regular checkpoints, but after they finished I realized nothing had been checkpointed. This was particularly irksome for the runs that did not end “cleanly”, because those left no checkpoint at all, even though I had set checkpoint_at_end to True.

My experiments run for 1 million episodes with a hard horizon of 200. I’m on a compute node with 72 processors, so I have set num_workers to 71, rollout_fragment_length to 200, and train_batch_size to 1000. I set checkpoint_freq to 100_000, expecting 11 checkpoints (one every 100,000 along the way plus the final one at the end). However, I don’t get a single checkpoint except for the one at the end (and only if my experiment finishes before the compute node dies).
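For concreteness, here is roughly what my launch looks like (heavily simplified; the environment name and most of the A2C config are placeholders rather than my real setup):

```python
from ray import tune

tune.run(
    "A2C",
    stop={"episodes_total": 1_000_000},
    checkpoint_freq=100_000,    # what exactly does this count?
    checkpoint_at_end=True,
    config={
        "env": "MyEnv",                  # placeholder for my actual environment
        "num_workers": 71,
        "rollout_fragment_length": 200,
        "train_batch_size": 1000,
        "horizon": 200,
    },
)
```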

This is super strange, and the API docs don’t help much because it’s not obvious to me what constitutes a “training iteration”. So I’ve been playing around with the parallelism and batch config, and here are some numbers:

+ -------------- + --------------- + ------- + ------- + ----------------------- + ------------------- + ----------------- + ------------------------------- + 
| episodes total | checkpoint freq | horizon | workers | rollout fragment length | training batch size | checkpoint at end | resulting number of checkpoints | 
+ -------------- + --------------- + ------- + ------- + ----------------------- + ------------------- + ----------------- + ------------------------------- + 
| 1e6            | 100             | 1       | 71      | 200                     | 1000                | True              | 1                               |
| 1e6            | 10              | 1       | 71      | 200                     | 1000                | True              | 2                               |
| 1e6            | 5               | 1       | 71      | 200                     | 1000                | True              | 3                               |
| 1e6            | 1               | 1       | 71      | 200                     | 1000                | True              | 13                              |
| 10             | 1               | 1       | 1       | 200                     | 1000                | True              | 1                               |
| 100            | 1               | 1       | 1       | 200                     | 1000                | True              | 1                               |
| 10             | 1               | 200     | 1       | 200                     | 200                 | False             | 2                               |
| 100            | 1               | 200     | 1       | 200                     | 200                 | False             | 5                               |
| 100            | 1               | 200     | 1       | 200                     | 1000                | False             | 4                               |
| 100            | 1               | 200     | 71      | 200                     | 1000                | False             | 2                               |
+ -------------- + --------------- + ------- + ------- + ----------------------- + ------------------- + ----------------- + ------------------------------- +

I haven’t been able to turn these numbers into a usable rule for setting checkpoint_freq, but I’m including them here in case the pattern is obvious to someone else.

Anyway, does anyone have any insight into how checkpoint_freq is counted? For my experiments I’m just going to set it to 1, but it would be nice to have a better understanding.

Hi @rusu24edward,

Interesting results. What are your Tune stopping criteria? When does your environment return done for an episode, after a fixed or a variable number of steps? Which RL algorithm are you using to train?

Hi @mannyv, the Tune stopping criterion is the total number of episodes. The env never returns done on its own, which is why I have a fixed horizon. These results are all using A2C.

Hey @rusu24edward , indeed strange. This could also be a Tune bug.
An iteration for A2C is essentially: gather train_batch_size timesteps of data (concatenated from the n workers, each producing rollout fragments of length 200), do the update, then broadcast the new weights back to all workers.
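Ignoring min_iter_time_s for a moment, a naive back-of-the-envelope count under that picture (the simplest possible model, not the actual RLlib internals) would be:

```python
# Assume one iteration consumes exactly train_batch_size timesteps.
train_batch_size = 1000
horizon = 200

episodes_per_iteration = train_batch_size / horizon          # ~5 episodes
iterations_for_run = 1_000_000 / episodes_per_iteration      # ~200,000 iterations
print(episodes_per_iteration, iterations_for_run)
```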

Also, what’s your min_iter_time_s value, and how long do your trials (1M episodes) take on average?

Hi @sven1977, min_iter_time_s is the default, which I believe is 30 seconds. My trials take around 2.5 hours to fully run.
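If it helps, here’s a quick sanity check on those numbers, assuming checkpoint_freq counts training iterations and each iteration is stretched to at least min_iter_time_s (I’m not sure that’s the right mental model):

```python
trial_seconds = 2.5 * 3600           # ~9,000 s per trial
min_iter_time_s = 30                 # assumed default
max_iterations = trial_seconds / min_iter_time_s   # at most ~300 iterations
print(max_iterations)                # far below my checkpoint_freq of 100_000
```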

Could you provide a self-sufficient reproduction script? From this distance it’s hard to tell exactly what is preventing the checkpoints from being created.
Thanks @rusu24edward !

As a software developer, I appreciate the value of reproducible scripts. However, this is difficult for many users, especially with software like RLlib, which boasts integration with so many other packages. It is made even more challenging when running in a distributed mode, because computer architecture and hardware matter so much more. I’m sure you know all this already; I’m saying it here to show that I appreciate the struggle. I am not able to provide a self-sufficient script because it’s not valuable enough for me to go through all the work of reproducing this in a simpler environment. This is a small bug for me, an annoyance, and not critical to my workflow.

Perhaps if Ray provided debug logs, I could run with debugging on and send you the logs, but that’s the best I can do for this problem.