It’s not entirely clear to me how checkpoint_freq interacts with some of the RLlib parameters, such as num_workers and the other parallelism settings, rollout_fragment_length, and train_batch_size. I’ve run some long experiments believing I was creating regular checkpoints, but after they finished I realized nothing had been checkpointed. This was particularly irksome for runs that did not end “cleanly”: those left no checkpoint at all (and I had set checkpoint_at_end to True).
My experiments run for 1 million episodes with a hard horizon of 200. I’m on a compute node with 72 processors, so I have set num_workers to 71, rollout_fragment_length to 200, and train_batch_size to 1000. I set checkpoint_freq to 100_000, expecting to get 11 checkpoints (one at each of the 10 steps along the way, plus the final one). However, I don’t get a single checkpoint except the one at the end (and only if my experiment finished before the compute node died).
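My current working hypothesis is that checkpoint_freq counts training iterations, not episodes or timesteps. That is my reading of the docs, not something I’ve confirmed in the source, but here is a back-of-the-envelope sketch under that assumption (the per-iteration sample accounting is also my guess at how parallel sample collection works):

```python
# Sketch: estimating how many training iterations my run produces,
# ASSUMING checkpoint_freq counts training iterations (my reading of
# the Tune docs, not confirmed against the Ray source).

num_workers = 71
rollout_fragment_length = 200
horizon = 200
total_episodes = 1_000_000
checkpoint_freq = 100_000

# Assumption: each sampling round collects one fragment per worker.
samples_per_round = num_workers * rollout_fragment_length  # 14_200 timesteps

# 14_200 >= train_batch_size (1000), so one round per training iteration.
# With horizon == rollout_fragment_length, each fragment is one whole
# episode, so one iteration finishes num_workers episodes.
episodes_per_iteration = num_workers  # 71

# Rough total number of training iterations for the whole run:
total_iterations = total_episodes // episodes_per_iteration  # ~14_084

# Periodic checkpoints, if checkpoint_freq is counted in iterations:
expected_checkpoints = total_iterations // checkpoint_freq
print(total_iterations, expected_checkpoints)  # far fewer iterations than 100_000
```

If this accounting is anywhere near right, the whole run is only on the order of 14k iterations, so a checkpoint_freq of 100_000 would never fire, which would match what I’m seeing: only the checkpoint_at_end one.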
This is super strange, and the API docs are vague: it’s not obvious to me what constitutes a “training iteration”. So I’ve been playing around with the parallelism and batch config, and here are some numbers:
+----------------+-----------------+---------+---------+-------------------------+---------------------+-------------------+---------------------------------+
| episodes total | checkpoint freq | horizon | workers | rollout fragment length | training batch size | checkpoint at end | resulting number of checkpoints |
+----------------+-----------------+---------+---------+-------------------------+---------------------+-------------------+---------------------------------+
| 1e6            | 100             | 1       | 71      | 200                     | 1000                | True              | 1                               |
| 1e6            | 10              | 1       | 71      | 200                     | 1000                | True              | 2                               |
| 1e6            | 5               | 1       | 71      | 200                     | 1000                | True              | 3                               |
| 1e6            | 1               | 1       | 71      | 200                     | 1000                | True              | 13                              |
| 10             | 1               | 1       | 1       | 200                     | 1000                | True              | 1                               |
| 100            | 1               | 1       | 1       | 200                     | 1000                | True              | 1                               |
| 10             | 1               | 200     | 1       | 200                     | 200                 | False             | 2                               |
| 100            | 1               | 200     | 1       | 200                     | 200                 | False             | 5                               |
| 100            | 1               | 200     | 1       | 200                     | 1000                | False             | 4                               |
| 100            | 1               | 200     | 71      | 200                     | 1000                | False             | 2                               |
+----------------+-----------------+---------+---------+-------------------------+---------------------+-------------------+---------------------------------+
It’s not clear to me how to turn these numbers into a usable rule for setting checkpoint_freq, but I’m including them here in case the pattern is obvious to someone else.
Anyway, does anyone have any insight into how checkpoint_freq actually works? For my experiments I’m just going to set it to 1, but it would be nice to have a better understanding.
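For reference, this is roughly what my setup looks like with that workaround in place. A minimal sketch using the old-style tune.run API; the trainer name and env are placeholders for my actual setup, not recommendations:

```python
# Minimal sketch of my current workaround (old-style tune.run API).
# checkpoint_freq=1 checkpoints every training iteration, which seems
# to be the safest choice when the iteration count is hard to predict.
from ray import tune

tune.run(
    "PPO",                        # placeholder trainer; mine differs
    stop={"episodes_total": 1_000_000},
    checkpoint_freq=1,            # checkpoint every training iteration
    checkpoint_at_end=True,       # still keep the final checkpoint
    config={
        "env": "CartPole-v1",     # placeholder; my actual env differs
        "num_workers": 71,
        "rollout_fragment_length": 200,
        "train_batch_size": 1000,
        "horizon": 200,
    },
)
```

The obvious downside is disk usage: checkpointing every iteration writes a lot of checkpoints for long runs, so this is a stopgap until I understand what a “training iteration” actually counts.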