Save model parameters at each checkpoint

I would like to save the model parameters (.pb, .h5) at each checkpoint, because we want to
compare the various stages of training outside of the Ray/RLlib framework and the models are
relatively small. At the moment it is not possible to know ahead of time how many iterations
training will need.

I have confirmed saving at the end of training works:

from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.trial import ExportFormat

tune.run(PPOTrainer,
         config={"env": "CartPole-v0"},
         export_formats=[ExportFormat.MODEL, ExportFormat.H5, ExportFormat.CHECKPOINT],
         local_dir="cart_outputs3",
         stop={"training_iteration": 1})

PPO_CartPole-v0_91fad_00000_0_2021-07-14_09-14-54
├── checkpoint
│   ├── checkpoint
│   ├── model.data-00000-of-00001
│   ├── model.index
│   └── model.meta
├── events.out.tfevents.1626250494.velocity
├── model
│   ├── events.out.tfevents.1626250523.velocity
│   ├── saved_model.pb
│   └── variables
│       ├── variables.data-00000-of-00002
│       ├── variables.data-00001-of-00002
│       └── variables.index
├── params.json
├── params.pkl
├── progress.csv
└── result.json

(The first problem is that no .h5 file is created, even though ExportFormat.H5 is one of the requested export formats.)
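The saved_model.pb export, on the other hand, is produced, and the kind of standalone use I am after looks roughly like the following sketch: load the export in plain TensorFlow with no Ray involved. The path is just an example based on the trial directory above; I have only confirmed that the directory is produced, not this loading step.

# Sketch of the intended standalone use: load the exported SavedModel without Ray.
import tensorflow as tf

# Adjust to wherever the trial directory ended up under local_dir.
export_dir = "PPO_CartPole-v0_91fad_00000_0_2021-07-14_09-14-54/model"
loaded = tf.saved_model.load(export_dir)

# List the serving signatures to confirm the exported graph is intact.
print(list(loaded.signatures.keys()))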

Now the same run through Tune with periodic checkpointing (checkpoint_freq=2):

results = tune.run(args.run,
                   config=config,
                   stop=stop,
                   checkpoint_freq=2,
                   export_formats=[ExportFormat.MODEL, ExportFormat.H5],
                   num_samples=1,
                   checkpoint_at_end=False)

But in this case nothing besides Tune's own bookkeeping files appears in the checkpoint directories at all:

checkpoint_000002
├── checkpoint-2
└── checkpoint-2.tune_metadata

In a previous version of Ray (I think 0.8.0), setting

"checkpoint_freq": 2,
"checkpoint_at_end": True,

in the experiment config and using run_experiments would create the model data under each checkpoint directory:

run_experiments({"EnvName": myconfig})

So how can one save the model parameters (TensorFlow in this case) to .pb or .h5 at each
checkpoint (the model is small) using ray.tune? Many thanks!
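The only workaround I can think of so far is a second pass after training: restore each checkpoint into a fresh trainer and export the policy model by hand, roughly as in the sketch below. This is untested, it assumes Trainer.export_policy_model() works in 1.4.1 as documented, and the paths are placeholders; it is also exactly the kind of manual step I was hoping checkpointing would handle.

# Untested workaround sketch: restore each saved checkpoint into a fresh
# trainer and export the policy model manually.
import glob
import os

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

trial_dir = "cart_outputs3/PPO/PPO_CartPole-v0_..."  # placeholder path
trainer = PPOTrainer(config={"env": "CartPole-v0", "num_workers": 0})

for ckpt_dir in sorted(glob.glob(os.path.join(trial_dir, "checkpoint_*"))):
    # Tune writes e.g. checkpoint_000002/checkpoint-2, so rebuild the file
    # name from the directory suffix.
    step = int(os.path.basename(ckpt_dir).split("_")[-1])
    trainer.restore(os.path.join(ckpt_dir, "checkpoint-{}".format(step)))
    # Assumes export_policy_model() is available on the Trainer in 1.4.1.
    trainer.export_policy_model(os.path.join(ckpt_dir, "exported_model"))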

As an additional point, I am using ray==1.4.1 on Mac OS X. Is saving the model in a recoverable format (.pb, .h5) at each checkpoint a supported feature? Is there any other information I can provide? I am rather stuck at this point. Cheers.

If anyone has information about this I would greatly appreciate it; I have tried several forums and scoured the internet for this question, which seems so basic and essential. There are quite a few similar queries, but no response anywhere. It would seem quite silly to have this fantastic framework for training and tuning but be unable to actually use the trained model outside of a Ray actor, which forces one to go through the serving/ports machinery, etc. As I mentioned, saving a model at each checkpoint used to work in previous versions. I have tried this on Mac OS X and Linux and get the same result: a checkpoint contains only the following files:

checkpoint-1479
checkpoint-1479.tune_metadata

despite specifying MODEL and H5 in export_formats. No error is produced during training:

results = tune.run(args.run,
                   config=config,
                   stop=stop,
                   checkpoint_freq=1,
                   export_formats=[ExportFormat.MODEL,
                                   ExportFormat.H5,
                                   ExportFormat.CHECKPOINT],
                   checkpoint_at_end=True)

I've been using Ray since the initial version and have never had this issue - please help.

From a design standpoint I don't think it would make sense to rely on the 'results' object returned from tuning: first, one may not know how many iterations are needed ahead of time and may need to hit Ctrl-C to stop training, in which case everything held in 'results' would be lost. Instead, the usual point of a checkpoint is to save the model so that recovery can start from an arbitrary point. The other objective is inference: we may want to compare inference across different checkpoints, but for that we need to re-create the network model in TensorFlow without the overhead of Ray actors in the way.
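To make that concrete, the comparison I have in mind would look roughly like the sketch below: iterate over per-checkpoint exports (a hypothetical layout, e.g. the exported_model directories from the workaround sketched above) and inspect each one in plain TensorFlow, with no Ray actors involved. The signature keys and tensor names depend on how RLlib builds the export, so they would need to be checked before feeding observations through.

# Sketch of the per-checkpoint comparison, in plain TensorFlow only.
# The exported_model layout is hypothetical (one SavedModel per checkpoint).
import glob

import tensorflow as tf

for export_dir in sorted(glob.glob("checkpoint_*/exported_model")):
    model = tf.saved_model.load(export_dir)
    # Inspect the signatures before hard-coding any tensor names.
    for name, sig in model.signatures.items():
        print(export_dir, name, sig.structured_input_signature)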