Ray 2.9 can't load a checkpoint stored with Ray 2.5

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I finished training a single agent model with SAC on Ray 2.5.1 recently (on Ubuntu). It performs well and I want to use it for inference tasks going forward. It was stored using code like

    import ray
    from ray.rllib.algorithms import sac

    ray.init(storage = DATA_PATH)
    algo = "SAC"
    cfg = sac.SACConfig()
    cfg.framework("torch")
    cfg.checkpointing(export_native_model_files = True)
    ...
    algo = cfg.build()
    for iteration in range(1, max_iterations+1):
        result = algo.train()
        # Write a full Algorithm checkpoint after every training iteration
        algo.save(checkpoint_dir = DATA_PATH)

I have since done a routine system upgrade, which included upgrading Ray to 2.9.0. The upgrade went smoothly. My inference code includes a snippet like this:

        ray.init()
        cfg = sac.SACConfig()
        cfg.framework("torch").exploration(explore = False)
        algo = cfg.build()
        algo.restore(checkpoint)

However, now when I run the inference program I get the following error:

Traceback (most recent call last):
  File "/home/starkj/projects/cda1/src/inference.py", line 221, in <module>
    main(sys.argv)
  File "/home/starkj/projects/cda1/src/inference.py", line 92, in main
    algo.restore(checkpoint)
  File "/home/starkj/miniconda3/envs/cda/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 577, in restore
    self.load_checkpoint(checkpoint_dir)
  File "/home/starkj/miniconda3/envs/cda/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 2341, in load_checkpoint
    checkpoint_data = Algorithm._checkpoint_info_to_algorithm_state(checkpoint_info)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/cda/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 2926, in _checkpoint_info_to_algorithm_state
    new_config = default_config.update_from_dict(state["config"])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/cda/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm_config.py", line 690, in update_from_dict
    setattr(self, key, value)
  File "/home/starkj/miniconda3/envs/cda/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm_config.py", line 3501, in __setattr__
    super().__setattr__(key, value)
AttributeError: property 'is_atari' of 'SACConfig' object has no setter
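
Reading the traceback literally, update_from_dict() calls setattr() for each key in the saved config, and one of those keys seems to be is_atari, which 2.9.0 exposes as a property with no setter. Here is a minimal sketch of that failure mode in plain Python. DemoConfig and try_restore are hypothetical names I made up for illustration; none of this is RLlib code and it may not match what RLlib really does internally.

```python
class DemoConfig:
    """Hypothetical stand-in for a config class; NOT RLlib code."""

    def __init__(self):
        self.framework = "torch"

    @property
    def is_atari(self):
        # Read-only derived value: no setter is defined for it.
        return False

    def update_from_dict(self, state):
        # Copy every saved key onto the object, as the traceback suggests.
        for key, value in state.items():
            setattr(self, key, value)
        return self


def try_restore(state):
    """Return None on success, or the error text if a key cannot be set."""
    try:
        DemoConfig().update_from_dict(state)
    except AttributeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(try_restore({"framework": "tf"}))
    print(try_restore({"framework": "torch", "is_atari": False}))
```

The first call succeeds, while the second fails with an AttributeError, because the saved dict carries a key that the class only allows to be read.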

I have looked through the RLlib source code for an obvious problem, but it is not clear to me where the is_atari property could be involved. I’ve studied the updated user guides for using checkpoints in 2.9.0 and it seems that there should be no backward compatibility issues between these two versions. In fact, I confirmed my checkpoint’s rllib_checkpoint.json file shows checkpoint_version is 1.1, same as what my training program is now producing under Ray 2.9.0.
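
For reference, this is roughly how I checked the version. It is a small sketch that assumes rllib_checkpoint.json sits at the top level of the checkpoint directory; read_checkpoint_version is just a helper name I chose, not a Ray function.

```python
import json
from pathlib import Path


def read_checkpoint_version(checkpoint_dir):
    """Return the checkpoint_version entry from rllib_checkpoint.json.

    Assumes the JSON file is directly inside checkpoint_dir.
    Returns None if the key is absent.
    """
    meta_path = Path(checkpoint_dir) / "rllib_checkpoint.json"
    with meta_path.open() as f:
        meta = json.load(f)
    return meta.get("checkpoint_version")
```

Calling it on the old (2.5.1) checkpoint directory and on a fresh 2.9.0 one is how I compared the two.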

I should also point out that I’m using a custom model and registering it with ModelCatalog.register_custom_model(). With Ray 2.5.1 I was able to train & write checkpoints then read them for inference many times, so I suspect the custom model is not part of the problem.

What can I do to get around this problem, short of retraining my model from scratch? Thanks.

P.S. Stepping back a bit, I’m pretty confused about the checkpoint generation philosophy in general, based on reading various user guide material. The material in the Ray Train manual is pretty thorough and easy to follow, but (for a PyTorch guy) it only shows how to use a raw torch.save() of the state dict. I have been using the rich Algorithm checkpoints so that I can start a new round of training from any checkpoint, not just use it for inference. But the checkpoint section of the RLlib user guide reads more like the pre-2.7 philosophy, as it never references the new Checkpoint or CheckpointConfig APIs. Is there something fundamentally different between using RLlib vs Train, or has the documentation just not fully caught up, or is there something I’m just not seeing?