Error creating RLPredictor using restored checkpoint

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi experts.

I get the following error when restoring a checkpoint for a DQN model using Ray 2.1.0:

    Traceback (most recent call last):
      File "/home/stefan/PycharmProjects/RLProjects/rl_offline_trainer/inference_app.py", line 61, in <module>
        predictor = RLPredictor.from_checkpoint(checkpoint)
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/train/rl/rl_predictor.py", line 63, in from_checkpoint
        policy = checkpoint.get_policy(env)
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/train/rl/rl_checkpoint.py", line 42, in get_policy
        return Policy.from_checkpoint(checkpoint=self)["default_policy"]
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 256, in from_checkpoint
        policy_state = pickle.load(f)
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/_private/serialization.py", line 89, in _actor_handle_deserializer
        return ray.actor.ActorHandle._deserialization_helper(serialized_obj, outer_id)
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/actor.py", line 1281, in _deserialization_helper
        return worker.core_worker.deserialize_and_register_actor_handle(
      File "python/ray/_raylet.pyx", line 2137, in ray._raylet.CoreWorker.deserialize_and_register_actor_handle
      File "python/ray/_raylet.pyx", line 2106, in ray._raylet.CoreWorker.make_actor_handle
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 522, in load_actor_class
        actor_class = self._load_actor_class_from_gcs(
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 617, in _load_actor_class_from_gcs
        class_name = ensure_str(class_name)
      File "/home/stefan/anaconda3/envs/py38_ray2.1/lib/python3.8/site-packages/ray/_private/utils.py", line 289, in ensure_str
        assert isinstance(s, bytes)
    AssertionError

I use Ray Tune to execute several trials, saving a checkpoint at the end of each, using the following:

    # create tuner
    tuner = Tuner(

        # trainer
        trainer,

        # create tune configuration
        tune_config=self.create_tune_config(
            search_algo=search_algo,
            scheduler=scheduler
        ),

        # hyper-parameters
        param_space=self.create_param_space(),

        # specify run configuration
        run_config=RunConfig(
            stop=dict(training_iteration=2),
            checkpoint_config=CheckpointConfig(checkpoint_at_end=True),
            verbose=3
        )
    )

    # run trials
    result_grid = tuner.fit()

I then recreate the best checkpoint from its directory and use it to create an RLPredictor, at which point the above error occurs:

    # recreate checkpoint
    checkpoint = Checkpoint.from_directory(path=checkpoint_path)

    # create RLPredictor from checkpoint - error occurs when this executes
    predictor = RLPredictor.from_checkpoint(checkpoint)

From what I can tell, the checkpoint folder contains all the necessary artifacts. What am I doing wrong?
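For reference, here is the small sketch I used to sanity-check the folder. The expected file names are my assumption from inspecting what was written to disk, not an official list:

```python
import os

def find_missing_artifacts(checkpoint_path, expected=None):
    """Report which expected checkpoint files are missing.

    Hypothetical helper, not part of Ray. The default file names are an
    assumption about what an RLlib checkpoint contains; adjust them for
    your Ray version.
    """
    if expected is None:
        expected = ["rllib_checkpoint.json", "policy_state.pkl"]
    present = set()
    # Walk the whole tree, since some artifacts live in subdirectories
    # (e.g. per-policy state).
    for _root, _dirs, files in os.walk(checkpoint_path):
        present.update(files)
    return [name for name in expected if name not in present]
```

This returns an empty list for my checkpoint directory, so the folder at least looks complete.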

Thanks.
Stefan

Hi @steff,

The code looks good. Could you turn this into a GH issue with a complete repro script?

Cheers

Sure. Is there a web page that describes the steps?


Hi @steff ,

Nothing special, I can write the steps down:

  • Go to the official Ray repo
  • Click Issues, then create a new issue
  • Fill out the form; include the repro script and a short description of what you expected to happen vs. what is happening
  • Post the link here for reference

Hey @steff, I have the same problem in 2.3.1. Did you end up creating an issue for this or finding a resolution?

Created an issue for this. Here is the link: Cannot create RLPredictor using restored checkpoint in different Ray session · Issue #33995 · ray-project/ray · GitHub