Restore and serve from a remote checkpoint

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I have used Ray Tune to train and sync checkpoints to S3.
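
For reference, the training run synced results to S3 via Tune's sync configuration; this is a minimal sketch of roughly what I ran, with the bucket name and stopping criterion as placeholders:

    from ray import tune

    # Train PPO with Tune and sync results (including checkpoints) to S3.
    # The upload_dir bucket is a placeholder, not my real bucket.
    tune.run(
        "PPO",
        config={
            "env": "CartPole-v0",
            "framework": "torch",
        },
        stop={"training_iteration": 10},
        checkpoint_freq=1,
        checkpoint_at_end=True,
        sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/ray-results"),
    )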

I want to restore a PPO Agent with a checkpoint stored in S3.

The Trainable class in ray/trainable.py at tag ray-1.12.0 of ray-project/ray has the following code:

    def restore(self, checkpoint_path):
        """Restores training state from a given model checkpoint.
        These checkpoints are returned from calls to save().
        Subclasses should override ``load_checkpoint()`` instead to
        restore state.
        This method restores additional metadata saved with the checkpoint.
        `checkpoint_path` should match with the return from ``save()``.
        `checkpoint_path` can be
        `~/ray_results/exp/MyTrainable_abc/
        checkpoint_00000/checkpoint`. Or,
        `~/ray_results/exp/MyTrainable_abc/checkpoint_00000`.
        `self.logdir` should generally be corresponding to `checkpoint_path`,
        for example, `~/ray_results/exp/MyTrainable_abc`.
        `self.remote_checkpoint_dir` in this case, is something like,
        `REMOTE_CHECKPOINT_BUCKET/exp/MyTrainable_abc`
        """
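
For the purely local case described in the docstring, restoring is straightforward; here is a minimal sketch using the docstring's illustrative path (not a real file on my machine):

    import ray.rllib.agents.ppo as ppo

    trainer = ppo.PPOTrainer(
        config={"framework": "torch", "num_workers": 0},
        env="CartPole-v0",
    )
    # Restore from a local checkpoint file returned by save(); the path is
    # the docstring's example, shown here only to illustrate the call.
    trainer.restore("~/ray_results/exp/MyTrainable_abc/checkpoint_00000/checkpoint")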

However, I am unsure of the best practice for implementing this when using the following Serve deployment:

    from ray import serve
    from starlette.requests import Request
    import ray.rllib.agents.ppo as ppo

    @serve.deployment(route_prefix="/cartpole-ppo")
    class ServePPOModel:
        def __init__(self, checkpoint_path) -> None:
            self.trainer = ppo.PPOTrainer(
                config={
                    "framework": "torch",
                    "num_workers": 0,
                },
                env="CartPole-v0",
            )
            # Mirror the Trainable attributes mentioned in the docstring
            # above, in the hope that restore() will pull from S3.
            self.uses_cloud_checkpointing = True
            self.remote_checkpoint_dir = checkpoint_path

            self.trainer.restore(checkpoint_path)

        async def __call__(self, request: Request):
            json_input = await request.json()
            obs = json_input["observation"]

            # Compute a single action for the supplied observation.
            action = self.trainer.compute_single_action(obs)
            return {"action": int(action)}
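
For completeness, here is how I start and query the deployment; a minimal sketch, where the S3 URI passed to deploy() is a placeholder and is exactly the part I am unsure about:

    import requests
    import ray
    from ray import serve

    ray.init()
    serve.start()

    # Deploy, passing the (placeholder) S3 checkpoint URI to __init__.
    ServePPOModel.deploy(
        "s3://my-bucket/ray-results/exp/MyTrainable_abc/checkpoint_00000/checkpoint"
    )

    # Query the endpoint with a CartPole observation (4 floats).
    resp = requests.get(
        "http://localhost:8000/cartpole-ppo",
        json={"observation": [0.0, 0.0, 0.0, 0.0]},
    )
    print(resp.json())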

Are there best practices for doing this?

Hi @peterhaddad3121, thank you for your question!

I want to share two recommendations here: