How can I deploy my reinforcement learning model trained with Ray Tune using the new API stack?

I trained a reinforcement learning model with ray.tune and the PPO algorithm, which produced a series of checkpoints. When I restore the RLModule from a checkpoint following the documented example and run inference, the accumulated reward is much smaller than the episode_return_mean reported in progress.csv. Here is the script I use to apply the policy:

    import os

    import gymnasium as gym
    import numpy as np
    import torch

    from ray.rllib.core import DEFAULT_MODULE_ID
    from ray.rllib.core.columns import Columns
    from ray.rllib.core.rl_module.rl_module import RLModule

    # `agentfile`, `checkpoint`, and `config` are defined earlier in my script.
    # Restore only the RLModule component from the Tune/PPO checkpoint.
    rl_module = RLModule.from_checkpoint(
        os.path.join(
            agentfile + '/checkpoint_000' + str(checkpoint) + '/',
            "learner_group",
            "learner",
            "rl_module",
            DEFAULT_MODULE_ID,
        )
    )

    env = gym.make(config["env"], config=config["env_config"])

    obs, info = env.reset()

    num_episodes = 0
    max_episodes = 9999
    episode_return = 0.0
    max_reward = 0.0
    to_save = True  # track the best episode return

    while num_episodes < max_episodes:
        input_dict = {Columns.OBS: torch.from_numpy(obs).unsqueeze(0)}

        rl_module_out = rl_module.forward_inference(input_dict)

        # The first 22 entries of the dist inputs are (presumably) the
        # Gaussian means; the remaining entries are the log-stds.
        action_dist_params = rl_module_out["action_dist_inputs"][0].numpy()
        greedy_action = np.clip(
            action_dist_params[0:22],
            a_min=env.action_space.low[0],
            a_max=env.action_space.high[0],
        )

        obs, reward, terminated, truncated, _ = env.step(greedy_action)
        episode_return += reward

        if terminated or truncated:
            if to_save and episode_return > max_reward:
                max_reward = episode_return

            print('========')
            print(f"Episode done: Total reward = {episode_return}")
            obs, info = env.reset()
            num_episodes += 1
            episode_return = 0.0

In fact, with the old API stack my environment accumulated reasonable rewards through compute_action, but in the new stack compute_single_action is disabled. I noticed that when SingleAgentEnvRunner performs sampling, the connectors are used frequently, both on env reset and on step. This means the action_dist_inputs returned by forward_inference have to go through several processing steps before they can be used as env input (see the hand-rolled sketch after the snippet below):

    # single_agent_env_runner.py, line 319
    # Module-to-env connector.
    to_env = self._module_to_env(
        rl_module=self.module,
        batch=to_env,
        episodes=episodes,
        explore=explore,
        shared_data=shared_data,
        metrics=self.metrics,
        metrics_prefix_key=(MODULE_TO_ENV_CONNECTOR,),
    )

Is this the reason why I can’t deploy the policy correctly? Is there a more elegant solution than calling _module_to_env myself?

Thanks so much!

I’ve got working inference code up here. classes/inference_helpers.py should have what you’re looking for.


Thanks, MCW_Lad!!!

I also found a less elegant way, but it reproduces the exact results:

    from ray.rllib.algorithms.algorithm import Algorithm

    # Rebuild the full algorithm (not just the RLModule) from the checkpoint.
    new_ppo = Algorithm.from_checkpoint(checkpoint_dir)

    config = new_ppo.get_config()

    # Sample with the local EnvRunner so that the exact same connector
    # pipelines are applied as during training.
    env_runner_group = new_ppo.env_runner_group
    local_env_runner = env_runner_group.local_env_runner

    local_env_runner.config = config
    local_env_runner.make_env()

    episodes = local_env_runner.sample(
        num_episodes=1,
    )

    print(local_env_runner.get_metrics())

    # Unwrap the vectorized/wrapped env to reach my custom env's method.
    local_env_runner.env.env.envs[0].env.env.CUSTOM_Function()
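
If you want a purely greedy evaluation, I think you can also pass explore=False to sample() and read the returns straight off the returned episodes. A rough sketch (the explore argument and SingleAgentEpisode.get_return() are assumptions based on the Ray version I looked at):

    # Hedged sketch: greedy rollouts through the EnvRunner, reading
    # per-episode returns from the returned SingleAgentEpisode objects.
    episodes = local_env_runner.sample(num_episodes=5, explore=False)
    for i, episode in enumerate(episodes):
        print(f"Episode {i}: return = {episode.get_return()}")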

Hopefully a more official API for this kind of deployment will be provided at some point.

Regards!!!


I am not entirely sure if this is relevant to your case; if not, it is at least nice to know: are you aware that episode_return_mean is smoothed over config.metrics_num_episodes_for_smoothing episodes? See the topic I just posted:

In short, the min/mean/max you obtain via local_env_runner.get_metrics() are computed over the last metrics_num_episodes_for_smoothing sampled episodes, not bound to a single iteration.
Furthermore (depending on your Ray version), restoring metrics is broken, see [RLlib] Checkpoint metrics loading with Tune is broken in 2.47.0 · Issue #53877 · ray-project/ray · GitHub. In your case, I think the smoothing over the old episodes (if your window reaches back that far) can be off or lost, so you possibly only get the smoothed value from episodes sampled after you loaded the checkpoint.
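
If you want the reported numbers to track individual episode returns more closely, you could (as far as I know) shrink that smoothing window via the reporting settings. A small sketch, assuming the metrics_num_episodes_for_smoothing argument of AlgorithmConfig.reporting():

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")  # placeholder env for illustration
        # Smooth episode_return_mean over fewer episodes so it tracks
        # single-episode returns more closely.
        .reporting(metrics_num_episodes_for_smoothing=10)
    )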

Maybe you should cross-check whether things are actually correct but just not logged the way you would expect.
Cheers, and good luck.