I trained a reinforcement learning model with ray.tune and the PPO algorithm, which produced a series of checkpoints. When I restore the RLModule from a checkpoint following the documented example and run inference, the accumulated reward is much smaller than the episode_return_mean reported in progress.csv. Here is the script I use to apply the policy:
import os

import gymnasium as gym
import numpy as np
import torch

from ray.rllib.core import DEFAULT_MODULE_ID
from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.rl_module import RLModule

rl_module = RLModule.from_checkpoint(
    os.path.join(
        agentfile + '/checkpoint_000' + str(checkpoint) + '/',
        "learner_group",
        "learner",
        "rl_module",
        DEFAULT_MODULE_ID,
    )
)

env = gym.make(config["env"], config=config["env_config"])
obs, info = env.reset()

num_episodes = 0
max_episodes = 9999
episode_return = 0.0
max_reward = 0

while num_episodes < max_episodes:
    # Batch the single observation and run it through the RLModule.
    input_dict = {Columns.OBS: torch.from_numpy(obs).unsqueeze(0)}
    rl_module_out = rl_module.forward_inference(input_dict)
    action_dist_params = rl_module_out[Columns.ACTION_DIST_INPUTS][0].numpy()

    # Use the first 22 distribution parameters (the means) as the greedy
    # action and clip them to the action-space bounds.
    greedy_action = np.clip(
        action_dist_params[0:22],
        a_min=env.action_space.low[0],
        a_max=env.action_space.high[0],
    )

    obs, reward, terminated, truncated, _ = env.step(greedy_action)
    episode_return += reward

    if terminated or truncated:
        if to_save and episode_return > max_reward:
            max_reward = episode_return
        print('========')
        print(f"Episode done: Total reward = {episode_return}")
        obs, info = env.reset()
        num_episodes += 1
        episode_return = 0.0
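For completeness, this is what I would expect the "pure module" way of getting a greedy action to look like, i.e. letting the module's inference distribution class turn action_dist_inputs into an action instead of slicing the parameter vector by hand. I'm going by the new-API examples here, so get_inference_action_dist_cls(), from_logits() and to_deterministic() being the right entry points is my assumption:

# Sketch, assuming get_inference_action_dist_cls() / from_logits() /
# to_deterministic() are the intended new-API entry points.
import torch
from ray.rllib.core.columns import Columns

action_dist_cls = rl_module.get_inference_action_dist_cls()

rl_module_out = rl_module.forward_inference(
    {Columns.OBS: torch.from_numpy(obs).unsqueeze(0)}
)
action_dist = action_dist_cls.from_logits(
    rl_module_out[Columns.ACTION_DIST_INPUTS]
)
# The deterministic version of the distribution should give the greedy
# (mean) action; the output is still batched, so take index 0.
greedy_action = action_dist.to_deterministic().sample()[0].numpy()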
For what it's worth, with the old API stack my environment accumulated reasonable rewards when I used compute_action; in the new API stack, however, compute_single_action is disabled. I also noticed that when SingleAgentEnvRunner samples, it runs the connectors on every env reset and step. In other words, the action-distribution inputs produced by forward_inference are processed further (by the module-to-env connector) before they are used as input to the env:
# single_agent_env_runner.py, line 319
# Module-to-env connector.
to_env = self._module_to_env(
    rl_module=self.module,
    batch=to_env,
    episodes=episodes,
    explore=explore,
    shared_data=shared_data,
    metrics=self.metrics,
    metrics_prefix_key=(MODULE_TO_ENV_CONNECTOR,),
)
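To mirror what the EnvRunner does, I suppose I could rebuild the same connector pipelines from my training AlgorithmConfig and drive them myself, roughly as sketched below. This is only what I have in mind, not something I have verified: algo_config stands for the actual AlgorithmConfig I trained with (my config above is just a plain dict), and I'm assuming build_env_to_module_connector() / build_module_to_env_connector() and the SingleAgentEpisode calls are the public way to do this; the exact signatures may be off.

# Sketch only: rebuild the connector pipelines the EnvRunner uses and call
# them manually. "algo_config" is the AlgorithmConfig used for training
# (an assumption on my side); method names/signatures are taken from the
# new-API connector examples and not verified on my Ray version.
from ray.rllib.core.columns import Columns
from ray.rllib.env.single_agent_episode import SingleAgentEpisode

env_to_module = algo_config.build_env_to_module_connector(env)
module_to_env = algo_config.build_module_to_env_connector(env)

obs, info = env.reset()
episode = SingleAgentEpisode(
    observations=[obs],
    observation_space=env.observation_space,
    action_space=env.action_space,
)
shared_data = {}

# Env -> module: builds the module's input batch from the episode.
to_module = env_to_module(
    rl_module=rl_module,
    episodes=[episode],
    explore=False,
    shared_data=shared_data,
)
rl_module_out = rl_module.forward_inference(to_module)

# Module -> env: turns action_dist_inputs into an env-ready action
# (the step my manual slicing/clipping was trying to replicate).
to_env = module_to_env(
    rl_module=rl_module,
    batch=rl_module_out,
    episodes=[episode],
    explore=False,
    shared_data=shared_data,
)
action = to_env[Columns.ACTIONS][0]

obs, reward, terminated, truncated, info = env.step(action)
episode.add_env_step(
    obs,
    action,
    reward,
    terminated=terminated,
    truncated=truncated,
)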
Is this the reason why I can't deploy the policy correctly? Is there a more elegant solution than calling _module_to_env myself?
Thanks so much!