How to load from a checkpoint and call the environment

Severity: between Low (it annoys or frustrates me for a moment) and Medium (it contributes significant difficulty to completing my task, but I can work around it).

I’m working on my master’s thesis, which is about applying reinforcement learning to a PlayStation 1 game.
This is a 30-minute summary of what I did:

But here is the situation.
My gymnasium environment makes a TCP connection to an emulator upon instantiating the environment.

I originally had a lot of trouble getting this working, but here is my final workaround:

import gymnasium
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from myRTClass import MyGranTurismoRTGYM, DEFAULT_CONFIG_DICT

# rtgym configuration: attach my interface and its timing parameters
my_config = DEFAULT_CONFIG_DICT
my_config["interface"] = MyGranTurismoRTGYM
my_config["time_step_duration"] = 0.05
my_config["start_obs_capture"] = 0.05
my_config["time_step_timeout_factor"] = 1.0
my_config....

def env_creator(env_config):
  env = gymnasium.make("real-time-gym-v1", config=my_config)
  return env  # return an env instance

from ray.tune.registry import register_env
register_env("gt-rtgym-env-v1", env_creator)

ray.init()

algo = (
    PPOConfig()
    .environment(
        env="gt-rtgym-env-v1",
        disable_env_checking=True,
        render_env=False,
        )
...
    .build()
)

# Some training loops and then..

path_to_checkpoint = algo.save()

And things work well: the agent is learning and getting better.

Now I wish to reload this training (based on this example shown on the documentation site):

import gymnasium
import ray
from ray.rllib.algorithms.algorithm import Algorithm
from myRTClass import MyGranTurismoRTGYM, DEFAULT_CONFIG_DICT

my_config = DEFAULT_CONFIG_DICT
my_config["interface"] = MyGranTurismoRTGYM
my_config["time_step_duration"] = 0.05
my_config["start_obs_capture"] = 0.05
my_config["time_step_timeout_factor"] = 1.0

def env_creator(env_config):
  env = gymnasium.make("real-time-gym-v1", config=my_config)
  return env  # return an env instance

from ray.tune.registry import register_env
register_env("gt-rtgym-env-v1", env_creator)
ray.init()

algo = Algorithm.from_checkpoint("C:/PPO_gt-rtgym-env-v1_2023-05-13_00-31-13_kmaa6ii/checkpoint_000161")

episode_reward = 0
terminated = truncated = False
obs, info = env.reset()

while not terminated and not truncated:
    action = algo.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action) # how to access the same env that the algo is re-attempting to create?

    episode_reward += reward

I cannot use the above method because, in my case, algo = Algorithm.from_checkpoint("C:/PPO_gt-rtgym-env-v1_2023-05-13_00-31-13_kmaa6ii/checkpoint_000161") re-instantiates an environment (I can see the agent trying to connect to the emulator via TCP).

Then obviously obs, info = env.reset() does not work, since there is no env.
If I attempt env = gymnasium.make("real-time-gym-v1"), that obviously also does not work, since it tries to instantiate a second environment (which would try to connect to another emulator).

So - question here is:

  1. Is there a way for me to directly call reset() on the environment that the restored algorithm creates (something like the restored algo’s environment)?

  2. Is there a way for me to drop the restored environment and make use of my own one instead?

FYI, I searched a lot on this, but it is a bit challenging with the changes in the API.

I have also since tried loading the policy instead of the full algorithm:

from ray.rllib.policy.policy import Policy

my_restored_policy = Policy.from_checkpoint("C:/Users/nadir/ray_results/PPO_gt-rtgym-env-v1_2023-05-13_00-31-13_kmaa6ii/checkpoint_000161")

env = gymnasium.make("real-time-gym-v1", config=my_config)

episode_reward = 0
terminated = truncated = False
obs, info = env.reset()

while not terminated and not truncated:
    action = my_restored_policy.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward

I get the following error:

AttributeError: 'dict' object has no attribute 'compute_single_action'

New attempt… to remove the environment…

#1 Load the algorithm again…

from ray.rllib.algorithms.algorithm import Algorithm
import ray
from myRTClass import MyGranTurismoRTGYM, DEFAULT_CONFIG_DICT
import gymnasium


my_config = DEFAULT_CONFIG_DICT
my_config["interface"] = MyGranTurismoRTGYM
my_config["time_step_duration"] = 0.05
my_config["start_obs_capture"] = 0.05
my_config["time_step_timeout_factor"] = 1.0
my_config["act_buf_len"] = 3
my_config["reset_act_buf"] = False
my_config["benchmark"] = True
my_config["benchmark_polyak"] = 0.2

def env_creator(env_config):
  env = gymnasium.make("real-time-gym-v1", config=env_config)
  return env  # return an env instance

from ray.tune.registry import register_env
register_env("gt-rtgym-env-v1", lambda config: env_creator(my_config)) # better way

ray.init()

# env = gymnasium.make("real-time-gym-v1")
algo = Algorithm.from_checkpoint("C:/Users/nadir/ray_results/PPO_gt-rtgym-env-v1_2023-05-13_00-31-13_kmaa6ii/checkpoint_000161")
algo.stop()

Get the policy / model to extract the obs_space and action_space:

policy = algo.get_policy()
model = policy.model

Create a new config:

new_config = {
    # Indicate that the Algorithm we setup here doesn't need an actual env.
    "env": None,
    "observation_space": model.obs_space,
    "action_space": model.action_space,
    # ...
}

Change the loaded algo’s config:

algo.config = new_config

Manually load an environment:

env = gymnasium.make("real-time-gym-v1", config=my_config)

Run things the classic gymnasium way:

episode_reward = 0
terminated = truncated = False
obs, info = env.reset()

while not terminated and not truncated:
    action = algo.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward

Errors:

   1486 # `unsquash_action` is None: Use value of config['normalize_actions'].
   1487 if unsquash_action is None:
-> 1488     unsquash_action = self.config.normalize_actions
   1489 # `clip_action` is None: Use value of config['clip_actions'].
   1490 elif clip_action is None:

AttributeError: 'dict' object has no attribute 'normalize_actions'

I guess I lost the pre-processing this way…

Any ideas anyone?
I would have thought that deploying the model for inference/evaluation that has already been trained is a key purpose of all of this.

Hi @NDR008,

Cool work!
Config dicts are legacy. We started deprecating them because RLlib needs to do quite a bit of validation, among a couple of other reasons.
Since you are using PPO (and PPOConfig), you can turn your new_config into a PPOConfig with PPOConfig.from_dict(new_config).
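
Something along these lines (an untested sketch, reusing the model.obs_space / model.action_space you already extracted):

from ray.rllib.algorithms.ppo import PPOConfig

# Rebuild the "env-less" config as a PPOConfig object instead of a plain dict,
# so attribute lookups like config.normalize_actions work again.
new_config = PPOConfig.from_dict({
    "env": None,  # no real env needed for pure inference
    "observation_space": model.obs_space,
    "action_space": model.action_space,
})

algo.config = new_config  # replaces the plain dict that caused the AttributeError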

Let me know how it goes!

I would have thought that deploying the model for inference/evaluation that has already been trained is a key purpose of all of this.

Indeed. Our own abstractions (Policy/Model) have made this a little harder than it needs to be in the past, though. Keep your eyes open for the new RLModules API (RL Modules (Alpha) — Ray 2.8.0) - one of our key reasons for going for this API is to make inference and serving easier.

Have a great day!

This is an unintuitive part of RLlib.
Policy.from_checkpoint() allows loading algorithm checkpoints. In this case it will return a dict of policies. You can grab the single policy from that dict and run inference with it.
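
For example (an untested sketch; "default_policy" is the standard single-agent policy ID and is an assumption here):

from ray.rllib.policy.policy import Policy

# Loading an *algorithm* checkpoint via Policy.from_checkpoint() returns a
# dict mapping policy IDs to Policy objects, not a single Policy.
policies = Policy.from_checkpoint(
    "C:/Users/nadir/ray_results/PPO_gt-rtgym-env-v1_2023-05-13_00-31-13_kmaa6ii/checkpoint_000161"
)
my_restored_policy = policies["default_policy"]

# Policy.compute_single_action() returns (action, state_outs, extra_fetches).
action, _, _ = my_restored_policy.compute_single_action(obs)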

And a heads up: Policy does not hold a preprocessor and does not normalize/unsquash anything.
So be aware that usually you should use the Algorithm (in this case PPO) to compute actions!

How can I use the algorithm to compute the action?

The Algorithm class has almost the same API.
You can use Algorithm.compute_single_action or Algorithm.compute_actions.
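
For example, with an observation coming from your own env instance (sketch):

# The observation must be passed explicitly; calling compute_single_action()
# with no arguments raises an AssertionError asking for an observation or input_dict.
obs, info = env.reset()
terminated = truncated = False
while not terminated and not truncated:
    action = algo.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action)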


I load my environment config and re-register it:

if not debugAsGym:
    def env_creator(env_config):
        env = gymnasium.make("real-time-gym-v1", config=env_config)
        return env  # return an env instance

    from ray.tune.registry import register_env
    register_env("gt-rtgym-env-v1", lambda config: env_creator(my_config)) 

I then load the checkpoint:

from ray.rllib.algorithms.algorithm import Algorithm
algo = Algorithm.from_checkpoint("C:/Users/mrX/ray_results/PPO_gt-rtgym-env-v1_2023-05-19_07-37-37z3d6v2w2/checkpoint_002000")

Then Ray reloads things, including a connection to my environment.

(RolloutWorker pid=12400) GT Real Time instantiated
(RolloutWorker pid=12400) GT AI Server instantiated for rtgym
(RolloutWorker pid=12400) still simple reward system
(RolloutWorker pid=12400) starting up on localhost port 9999
(RolloutWorker pid=12400) Waiting for a connection
(RolloutWorker pid=12400) Connection from ('127.0.0.1', 57007)
2023-05-21 13:50:06,340	INFO trainable.py:172 -- Trainable.setup took 11.294 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.

But I do not understand how to use the algorithm to take actions…

algo.compute_single_action()

Leads to:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
j:\git\TensorFlowPSX\Py\rrlib_experiments_PPO_mode4_Supra_Drag.ipynb Cell 14 in 1
----> 1 algo.compute_single_action()

File c:\Users\mrX\anaconda3\envs\GTAI2\lib\site-packages\ray\rllib\algorithms\algorithm.py:1526, in Algorithm.compute_single_action(self, observation, state, prev_action, prev_reward, info, input_dict, policy_id, full_fetch, explore, timestep, episode, unsquash_action, clip_action, unsquash_actions, clip_actions, **kwargs)
   1524     observation = input_dict[SampleBatch.OBS]
   1525 else:
-> 1526     assert observation is not None, err_msg
   1528 # Get the policy to compute the action for (in the multi-agent case,
   1529 # Trainer may hold >1 policies).
   1530 policy = self.get_policy(policy_id)

AssertionError: Provide either `input_dict` OR [`observation`, ...] as args to Trainer.compute_single_action!

I cannot pass the environment’s observation to it, since my environment session was already triggered while re-loading the algorithm :frowning:

Hi @NDR008,

If this was causing me such an issue for so long this is what I would do.

Create a Dummy environment and register it with the same name as the one used for training.

Restore from checkpoint. This will now create and reset dummy environments. The rllib RandomEnv is a good candidate here.

Create my real environment and then compute actions in a loop as you currently have.


manny is right, you’ll need to hack this for now.
A dummy environment sounds straightforward.

Will the dummy env have to have the same observation space / action space?

Could you give me some boilerplate code on how to use the RandomEnv?

Thanks in advance.

@NDR008 Yes.
You can import the random env from here: ray/random_env.py at master · ray-project/ray · GitHub
The file also contains two child classes of RandomEnv that show how to use it.
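
For boilerplate, something along these lines should work (untested sketch; the spaces below are placeholders for your real observation/action spaces, my_config is the rtgym config from earlier, and the RandomEnv import path can differ between Ray versions):

import gymnasium
import ray
from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.examples.env.random_env import RandomEnv
from ray.tune.registry import register_env

# Placeholder spaces -- replace with the real env's observation/action spaces.
dummy_spaces = {
    "observation_space": gymnasium.spaces.Box(low=-1.0, high=1.0, shape=(10,)),
    "action_space": gymnasium.spaces.Box(low=-1.0, high=1.0, shape=(3,)),
}

# Register the dummy under the SAME name used during training, so the restored
# RolloutWorkers instantiate RandomEnv instead of connecting to the emulator.
register_env("gt-rtgym-env-v1", lambda config: RandomEnv(dummy_spaces))

ray.init()
algo = Algorithm.from_checkpoint("C:/Users/mrX/ray_results/PPO_gt-rtgym-env-v1_2023-05-19_07-37-37z3d6v2w2/checkpoint_002000")

# Now create the real rtgym environment manually and run the usual loop.
env = gymnasium.make("real-time-gym-v1", config=my_config)
obs, info = env.reset()
terminated = truncated = False
episode_reward = 0
while not terminated and not truncated:
    action = algo.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward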