Evaluate trained model on long episodes

I have an RLlib model that I’ve successfully trained on a custom environment, and now I’m looking to evaluate that model more comprehensively. I tried using a slightly modified version of rollout.py, but it didn’t allow for parallelization (as discussed here). I then tried the new parallel evaluation implementation provided by Trainer._evaluate, and it seemed to work well until I scaled up the episode length. With long episodes (~50M - 100M environment steps per episode), memory use grew steadily until my machine crashed.

I don’t think my custom environment is the problem: when I run long episodes without the RL agent, memory use stays minimal and does not appear to grow with the number of environment steps executed.

Digging into Trainer._evaluate, I see that it uses RolloutWorker.sample to run the episodes, which constructs and returns a sample batch. That batch scales with the number of steps in an episode, so it could be part of the problem. I’m not training the model, and I don’t need most of the data that would be included in that sample batch; all I really need is the total episode reward and, possibly, the number of steps in the episode.
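For a sense of scale, here’s a rough sketch (assuming the standard SampleBatch API; the "rewards" column and the count attribute are the usual ones) of how the two values I need relate to what the batch carries: one row per environment step, so a 50M-step episode means roughly 50M rows of observations, actions, rewards, etc. held in memory at once.

from ray.rllib.policy.sample_batch import SampleBatch

def summarize(batch: SampleBatch):
    # The batch holds one row per env step, but only these two scalars
    # are actually needed for my evaluation.
    total_reward = batch["rewards"].sum()
    num_steps = batch.count
    return total_reward, num_steps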

I believe that SimpleListCollector is involved in building this batch of experience, so one way to get the behavior I want might be to sub-class SampleCollector (as noted here in the docs). However, changing the behavior provided by SimpleListCollector might also change what data is passed to the model as it executes the rollout.

Is there a better way to get parallelized evaluation that can also handle long episodes?

I think I found a solution that works well for my use case. I started with a basic environment rollout loop, wrapped it in a class decorated with ray.remote, and then adjusted things until the rollout played nicely with my model (I’m using the RLlib LSTM wrapper, toggled by the use_lstm=True config option).

import numpy as np
import ray

from ray.tune.registry import ENV_CREATOR, _global_registry, get_trainable_cls


@ray.remote
class Evaluator:
    def __init__(self, args, config):
        # Rebuild the trainer from the checkpoint and create a local env instance.
        cls = get_trainable_cls(args.run)
        self.agent = cls(env=args.env, config=config)
        self.agent.restore(args.checkpoint)
        self.env = _global_registry.get(ENV_CREATOR, args.env)(config["env_config"])

    def rollout(self):
        episode_reward = 0.0
        step_count = 0
        done = False
        obs = self.env.reset()
        # The LSTM wrapper expects the previous action/reward and the recurrent state.
        action = np.zeros(self.env.action_space.shape, dtype=np.float32)
        state = self.agent.get_policy().get_initial_state()
        reward = 0.0
        while not done:
            action, state, _ = self.agent.compute_action(
                observation=obs,
                state=state,
                prev_action=action,
                prev_reward=reward,
            )
            obs, reward, done, info = self.env.step(action)
            episode_reward += reward
            step_count += 1
        self.env.reset()
        return episode_reward, step_count

This gets used in a small loop that’s similar to what’s seen in Trainer._evaluate:

import math

import ray

ray.init()

num_actors = 6
output = []
actors = [Evaluator.remote(args, config) for _ in range(num_actors)]
for batch in range(math.ceil(args.episodes / num_actors)):
    # Launch one rollout per actor, then block until all of them finish.
    results = [a.rollout.remote() for a in actors]
    output.extend(ray.get(results))

ray.shutdown()
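Summarizing the results is then just a matter of unpacking the (episode_reward, step_count) tuples collected in output, e.g. (minimal sketch using numpy):

import numpy as np

rewards, steps = zip(*output)
print("mean episode reward:", np.mean(rewards))
print("mean episode length:", np.mean(steps))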

args comes from something like ray.rllib.rollout.get_parser and config is the standard RLlib config dictionary.
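For completeness, a minimal config along these lines is enough for the snippets above (values are illustrative, not my exact setup; the important bit for the LSTM wrapper is model.use_lstm):

config = {
    "env_config": {},    # passed to the env constructor in Evaluator.__init__
    "num_workers": 0,    # the Evaluator actor steps the env itself
    "model": {
        "use_lstm": True,  # enables the RLlib LSTM wrapper mentioned above
    },
}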


Hey @cvanoort, yeah, I think the main problem is that our built-in evaluation mechanism forces the evaluation_config to have batch_mode=complete_episodes, so that we can run exactly n episodes (n according to evaluation_num_episodes). In your case (millions of timesteps per episode), this completely breaks. 🙂
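For context, these are the settings the built-in path is driven by (a sketch with illustrative values; the batch_mode override is what forces each evaluation worker to hold a full episode’s batch in memory):

config = {
    "evaluation_interval": 1,       # evaluate every n training iterations
    "evaluation_num_episodes": 10,  # n episodes per evaluation run
    "evaluation_num_workers": 6,    # parallel evaluation workers
    "evaluation_config": {
        # Forced internally so that whole episodes are collected,
        # which is exactly what blows up for multi-million-step episodes.
        "batch_mode": "complete_episodes",
    },
}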

You could also specify your own custom evaluation function, bypassing RLlib’s built-in evaluation logic as described above. See this example here for more details:
ray/rllib/examples/custom_eval.py
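Roughly, the hook looks like this (a sketch adapted from that example, not a drop-in implementation; the metrics helpers are the ones custom_eval.py uses):

import ray
from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes

def custom_eval_function(trainer, eval_workers):
    # Run one sampling round on each remote evaluation worker, then gather
    # the finished episodes and reduce them into a metrics dict.
    ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
    episodes, _ = collect_episodes(remote_workers=eval_workers.remote_workers())
    return summarize_episodes(episodes)

# Wired in via the trainer config:
config = {
    "custom_eval_function": custom_eval_function,
}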

Hi @sven1977, thanks for the reply. I think that batch_mode="complete_episodes" is the correct behavior for my use case. My custom environment is a simulation, and I’m evaluating the agent over fixed-length windows of simulation time (1 week as measured by the simulation clock); a given amount of simulation time may correspond to a different number of steps (agent interactions) across episodes. The bigger problem seems to be that the built-in evaluation functions collect a batch of experience like one might expect during training (i.e., a dictionary containing lists of actions, rewards, observations, etc.).

I considered using custom_eval_function, but it was unclear to me how to modify the example function in ray/rllib/examples/custom_eval.py to get the behavior I was looking for. In particular, line 116 of custom_eval.py uses collect_episodes and eval_workers, which I’m guessing are built on ray.rllib.evaluation.RolloutWorker and would run into the same issue with long evaluation episodes.

I’ve had a bit more time to work with the solution that I posted above, and I think that it’s doing what I need. I’ve marked it as a solution, in case anyone else runs into similar issues.
