What is the "right" way to train a parameterized environment?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m training an agent on an external simulation engine, for which I’ve made a custom environment. This environment takes two parameters:

  • the number of steps in an episode, and
  • the amount of time each step takes.

I need to train the agent on multiple values for each parameter.
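For reference, the environment reads those two values out of env_config roughly like this. This is only a simplified stand-in (the real simulation calls are omitted, and the spaces and reward here are just placeholders):

import gymnasium as gym
import numpy as np


class Pulse(gym.Env):
    """Simplified stand-in for my simulator-backed environment."""

    def __init__(self, config=None):
        config = config or {}
        # The two knobs I want to vary between training runs.
        self.max_steps = config.get("steps", 800)  # steps per episode
        self.step_time = config.get("time", 5)     # simulated time per step
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._t += 1
        # The real env would advance the external simulation by self.step_time here.
        reward = 0.1
        truncated = self._t >= self.max_steps
        return self.observation_space.sample(), reward, False, truncated, {}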

Up to now, I’ve been saving a checkpoint at the end of each run, building a new PPOConfig, loading the checkpoint, and then calling train(), all in a nested loop.

Conceptually, something like this:

from ray.rllib.algorithms.ppo import PPOConfig

# Pulse is my custom environment class (defined elsewhere).
train_config = {}
steps = [800, 600, 400]
times = [5, 3, 1]
checkpoint = None

for step in steps:
    for t in times:
        train_config['steps'] = step
        train_config['time'] = t

        pulse = PPOConfig().environment(env=Pulse, env_config=train_config).build()
        if checkpoint is not None:
            pulse.restore(checkpoint)

        # Train until the best episode reward clears the threshold for this episode length.
        while True:
            result = pulse.train()
            if result['episode_reward_max'] > step * 0.095:
                break

        # Save a checkpoint to restore from on the next parameter combination
        # (the exact return value of save() differs between Ray versions).
        checkpoint = pulse.save()

That seems to work OK, I think, but I can’t help thinking that I’m doing too much manually. There has to be a better way.

To that end, I read on the forum here that running through Tune is the “right” way, so I changed my code to this:

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

train_config = {}
steps = [800, 600, 400]
times = [5, 3, 1]
checkpoint = None

for step in steps:
    for t in times:
        train_config['steps'] = step
        train_config['time'] = t

        pulse = PPOConfig().environment(env=Pulse, env_config=train_config)

        analysis = tune.run(
            "PPO",
            name="new_api_loop",
            config=pulse,
            restore=checkpoint,
            stop={"env_runners/episode_return_max": step * 0.095},
            checkpoint_at_end=True,
        )
        checkpoint = analysis.get_last_checkpoint().path

So, which of those two methods is better and, more importantly, is there some other way I should be training? I would really love to get rid of that nested loop, but I can’t figure out how to initialize the environment with differing parameters.
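The kind of thing I imagined is letting Tune expand the parameter grid itself, roughly like the sketch below (a plain dict config with the same values as above). As far as I can tell, though, each combination would become an independent trial, so this wouldn’t carry the checkpoint from one combination to the next the way my loop does, and a dict-style stop criterion can no longer depend on the current 'steps' value:

config = {
    "env": Pulse,
    "env_config": {
        # Tune expands these into one trial per (steps, time) combination.
        "steps": tune.grid_search([800, 600, 400]),
        "time": tune.grid_search([5, 3, 1]),
    },
}

analysis = tune.run(
    "PPO",
    name="grid_over_env_config",
    config=config,
    stop={"env_runners/episode_return_max": 800 * 0.095},  # fixed threshold for all trials
    checkpoint_at_end=True,
)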

Thanks

Update:

More testing reveals that option 2 above (running through Tune) seems to overwrite the environment configuration passed into the PPOConfig, which makes it unusable for this application.

In short, it seems like calling algorithm.train() and manually checking the stop condition is the only way.