What is the "right" way to train a parameterized environment?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m training an agent on an external simulation engine, for which I’ve made a custom environment. This environment takes two parameters:

  • the number of steps in an episode, and
  • the amount of time each step takes.

I need to train the agent on multiple values for each parameter.
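For reference, the environment reads those two values out of env_config roughly like this. This is only a simplified stand-in (the real simulation calls are omitted, and the spaces and reward here are just placeholders):

import gymnasium as gym
import numpy as np


class Pulse(gym.Env):
    """Simplified stand-in for my simulator-backed environment."""

    def __init__(self, config=None):
        config = config or {}
        # The two knobs I want to vary between training runs.
        self.max_steps = config.get("steps", 800)  # steps per episode
        self.step_time = config.get("time", 5)     # simulated time per step
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._t += 1
        # The real env would advance the external simulation by self.step_time here.
        reward = 0.1
        truncated = self._t >= self.max_steps
        return self.observation_space.sample(), reward, False, truncated, {}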

Up to now, I’ve been saving a checkpoint at the end of each run, building a new PPOConfig, loading the checkpoint, and then calling train(), all in a nested loop.

Conceptually, something like this:

from ray.rllib.algorithms.ppo import PPOConfig

# Pulse is my custom environment class (defined elsewhere).
train_config = {}
steps = [800, 600, 400]
times = [5, 3, 1]
checkpoint = None

for step in steps:
    for t in times:
        train_config['steps'] = step
        train_config['time'] = t

        pulse = PPOConfig().environment(env=Pulse, env_config=train_config).build()
        if checkpoint is not None:
            pulse.restore(checkpoint)

        # Train until the best episode reward clears the threshold for this episode length.
        while True:
            result = pulse.train()
            if result['episode_reward_max'] > step * 0.095:
                break

        # Save a checkpoint to restore from on the next parameter combination
        # (the exact return value of save() differs between Ray versions).
        checkpoint = pulse.save()

That seems to work OK, I think, but I can’t help thinking that I’m doing too much manually. There has to be a better way.

To that end, I read on the forum here that running through Tune is the “right” way, so I changed my code to this:

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

train_config = {}
steps = [800, 600, 400]
times = [5, 3, 1]
checkpoint = None

for step in steps:
    for t in times:
        train_config['steps'] = step
        train_config['time'] = t

        pulse = PPOConfig().environment(env=Pulse, env_config=train_config)

        analysis = tune.run(
            "PPO",
            name="new_api_loop",
            config=pulse,
            restore=checkpoint,
            stop={"env_runners/episode_return_max": step * 0.095},
            checkpoint_at_end=True,
        )
        checkpoint = analysis.get_last_checkpoint().path

So, which of those two methods is better and, more importantly, is there some other way I should be training? I would really love to get rid of that nested loop, but I can’t figure out how to initialize the environment with differing parameters.
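The kind of thing I imagined is letting Tune expand the parameter grid itself, roughly like the sketch below (a plain dict config with the same values as above). As far as I can tell, though, each combination would become an independent trial, so this wouldn’t carry the checkpoint from one combination to the next the way my loop does, and a dict-style stop criterion can no longer depend on the current 'steps' value:

config = {
    "env": Pulse,
    "env_config": {
        # Tune expands these into one trial per (steps, time) combination.
        "steps": tune.grid_search([800, 600, 400]),
        "time": tune.grid_search([5, 3, 1]),
    },
}

analysis = tune.run(
    "PPO",
    name="grid_over_env_config",
    config=config,
    stop={"env_runners/episode_return_max": 800 * 0.095},  # fixed threshold for all trials
    checkpoint_at_end=True,
)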

Thanks

Update:

More testing reveals that option 2 above (running through Tune) seems to overwrite the environment configuration passed into the PPOConfig, which makes it unusable for this application.

In short, it seems like calling algorithm.train() and manually checking the stop condition is the only way.