Setting up an environment for optimization with simulation software


I am currently trying to use reinforcement learning to optimize some parameters in a simulation software,
so I built a custom environment to run these simulations. At the moment I am having a lot of trouble getting decent results.

To explain what I want to do:
Optimize a couple of values (one or two) for different boundary conditions. The boundary conditions are completely independent of each other, and there are no real timesteps.

What I am currently doing:
I define the limits of my boundary conditions (= BC) in the environment, where the BC are initialized randomly. The agent then predicts an action based on these BC, and I take a step in my environment. From the action, the simulation calculates the output, and a reward is computed from that (just a linear function: the absolute distance between the output and the desired output). After that I set done to True, and in the reset function the BC are initialized randomly again.
However, I noticed the performance is very poor.
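For reference, the setup described above can be sketched as a one-step environment, with the reward written as the negative absolute distance so that maximizing the reward minimizes the error. The BC limits, the `simulate` function, and the target value are placeholders for the actual simulation:

```python
import random

class OneStepSimEnv:
    """Sketch of a one-step environment: each episode is a single
    boundary-conditions -> action -> reward interaction."""

    # assumed limits of the two boundary conditions
    BC_LOW = [0.0, 1000.0]
    BC_HIGH = [50.0, 5000.0]

    def __init__(self, target=8.0):
        self.target = target
        self.bc = None

    def reset(self):
        # draw new, independent boundary conditions each episode
        self.bc = [random.uniform(lo, hi)
                   for lo, hi in zip(self.BC_LOW, self.BC_HIGH)]
        return self.bc

    def step(self, action):
        output = self.simulate(self.bc, action)
        # negative absolute distance to the desired output:
        # the maximum reward (0) is reached exactly at the target
        reward = -abs(output - self.target)
        done = True  # one step per episode
        return self.bc, reward, done, {}

    def simulate(self, bc, action):
        # dummy stand-in for the simulation software
        return action[0] * bc[0]
```

Since `done` is always True, each "episode" is just one BC/action pair, which matches the contextual-bandit framing discussed below.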

As a demonstration:
I want to optimize the simulation for the values [range(0, 50), range(1000, 5000)]
BC: [11.2392, 3092.9291]
Action: [0.2]
Reward: 0.1
new BC: [2.1242, 1234.0212]

I am currently implementing equidistant initialization of the BC, e.g.:
BC: [0, 1000]
newBC: [10, 2000]
newBC: [20, 3000]
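Such an equidistant sweep could be generated with a small helper that steps all BC in lock-step, as in the example above. The BC ranges and step count here are assumptions for illustration:

```python
def equidistant_bcs(lows, highs, n_points):
    """Sweep each boundary condition from low to high in lock-step,
    returning one BC tuple per environment reset."""
    return [
        tuple(lo + i * (hi - lo) / (n_points - 1) for lo, hi in zip(lows, highs))
        for i in range(n_points)
    ]

# BC ranges assumed from the example: [0, 50] and [1000, 5000]
schedule = equidistant_bcs([0, 1000], [50, 5000], n_points=5)
# -> [(0.0, 1000.0), (12.5, 2000.0), (25.0, 3000.0), (37.5, 4000.0), (50.0, 5000.0)]
```

If the BC are meant to be independent, a full grid over all combinations (e.g. via `itertools.product`) would cover the space more evenly than a lock-step sweep.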

Are there any recommendations on how to set up such an environment?
Should I set done=True after each BC?

Thanks in advance


Hey @SebastianBo1995, does the agent actually see the boundaries (in the observations), so it can learn to act close to / far away from the boundaries? I’m assuming the answer is yes, though.

It should learn either way, whether you set done=True after each action (more like a contextual bandit) or not (episodic).

Which algo are you using and what’s your config?


Thanks for the response.

Yes, the agent sees all necessary boundary conditions in the observation, but not the action. From the action and the observation, the simulation calculates an output.

I tried three different algorithms: PPO, TD3 and SAC. The SAC agent, however, often diverges and outputs a NaN action. I noticed the agents get a lot more stable when I scale the observations, probably because of their large scale difference, e.g. one ranges from 0.8 to 1.8 and the other from 500 to 8,000.

The reward function looks like this:

The PPO config is:

```python
config['model'] = {'fcnet_hiddens': [100, 100], 'fcnet_activation': 'relu'}  # for PPO
config['clip_param'] = 0.2
config['num_sgd_iter'] = 10
config['sgd_minibatch_size'] = 32
config['num_workers'] = 16
config['num_envs_per_worker'] = 1
config['sample_async'] = False
config['lr'] = 0.001
config['num_gpus'] = 0
config['framework'] = 'tf2'
config['rollout_fragment_length'] = 1
config['train_batch_size'] = 32
config['explore'] = True
config['normalize_actions'] = False
config['gamma'] = 0.99
```

The TD3 config uses the same common settings as PPO, plus:

```python
config['actor_hiddens'] = [100, 100]
config['actor_hidden_activation'] = 'relu'
config['critic_hiddens'] = [100, 100]
config['critic_hidden_activation'] = 'relu'
config['learning_starts'] = 100
config['gamma'] = 0.95
config['normalize_actions'] = True
config['critic_lr'] = 0.001
config['actor_lr'] = 0.001
config['clip_rewards'] = False
config['clip_actions'] = True
config['timesteps_per_iteration'] = 1
config['twin_q'] = True
config['policy_delay'] = 2
config['smooth_target_policy'] = True
config['target_noise'] = 0.02
config['target_noise_clip'] = 0.5
config['n_step'] = 1
config['buffer_size'] = 1000000
config['target_network_update_freq'] = 0
config['tau'] = 0.0005
```


There you can see some results with the TD3 agent and different train batch sizes. The agent's aim is to get the output as close to 8 as possible. The mean is around that, but the fluctuations are quite high.

Hey @SebastianBo1995, I have run into the same problem. In my simulator, the coordinate range of the agent is 0–1000, and the other parameters range from 0 to 10.

When the PPO algorithm runs for about 500 iterations, the action output of the neural network becomes NaN. I used some tips mentioned in #8135 and #7923, but they didn't seem to help.

Your post gave me some inspiration, so I'm going to reduce the agent's coordinate range to 0–10.

Hey @Glaucus-2G,

I am not quite sure why there is no best-practices section for RLlib. As with training neural nets in general, you usually get better performance when scaling the inputs and outputs.

So my advice would always be to scale the action space and the observation space to the range [0, 1] or [-1, 1].
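One common way to follow this advice without touching the simulation itself is a thin wrapper that lets the agent act in [-1, 1] and maps the action back to the real bounds. The env API and bounds here are assumptions for illustration:

```python
class ScaledActionEnv:
    """Wrapper: the agent acts in [-1, 1]; actions are rescaled to the
    real bounds before being passed to the wrapped environment."""

    def __init__(self, env, act_low, act_high):
        self.env = env
        self.act_low = act_low
        self.act_high = act_high

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # map each component linearly from [-1, 1] to [low, high]
        real = [lo + (a + 1.0) * (hi - lo) / 2.0
                for a, lo, hi in zip(action, self.act_low, self.act_high)]
        return self.env.step(real)
```

With this, -1 maps to the lower bound, +1 to the upper bound, and 0 to the midpoint, so the policy network always works in a well-conditioned range.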

PS: The bad results above came from a bug: the agent didn't see the current observation of a specific value, but the value calculated in the previous step. Also, the KL target was too low.

Hey @SebastianBo1995, your advice is very much appreciated.