I have a custom environment that outputs a property called "p1", which ranges from 10^-4 to 100.
Algorithm - PPO
Action space - Box(-0.05, +0.05, (3,))
Episode length - 10 steps
The aim of the agent is to reach a state where p1 is as close to 10^-4 as possible.
Problem - states where p1 < 0.1 occur very rarely (about 0.5 %). Hence the learned policy is sub-optimal, i.e. it reaches states where p1 ≈ 0.1 rather than 0.0001.
I want the agent to take larger actions, i.e. Box(±0.05), for the first five steps and smaller actions, like Box(±0.001), for the last five steps.
One way of looking at your problem is that, with the action space as you defined it, the desired later actions Box(±0.001) occupy only a tiny fraction of the overall action space.
To address this, you might consider applying a transform to your action space, perhaps something similar to the log-modulus transform L(x) = sign(x) · log(|x| + 1) described in "A log transformation of positive and negative values" on The DO Loop (though you might need to change the formula a bit, say by multiplying x by a large constant before taking the log).
The right transform may make it much easier for exploration to discover the good regions of the action space.
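If it helps, here is a minimal sketch of what such a transform could look like as an action wrapper, assuming a Gymnasium-style environment. The wrapper name, the `max_action` and `sharpness` parameters, and the exact exponential form are illustrative choices of mine, not a prescribed formula; the idea is just that the policy samples from a normalized box and the wrapper squashes small actions toward zero before they reach the environment.

```python
import numpy as np
import gymnasium as gym


class LogScaleActionWrapper(gym.ActionWrapper):
    """Policy acts in a normalized Box(-1, 1, (3,)); the wrapper maps that
    action onto the environment's Box(-0.05, 0.05, (3,)) through an
    exponential (inverse log-style) transform, so a large share of the
    normalized space corresponds to very small environment actions."""

    def __init__(self, env, max_action=0.05, sharpness=10.0):
        super().__init__(env)
        self.max_action = max_action
        self.sharpness = sharpness
        n = env.action_space.shape[0]
        # This is the action space PPO will actually sample from.
        self.action_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=(n,), dtype=np.float32
        )

    def action(self, act):
        act = np.clip(act, -1.0, 1.0)
        # Small |act| maps to environment actions several orders of
        # magnitude smaller than max_action; |act| = 1 maps to exactly
        # +/- max_action.
        scale = (np.exp(self.sharpness * np.abs(act)) - 1.0) / (
            np.exp(self.sharpness) - 1.0
        )
        return np.sign(act) * self.max_action * scale
```

With these illustrative values (sharpness = 10), a normalized action of 0.5 maps to an environment action of roughly 3e-4, so half of the policy's action range already covers the fine-grained moves you want near the end of an episode; you would tune the sharpness (or plug in the log-modulus formula directly) to suit your environment.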