Policy Mapping Function based on Environment Observation

Is it possible to create an agent which uses different policies depending on an observation? For example, in a hypothetical Windy Gridworld environment where the wind can change direction spontaneously, an agent might choose a different policy depending on whether the wind is blowing North, South, East or West.
This is similar to hierarchical training, but the policy the agent uses at each time step is selected individually and can change over the course of the episode trajectory.
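To make the idea concrete, here is a minimal sketch of the kind of mapping I have in mind. It is plain Python, not the actual RLlib API (whose `policy_mapping_fn` signature depends on your Ray version); the policy names and the `"wind"` observation key are purely illustrative:

```python
# Hypothetical sketch: select a sub-policy id from the wind-direction
# component of the most recent observation. All names are illustrative.
WIND_TO_POLICY = {
    "N": "policy_north",
    "S": "policy_south",
    "E": "policy_east",
    "W": "policy_west",
}

def select_policy(observation):
    """Return the sub-policy id for the current wind direction.

    `observation` is assumed to be a dict with a "wind" key; a real
    environment would define its own observation layout.
    """
    return WIND_TO_POLICY[observation["wind"]]
```

Because the selection happens per observation, the active policy can switch mid-episode whenever the wind changes.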

As an extension to this question, could you rig a policy mapping function to work like a workflow scheduler? E.g., it's 2 a.m., so run a particular workflow.
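A scheduler-style mapping could look like the following sketch. This is only a toy, assuming hypothetical policy names; the point is just that the mapping function can condition on anything available at call time, including the wall clock:

```python
from datetime import datetime

def scheduled_policy(now: datetime) -> str:
    """Toy scheduler-style mapping: between 02:00 and 02:59 run a
    hypothetical "workflow" policy, otherwise fall back to a default."""
    if now.hour == 2:
        return "workflow_policy"
    return "default_policy"
```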

Hi @remonation ,

and welcome to the Ray community. If the observations from the environment include signals about the weather, shouldn't the agent be able to learn a policy complex enough to react to the different weather conditions and choose a nearly optimal action for each of them? That is what I would expect from such an agent after training long enough on enough data.

I would ensure that the observations carry signals about the weather and that these signals can be learned by the agent (via a deep neural network or another function approximator). There are a lot of pre-configured algorithms in RLlib that make it easy to implement such agents. See Algorithms in the docs.
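One simple way to make the weather signal learnable is to encode it directly into the observation vector. The sketch below appends a one-hot wind-direction encoding to the agent's grid position; the layout and direction set are assumptions for illustration, not taken from any particular environment:

```python
import numpy as np

# Illustrative direction set for a Windy Gridworld variant.
WIND_DIRECTIONS = ["N", "S", "E", "W"]

def build_observation(position, wind):
    """Concatenate the agent's (row, col) position with a one-hot
    encoding of the wind direction so the network can condition on it."""
    one_hot = np.zeros(len(WIND_DIRECTIONS), dtype=np.float32)
    one_hot[WIND_DIRECTIONS.index(wind)] = 1.0
    return np.concatenate(
        [np.asarray(position, dtype=np.float32), one_hot]
    )
```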

Hope this helps

Hi @Lars_Simon_Zehnder ,
I managed to code up the hierarchical training solution for this case, and yes, I agree that the model should be able to recognize these weather patterns on its own. However, the idea of hierarchical training here is to see whether some amount of manual information could be used to improve convergence and results.


Interesting. Do you mean something like the idea of hierarchical priors in Bayesian approaches?

No, just a basic hierarchy implementation, where an observation of the environment indicates to the agent that a specific sub-policy should be used.
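A minimal dispatching sketch of that idea, with hypothetical sub-policies and a made-up `"wind"` observation key (each sub-policy here would be trained separately; this only shows the routing logic):

```python
# Hypothetical sub-policies: each one is an illustrative stand-in
# for a separately trained policy specialized to one wind direction.
def north_policy(obs):
    return "move_south"  # illustrative counter-action

def south_policy(obs):
    return "move_north"

SUB_POLICIES = {"N": north_policy, "S": south_policy}

def act(observation):
    """Route the observation to the sub-policy selected by the
    observed wind direction and return that sub-policy's action."""
    sub_policy = SUB_POLICIES[observation["wind"]]
    return sub_policy(observation)
```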