Hey, I am new to Ray and working on my first project. The idea is to reproduce the following Matlab example:
Basically, a PI controller is tuned with RL; the parameters Kp and Ki are part of the policy and shall be extracted after training. I am trying to define a custom model that has only two neurons without bias for the policy, mapping the (2,) observation space to the (1,) action space.
The documentation regarding custom models was of great help. However, I am struggling to understand how forward() and value_function() work and at which points they are called.
In the example above they also use the two-neuron policy, a large value network to stabilize the learning process, and the TD3 algorithm. Maybe an example is available that could provide some help?
Thanks in advance
They are called on several occasions, which also depend on your framework.
I’ll explain for torch to keep it simple here:
On every timestep, a RolloutWorker gets an observation and preprocesses it before feeding it into forward(). It captures the output and builds an action distribution from which an action is sampled. On the learner thread, the sampled batch is then used to call forward() again, but also get_policy_output(). These are all a little special for TD3, because it inherits from DDPG. To get a better understanding, you should first understand how DDPG works. It's all simpler for, say, an ordinary policy gradient algorithm.
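To make the forward()/value_function() interplay concrete, here is a minimal sketch in plain PyTorch. It is not a full RLlib TorchModelV2 subclass (the class name and the 64-unit value branch are my own choices, not from the docs), but it mirrors the typical pattern: forward() maps the preprocessed observation to action-distribution inputs, and value_function() reuses the features cached by the most recent forward() call.

```python
import torch
import torch.nn as nn

class TwoNeuronPolicySketch(nn.Module):
    """Sketch of a custom-model layout; a real RLlib model would also
    subclass TorchModelV2 and take (obs_space, action_space, ...)."""

    def __init__(self, obs_dim=2, num_outputs=1):
        super().__init__()
        # Two weights, no bias: these play the role of Kp and Ki.
        self.policy = nn.Linear(obs_dim, num_outputs, bias=False)
        # A larger, separate value branch (as in the Matlab example) to
        # stabilize learning; it does not affect the final controller.
        self.value_branch = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )
        self._last_obs = None

    def forward(self, input_dict, state, seq_lens):
        # RLlib passes a dict; the preprocessed observation sits under "obs".
        obs = input_dict["obs"].float()
        self._last_obs = obs  # cached for the next value_function() call
        return self.policy(obs), state

    def value_function(self):
        # Called after forward(); operates on the cached observation.
        return self.value_branch(self._last_obs).squeeze(-1)
```

After training, the two learned weights in `self.policy` are exactly the controller gains you want to extract.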
There is no example involving such a classic PI-controller application. But all neural networks are non-linear function approximators, so ultimately they can serve to approximate any PI(D) controller that you throw at them.
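In your special case the approximation is even exact: assuming the observation is (error e, integrated error), a bias-free linear layer with two weights is precisely the PI law u = Kp·e + Ki·∫e, so the gains can be read straight off the learned weights. A sketch (the weight values here are made up for illustration):

```python
import numpy as np

# Hypothetical learned weights of the two-neuron, bias-free policy.
weights = np.array([1.8, 0.35])
Kp, Ki = weights  # first weight acts on e, second on the integrated error

def pi_action(e, e_int):
    # Identical to the two-neuron policy's forward pass:
    # u = Kp * e + Ki * integral(e)
    return Kp * e + Ki * e_int
```

In RLlib you would read these weights out of the trained policy (e.g. via the policy's model state dict for torch) rather than hard-coding them.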