Do the observations and rewards need to be normalized if the observation values are over 1000 and the rewards are sometimes over 100? Or does RLlib normalize the observations and rewards itself?
- None: Just asking a question out of curiosity
@gjoliver Do you know if this is possible without connectors?
The only solutions I can think of are either using env wrappers or creating custom preprocessors and feeding them into the policy's preprocessor list via callbacks in the on_algorithm_init method.
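For the env wrapper option, something like the rough sketch below should work. It assumes a Box observation space and uses fixed scale factors picked for the ~1000 observation / ~100 reward ranges mentioned above (the class name and factors are just placeholders):

```python
import gymnasium as gym
import numpy as np


class ScaleObsAndReward(gym.Wrapper):
    """Rescales large observations and rewards into a smaller range."""

    def __init__(self, env, obs_scale=1e-3, reward_scale=1e-2):
        super().__init__(env)
        self.obs_scale = obs_scale
        self.reward_scale = reward_scale
        # Shrink the (Box) observation space bounds by the same factor.
        self.observation_space = gym.spaces.Box(
            low=env.observation_space.low * obs_scale,
            high=env.observation_space.high * obs_scale,
            dtype=np.float32,
        )

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return (obs * self.obs_scale).astype(np.float32), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return (
            (obs * self.obs_scale).astype(np.float32),
            reward * self.reward_scale,
            terminated,
            truncated,
            info,
        )
```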
I think connectors (which will be released in 2.3) are a more elegant solution to this, but we need a couple of examples to show how this use case of normalizing obs/reward spaces would be achieved with custom connectors.
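Independent of whether it ends up in a wrapper, a preprocessor, or a connector, the normalization itself is usually just a running mean/std that gets updated as samples come in. A rough, framework-agnostic sketch (this does not use the connector API; it only shows the statistic such a connector would maintain):

```python
import numpy as np


class RunningNormalizer:
    """Tracks a running mean/variance and normalizes incoming values."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps

    def update(self, x):
        # Combine the batch statistics with the running statistics
        # (parallel/Welford-style variance update).
        x = np.asarray(x, dtype=np.float64).reshape(-1, *self.mean.shape)
        batch_mean, batch_var, batch_count = x.mean(0), x.var(0), x.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = (
            self.var * self.count
            + batch_var * batch_count
            + delta ** 2 * self.count * batch_count / total
        )
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x):
        return (np.asarray(x) - self.mean) / np.sqrt(self.var + 1e-8)
```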
Agreed. Would be awesome to build this as a connector, so that the trained policy will always operate in this mode regardless of the env it deals with.
An env wrapper is another way, but then you (the user) have to manage it yourself and make sure that whenever you want to use a policy trained with these normalizations, the wrapper is applied.
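For example, with the wrapper route you would register the wrapped env yourself and apply the same wrapper again wherever the trained policy is later used. A rough sketch (the env name and Pendulum-v1 are just placeholders; ScaleObsAndReward is the wrapper sketched earlier in this thread):

```python
import gymnasium as gym
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

# ScaleObsAndReward is the wrapper class from the sketch above.
register_env(
    "scaled_env",
    lambda cfg: ScaleObsAndReward(gym.make("Pendulum-v1")),
)

# Train against the wrapped env; the same registration (and wrapper)
# must also be in place when the policy is restored and served later.
config = PPOConfig().environment(env="scaled_env")
algo = config.build()
```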