How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I have a custom environment. After training with PPO, I am computing actions from observations. The observation space consists of eight values, and the action space is discrete with four possible actions: 0, 1, 2, or 3.
When I reset my environment, I always get the same observation values. However, when I call compute_single_action(obs) on that reset observation, I get different actions.
I would expect that, for the same input observation, compute_single_action(obs) would return the same action every time it is called.
Any help understanding why the output of compute_single_action(obs) changes from one invocation to the next for the same input would be appreciated.
I have repeated the analysis using the CartPole-v1 environment and have the same concern. If I take a single observation and repeatedly call compute_single_action(observation) with it, I get different actions. For instance, in one test of ten calls, I got six 0 actions and four 1 actions.
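For reference, here is roughly how I reproduce this with CartPole-v1 (a minimal sketch assuming the Ray 2.x PPOConfig API and gymnasium; my actual script differs slightly):

```python
import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig

# Build and briefly train a PPO algorithm on CartPole-v1.
algo = PPOConfig().environment("CartPole-v1").build()
algo.train()  # one training iteration is enough to see the behaviour

# Take one fixed observation from a reset environment.
env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)

# Call compute_single_action repeatedly with the identical observation.
actions = [algo.compute_single_action(obs) for _ in range(10)]
print(actions)  # a mix of 0s and 1s, even though obs never changes
```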
Have I mistaken compute_single_action(observation) for the policy function in RL? If so, what should I use to access the policy?
Also take note of this quote from the documentation:
“IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!”
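Based on that quote, the only workaround I have found is to disable exploration when querying actions (continuing the CartPole sketch above; this is just my reading of the docs, and per the quote it may not be what I actually want for a policy-gradient algorithm like PPO):

```python
# Continuing the sketch above (same `algo` and `obs`).
# With exploration disabled, my understanding is that the returned action is
# the greedy/deterministic one rather than a sample from the action distribution.
greedy = [algo.compute_single_action(obs, explore=False) for _ in range(10)]
print(greedy)  # I would expect all ten actions to be identical here

# The policy object itself can also be queried directly; it returns a tuple
# of (action, state_outs, extra_info).
policy = algo.get_policy()
action, state_out, info = policy.compute_single_action(obs, explore=False)
print(action)
```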