Inconsistent actions from Algorithm.compute_single_action

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a custom environment. After training with PPO, I compute actions from observations. Each observation consists of eight values, and the action space is discrete with four possible actions: 0, 1, 2, or 3.

When I reset my environment, I always get the same observation values. However, when I call compute_single_action(obs) on that post-reset observation, I get different actions from one call to the next.

I would expect that, for identical observation values, compute_single_action(obs) would return the same action every time it is called.

Any help understanding why the output from compute_single_action(obs) changes from one function invocation to the next, with equal input, would be appreciated.

I have repeated the analysis using the CartPole-v1 environment and see the same behavior. If I take a single observation and repeatedly call compute_single_action(observation) with it, I get different actions. For instance, in one test of ten calls, I got six 0 actions and four 1 actions.

Have I mistaken compute_single_action(observation) for the policy function in RL? If so, what should I use to access the policy function?

Hi @doug57,

This is likely expected behavior. Check out the exploration setting here:

Also take note of this quote from the documentation:
“IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!”
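To illustrate why the same observation can produce different actions, here is a minimal sketch in plain Python. It is not RLlib code: the logits, the softmax helper, and select_action are hypothetical stand-ins for what a policy with a categorical action distribution does. With exploration on, the action is sampled from the distribution; with exploration off, the most probable action is returned every time.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Turn raw policy outputs (logits) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, explore, rng):
    """Hypothetical stand-in for action selection in a discrete-action policy."""
    probs = softmax(logits)
    if explore:
        # Stochastic: sample an action index according to its probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Deterministic: always return the most probable action.
    return max(range(len(probs)), key=probs.__getitem__)


# Assumed logits for one fixed observation with two actions (as in CartPole).
logits = [0.4, 0.0]
rng = random.Random(0)

sampled = Counter(select_action(logits, True, rng) for _ in range(10))
greedy = Counter(select_action(logits, False, rng) for _ in range(10))

print("explore=True :", dict(sampled))  # counts can be split between 0 and 1
print("explore=False:", dict(greedy))   # all ten calls return the same action
```

A split such as six 0s and four 1s over ten calls is consistent with sampling from a distribution like this, so it does not by itself indicate a bug.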

Thank you for your response. I appreciate you confirming that this is the expected behavior. I will look into using “explore=False” during inference.
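For anyone checking this on their own model, below is a small helper I would try. It is a sketch under assumptions: action_counts is a hypothetical name of mine, algo is assumed to be an already trained Algorithm (for example one built from a PPO config), and it relies on compute_single_action accepting an explore keyword argument, which may differ between Ray versions.

```python
from collections import Counter


def action_counts(algo, obs, n=10, explore=False):
    """Call algo.compute_single_action(obs, explore=...) n times and tally the actions.

    Assumes a discrete action space, so each returned action converts to int.
    """
    return Counter(
        int(algo.compute_single_action(obs, explore=explore)) for _ in range(n)
    )
```

With a trained algo and a post-reset observation obs, action_counts(algo, obs, explore=False) should contain a single action, while action_counts(algo, obs, explore=True) may contain several. As the documentation quote above points out, the deterministic action is not necessarily the optimal policy if the learned policy is a stochastic one.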