Train / Evaluate hist stats not even close to matching manual evaluation stats

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.
  • Very High: This implies RLlib's results cannot be trusted, or that I have done something wrong (I would be happy to be wrong here, but I don't see how I would be!).

Either something is wrong with how my rollout class is using the policies, or the data returned by RLlib is wrong.

RLlib reports that a given policy achieved win rates of 0.89 → 0.95 → 0.99, and a run from algo.evaluate() matches those numbers.
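
To be concrete about which numbers I mean, this is roughly where the two RLlib-side figures come from (a sketch only; the win_rate_mean key is a placeholder for whatever custom metric the linked code actually logs):

```python
# "win_rate_mean" is a stand-in for the custom metric logged by the
# callbacks in the linked files.
train_results = algo.train()
print(train_results["custom_metrics"]["win_rate_mean"])   # train-time win rate

eval_results = algo.evaluate()
print(eval_results["evaluation"]["custom_metrics"]["win_rate_mean"])  # evaluation win rate
```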

However, if I take those same models and run them in a manual rollout, I get completely different results, e.g. win_rate = 0.48.
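
Roughly, the manual evaluation does the following (a simplified sketch; the "main" policy id, the Gymnasium-style env API, and the info["win"] flag are stand-ins for what the linked files actually do):

```python
def manual_win_rate(algo, env, num_episodes=500):
    """Roll out a trained policy by hand and count wins."""
    policy = algo.get_policy("main")  # stand-in policy id
    wins = 0
    for _ in range(num_episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        info = {}
        while not (terminated or truncated):
            # NOTE: by default this samples from the (stochastic) action
            # distribution -- see the fix at the bottom of this post.
            action, _, _ = policy.compute_single_action(obs)
            obs, reward, terminated, truncated, info = env.step(action)
        wins += int(info.get("win", False))
    return wins / num_episodes
```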

You can find the necessary files here: README.md · GitHub

Top line results:

step 0 manual evaluate:  0.458
step 0 rllib evaluate:  0.89
rllib win-rates from train results:  0.85 --> 0.94 --> 0.99
post train rllib evaluate:  1.0
post train manual eval:  0.434

Fixed this by setting the explore boolean to False.
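
For anyone else running into this, the two places exploration can be switched off look roughly like this (a sketch; PPO, CartPole-v1, and the variable names are only illustrative):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Option 1: ask the policy for its deterministic (greedy) action instead
# of sampling from the action distribution (policy/obs as in the manual
# rollout sketch above).
action, _, _ = policy.compute_single_action(obs, explore=False)

# Option 2: make RLlib's own evaluation deterministic as well, so
# algo.evaluate() and the manual numbers are computed the same way.
config = (
    PPOConfig()
    .environment("CartPole-v1")          # placeholder env
    .evaluation(
        evaluation_interval=1,
        evaluation_config={"explore": False},
    )
)
```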