How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
- Very High: This implies RLlib's results cannot be trusted, or I have done something wrong (I would be happy to be wrong here, but I don't see how I would be!)
Either something is wrong with how my rollout class uses the policies, or the data returned by RLlib is wrong.
RLlib reports that a given policy achieved win rates of 0.89 → 0.95 → 0.99, and a run from algo.evaluate() matches that.
However, if I take those same models and run them through a manual rollout, I get completely different results, e.g., win_rate = 0.48.
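For concreteness, the manual rollout does essentially the following (a minimal sketch; the real code is in the linked gist, and the env class, policy-mapping function, and win-rate bookkeeping here are placeholders):

```python
from ray.rllib.algorithms.algorithm import Algorithm

# Placeholder checkpoint path and env -- the real ones are in the linked gist.
algo = Algorithm.from_checkpoint("/path/to/checkpoint")
env = MyMultiAgentEnv()  # placeholder multi-agent env
N_EPISODES = 500
wins = 0

for _ in range(N_EPISODES):
    obs, _ = env.reset()  # gymnasium-style reset; older gym returns obs only
    terminateds = {"__all__": False}
    infos = {}
    while not terminateds["__all__"]:
        # Query each agent's policy directly; explore=False to get
        # deterministic actions, matching what I'd expect evaluation to use.
        actions = {
            agent_id: algo.compute_single_action(
                agent_obs,
                policy_id=policy_mapping_fn(agent_id),  # placeholder mapping fn
                explore=False,
            )
            for agent_id, agent_obs in obs.items()
        }
        obs, rewards, terminateds, truncateds, infos = env.step(actions)
    # Placeholder win bookkeeping -- the gist records this per episode.
    wins += int(infos.get("learner", {}).get("win", 0))

print("manual win_rate:", wins / N_EPISODES)
```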
You can find the necessary files here: README.md · GitHub
Top-line results:
- step 0 manual evaluate: 0.458
- step 0 RLlib evaluate: 0.89
- RLlib win rates from train results: 0.85 → 0.94 → 0.99
- post-train RLlib evaluate: 1.0
- post-train manual evaluate: 0.434