Train / Evaluate hist stats not even close to matching manual evaluation stats

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.
  • Very High: This implies RLlib's results cannot be trusted, or that I have done something wrong (I would be happy to be wrong here, but I don't see how I would be!).

Either something is wrong with how my rollout class is using the policies, or the data returned by RLlib is wrong.

RLlib reports that a given policy achieved win rates of 0.89 → 0.95 → 0.99, and a run from algo.evaluate() matches those numbers.
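
To be concrete about which numbers I mean, this is roughly where the two RLlib-side figures come from (a sketch only; the win_rate_mean key is a placeholder for whatever custom metric the linked code actually logs):

```python
# "win_rate_mean" is a stand-in for the custom metric logged by the
# callbacks in the linked files.
train_results = algo.train()
print(train_results["custom_metrics"]["win_rate_mean"])   # train-time win rate

eval_results = algo.evaluate()
print(eval_results["evaluation"]["custom_metrics"]["win_rate_mean"])  # evaluation win rate
```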

However, if I take those same models and run them in a manual rollout, I get completely different results, e.g. win_rate = 0.48.
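
Roughly, the manual evaluation does the following (a simplified sketch; the "main" policy id, the Gymnasium-style env API, and the info["win"] flag are stand-ins for what the linked files actually do):

```python
def manual_win_rate(algo, env, num_episodes=500):
    """Roll out a trained policy by hand and count wins."""
    policy = algo.get_policy("main")  # stand-in policy id
    wins = 0
    for _ in range(num_episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        info = {}
        while not (terminated or truncated):
            # NOTE: by default this samples from the (stochastic) action
            # distribution -- see the fix at the bottom of this post.
            action, _, _ = policy.compute_single_action(obs)
            obs, reward, terminated, truncated, info = env.step(action)
        wins += int(info.get("win", False))
    return wins / num_episodes
```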

You can find the necessary files here: README.md · GitHub

Top line results:

step 0 manual evaluate:  0.458
step 0 rllib evaluate:  0.89
rllib win-rates from train results:  0.85 --> 0.94 --> 0.99
post train rllib evaluate:  1.0
post train manual eval:  0.434

Fixed this by setting the explore boolean to False.
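
For anyone else running into this, the two places exploration can be switched off look roughly like this (a sketch; PPO, CartPole-v1, and the variable names are only illustrative):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Option 1: ask the policy for its deterministic (greedy) action instead
# of sampling from the action distribution (policy/obs as in the manual
# rollout sketch above).
action, _, _ = policy.compute_single_action(obs, explore=False)

# Option 2: make RLlib's own evaluation deterministic as well, so
# algo.evaluate() and the manual numbers are computed the same way.
config = (
    PPOConfig()
    .environment("CartPole-v1")          # placeholder env
    .evaluation(
        evaluation_interval=1,
        evaluation_config={"explore": False},
    )
)
```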