Thanks to the official RLlib example custom_metrics_and_callbacks.py, I am able to gather the max and mean advantages by accessing postprocessed_batch["advantages"] inside the on_postprocess_trajectory() callback.
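
For reference, this is roughly what I am doing for the advantages. It is only a minimal sketch, assuming a torch PPO setup and a recent Ray release where DefaultCallbacks lives in ray.rllib.algorithms.callbacks:

```python
import numpy as np

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class AdvantageMetricsCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        # "advantages" is filled in by GAE postprocessing (e.g. with PPO).
        advantages = postprocessed_batch["advantages"]
        episode.custom_metrics["advantages_max"] = float(np.max(advantages))
        episode.custom_metrics["advantages_mean"] = float(np.mean(advantages))
```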
Now I have another straightforward question. Robust PLR may switch between training and evaluation for each episode, based on the sample-replay decision. To toggle training and evaluation for all workers on a per-episode basis, would calling policies["default_policy"].model.eval() or policies["default_policy"].model.train() be sufficient to switch modes for the specific episodes?
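
To make the question concrete, here is roughly what I have in mind. This is only a sketch: sample_replay_decision() is a hypothetical placeholder for the actual Robust PLR replay decision, and I am again assuming the torch framework and DefaultCallbacks from ray.rllib.algorithms.callbacks. Since on_episode_start() fires on every rollout worker, my hope is that this toggles the mode everywhere:

```python
import random

from ray.rllib.algorithms.callbacks import DefaultCallbacks


def sample_replay_decision() -> bool:
    # Hypothetical stand-in for the Robust PLR sample-replay decision;
    # in practice this would query the level-replay buffer.
    return random.random() < 0.5


class PLRModeCallbacks(DefaultCallbacks):
    def on_episode_start(
        self, *, worker, base_env, policies, episode, env_index, **kwargs
    ):
        # For torch policies, .model is an nn.Module, so train()/eval() exist.
        model = policies["default_policy"].model
        if sample_replay_decision():
            model.train()  # replayed level -> training mode
        else:
            model.eval()   # new level -> evaluation mode
```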