Hello, I'm very new to Ray RLlib.
I'm currently working on a personal project on offline RL.
My plan is to extract an offline dataset from a global optimization solver and have MARWIL imitate its policy.
After checking the MARWIL tuned CartPole example below, I found that the example's offline data consists not only of eps_id, obs, actions, and rewards, but also action_prob, action distribution inputs, and value targets for every episode.
My question is: is that policy information (action_prob, distribution inputs, …) essential for running MARWIL?
If not, what is the purpose of this data in the tuned example?
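For context, here is a minimal sketch of how I'm planning to record the solver's trajectories, based on the SampleBatchBuilder / JsonWriter pattern from rllib/examples/saving_experiences.py. The `solver_action` function is a placeholder for my solver (stubbed with a random action here so the sketch runs), and `/tmp/solver-out` is just an example output path:

```python
import gym

from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder
from ray.rllib.offline.json_writer import JsonWriter

env = gym.make("CartPole-v0")


def solver_action(obs):
    # Placeholder: in my project this would query the global optimization
    # solver; here it just samples a random action so the sketch is runnable.
    return env.action_space.sample()


batch_builder = SampleBatchBuilder()
writer = JsonWriter("/tmp/solver-out")

for eps_id in range(10):
    obs = env.reset()
    done = False
    t = 0
    while not done:
        action = solver_action(obs)
        new_obs, rew, done, info = env.step(action)
        batch_builder.add_values(
            t=t,
            eps_id=eps_id,
            agent_index=0,
            obs=obs,
            actions=action,
            rewards=rew,
            dones=done,
            infos=info,
            new_obs=new_obs,
            # Note: no action_prob / action_dist_inputs here -- the solver is
            # deterministic, so I don't have them. This is exactly what my
            # question is about.
        )
        obs = new_obs
        t += 1
    # Write one completed episode as a SampleBatch to the JSON output.
    writer.write(batch_builder.build_and_reset())
```

So concretely: can MARWIL train on batches like these, which lack the policy fields present in the tuned example's data?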