Warnings during MARWIL training

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m trying to run MARWIL training (offline RL) on a dataset I collected using an expert agent and the JsonWriter class, in a similar way to this example: ray/saving_experiences.py at master · ray-project/ray · GitHub
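The collection step looks roughly like this (a condensed sketch modeled on that example, not my exact code; the env, the output path, and the constant action are placeholders for my real environment and expert):

```python
import gym

from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder
from ray.rllib.offline.json_writer import JsonWriter

env = gym.make("CartPole-v0")           # placeholder env, not my real one
writer = JsonWriter("/tmp/expert-out")  # hypothetical output directory
batch_builder = SampleBatchBuilder()

for eps_id in range(10):
    obs = env.reset()
    done, t = False, 0
    while not done:
        action = 0  # stand-in for the deterministic expert's action
        new_obs, reward, done, info = env.step(action)
        batch_builder.add_values(
            t=t,
            eps_id=eps_id,
            agent_index=0,
            obs=obs,
            actions=action,
            action_prob=1.0,  # deterministic expert -> probability 1
            action_logp=0.0,  # log(1.0)
            rewards=reward,
            dones=done,
            infos=info,
            new_obs=new_obs,
        )
        obs, t = new_obs, t + 1
    writer.write(batch_builder.build_and_reset())  # write one episode batch
```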

Since I’m using a deterministic expert, the logged parameters should be action_prob=1.0 (action_logp=0.0). I then train with the MARWIL algorithm, using beta=1.0 and input_evaluation=['is', 'wis'].
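The training setup is roughly the following (again a sketch; the env name and input path are placeholders for my actual config):

```python
import ray
from ray.rllib.agents.marwil import MARWILTrainer

ray.init()

config = {
    "env": "CartPole-v0",               # placeholder; supplies obs/action spaces
    "input": "/tmp/expert-out",         # directory written by JsonWriter above
    "beta": 1.0,                        # advantage weighting (0.0 = plain behavior cloning)
    "input_evaluation": ["is", "wis"],  # off-policy estimators that emit the warnings below
}

trainer = MARWILTrainer(config=config)
for _ in range(5):
    print(trainer.train())
```

While this trains, I get the following warnings: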

(MARWILTrainer pid=168137) /home/<user>/.local/lib/python3.8/site-packages/ray/rllib/offline/is_estimator.py:25: RuntimeWarning: overflow encountered in double_scalars
(MARWILTrainer pid=168137) p.append(pt_prev * new_prob[t] / old_prob[t])
(MARWILTrainer pid=168137) /home/<user>/.local/lib/python3.8/site-packages/ray/rllib/offline/is_estimator.py:31: RuntimeWarning: invalid value encountered in double_scalars
(MARWILTrainer pid=168137) V_step_IS += p[t] * rewards[t] * self.gamma ** t
(MARWILTrainer pid=168137) /home/<user>/.local/lib/python3.8/site-packages/ray/rllib/offline/wis_estimator.py:31: RuntimeWarning: overflow encountered in double_scalars
(MARWILTrainer pid=168137) p.append(pt_prev * new_prob[t] / old_prob[t])
(MARWILTrainer pid=168137) /home/<user>/.local/lib/python3.8/site-packages/ray/rllib/offline/wis_estimator.py:45: RuntimeWarning: invalid value encountered in double_scalars
(MARWILTrainer pid=168137) V_step_WIS += p[t] / w_t * rewards[t] * self.gamma ** t

From what I understand, MARWIL uses this probability p[t] along with the agent's rewards to weight the data and improve over Behaviour Cloning, and the 'is'/'wis' input evaluation reports value estimates based on the same ratios.
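For concreteness, here is a minimal numeric sketch (with made-up probabilities and rewards) of what the estimator lines in the warnings compute; the point is that p[t] is a cumulative product of per-step ratios, which can grow or shrink exponentially over long episodes:

```python
import numpy as np

# Made-up per-step probabilities of the logged actions for one episode.
# old_prob is what the dataset recorded (action_prob); new_prob is what
# the policy currently being trained assigns to the same actions.
old_prob = np.array([1.0, 1.0, 1.0])  # deterministic expert: action_prob = 1.0
new_prob = np.array([0.3, 0.6, 0.9])  # current policy's probabilities
rewards = np.array([1.0, 1.0, 1.0])
gamma = 0.99

# Cumulative importance ratio p[t] = prod_{k<=t} new_prob[k] / old_prob[k].
# Over hundreds of steps this product can overflow, or collapse to 0/NaN,
# which seems to match the RuntimeWarnings above.
p, pt_prev = [], 1.0
for t in range(len(rewards)):
    pt_prev = pt_prev * new_prob[t] / old_prob[t]
    p.append(pt_prev)

# Per-episode importance-sampling estimate of the learned policy's value.
V_step_IS = sum(p[t] * rewards[t] * gamma**t for t in range(len(rewards)))
print(p, V_step_IS)
```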

Do these warnings mean that the agent can't learn effectively under my config/conditions/dataset? (Maybe it reverts back to Behaviour Cloning?)
Are the warnings due to action_prob being equal to 1, or to other aspects of my config/training parameters?
How can I fix them?

In case it's relevant: there shouldn't be any problems with my rewards.


Hi there! 👋

Welcome to the RLlib community! Would you like to ask your question in RLlib Office Hours? ✍️ Just add the Discuss link to your question to this doc: RLlib Office Hours - Google Docs

Thanks! Hope to see you there!