Working with offline data: v_gain formulation

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

The following v_gain formulation is from the RLlib "Working with Offline Data" documentation page:

“v_gain: v_target / max(v_behavior, 1e-8), averaged over episodes in the batch. v_gain > 1.0 indicates that the policy is better than the policy that generated the behavior data.”

This formulation appears to assume that the value of the behavior policy is positive: any v_behavior below 1e-8 (including all negative values) is clamped to 1e-8 by the max().

In our application there are cases where v_behavior is negative, and as a result the v_gain value blows up.

Here is a simple example:

v_behavior = -1 (clamped to 1e-8)
v_target = 1
v_gain = 1 / 1e-8 = 1e8
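The blow-up can be reproduced with a minimal standalone sketch of the formula as quoted from the docs (this is a reimplementation for illustration, not RLlib's actual code):

```python
# Sketch of the documented v_gain formula: v_target / max(v_behavior, 1e-8).
v_behavior = -1.0
v_target = 1.0

# The max() clamps the negative v_behavior to 1e-8, so the ratio explodes.
v_gain = v_target / max(v_behavior, 1e-8)
print(v_gain)  # on the order of 1e8
```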

An alternative v_gain formulation would be to use a percent change calculation: 100 * (v_target - v_behavior) / |v_behavior|
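The percent-change alternative could be sketched as follows (the helper name `v_gain_pct` is hypothetical; a zero v_behavior would still need separate handling, since the denominator vanishes):

```python
def v_gain_pct(v_target: float, v_behavior: float) -> float:
    """Percent improvement of the target policy over the behavior policy.

    Uses 100 * (v_target - v_behavior) / |v_behavior|, which stays
    well-behaved for negative v_behavior (but is undefined at zero).
    """
    return 100.0 * (v_target - v_behavior) / abs(v_behavior)

print(v_gain_pct(1.0, -1.0))  # 200.0: target improves on a negative behavior value
print(v_gain_pct(1.0, 2.0))   # -50.0: target is worse than behavior
```

With this formulation the sign of the result directly indicates improvement (positive) or regression (negative), regardless of the sign of v_behavior.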