Working with offline data: v_gain formulation

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

The following v_gain formulation is from the RLlib Working With Offline Data documentation page, https://docs.ray.io/en/master/rllib/rllib-offline.html:

“v_gain: v_target / max(v_behavior, 1e-8), averaged over episodes in the batch. v_gain > 1.0 indicates that the policy is better than the policy that generated the behavior data.”

This formulation appears to assume that the value of the behavior policy is non-negative, since any v_behavior below 1e-8 (including negative values) is replaced with 1e-8.

In our application there are cases where v_behavior is negative, and as a result the v_gain value blows up.

Here is a simple example:

v_behavior = -1 (replaced with 1e-8)
v_target = 1
v_gain = 1 / 1e-8 = 1e8
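
For concreteness, here is a minimal Python sketch of that calculation, using the clipping max(v_behavior, 1e-8) from the documented formula and the hypothetical values from the example above:

```python
# Sketch of the documented v_gain calculation with a negative v_behavior.
v_behavior = -1.0   # behavior-policy value (hypothetical, negative)
v_target = 1.0      # target-policy value (hypothetical)

v_gain = v_target / max(v_behavior, 1e-8)  # -1.0 gets clipped to 1e-8
print(v_gain)  # 100000000.0 -- the metric blows up
```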

An alternative v_gain formulation would be to use a percent change calculation: 100 * (v_target - v_behavior) / |v_behavior|
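
A minimal sketch of that alternative; the epsilon guard against v_behavior being exactly zero is my own addition, not something from the docs:

```python
def v_gain_percent_change(v_target: float, v_behavior: float,
                          eps: float = 1e-8) -> float:
    """Percent change of v_target relative to v_behavior.

    Positive values mean the target policy is better than the behavior
    policy, regardless of the sign of v_behavior.
    """
    return 100.0 * (v_target - v_behavior) / max(abs(v_behavior), eps)

print(v_gain_percent_change(1.0, -1.0))  # 200.0 -> target is better
print(v_gain_percent_change(1.0, 2.0))   # -50.0 -> target is worse
```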

Thoughts?

Thanks,
Stefan