Offline data and off-policy estimation

steff007 · April 19, 2022, 5:07pm

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

I’m using Ray Tune to train an RL policy with offline data and specifying weighted importance sampling as the tune metric. Evaluation duration is set to 1,000 and evaluation duration unit is set to episodes. Does that mean that for each sample Ray Tune will use 1,000 randomly selected episodes from the offline data for validation, and use the rest for training?

part of config file

config = {}
config['env'] = None
config['explore'] = False
config['input'] = offline_data_file_names
config['input_evaluation'] = ['is', 'wis']
config['evaluation_duration'] = 1000
config['evaluation_duration_unit'] = 'episodes'

part of tune.run invocation

analysis = tune.run(
    DQNTrainer,
    num_samples=10,
    config=config,
    metric='wis',
    mode='max',
    checkpoint_at_end=True,
    …
)

Thanks,
Stefan

sven1977 · May 9, 2022, 10:27am

Hey @steff007 , thanks for the question!

Yes, I agree, this is confusing. Your input_evaluation setting actually has nothing to do with the other evaluation_... settings.

RLlib runs a) a training iteration (collect samples from the env/offline file and update the model) and b) an evaluation step (run n episodes/timesteps using separate workers and report the results).

For the evaluation step (b):
evaluation_duration=1000 and evaluation_duration_unit=episodes mean that the evaluation workers run through 1000 episodes in each of these steps.

Now the OPE setting (input_evaluation) refers to the training step. The two methods, IS (importance sampling) and WIS (weighted importance sampling), are applied to the batches used for training(!), not to the batches/trajectories that the evaluation workers walk through.
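For readers unfamiliar with the two estimators, here is a minimal pure-Python sketch of step-wise IS and WIS. It is not RLlib's implementation. The episode layout (a list of `(behavior_prob, target_prob, reward)` tuples) and the function names are assumptions made up for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: one step = (behavior_prob, target_prob, reward).
Step = Tuple[float, float, float]
Episode = List[Step]


def cumulative_ratios(episode: Episode) -> List[float]:
    """Running product of target/behavior action probabilities per step."""
    ratios, product = [], 1.0
    for behavior_prob, target_prob, _ in episode:
        product *= target_prob / behavior_prob
        ratios.append(product)
    return ratios


def is_estimate(episodes: List[Episode], gamma: float = 1.0) -> float:
    """Step-wise importance sampling: average of ratio-weighted returns."""
    total = 0.0
    for episode in episodes:
        ratios = cumulative_ratios(episode)
        for t, (ratio, step) in enumerate(zip(ratios, episode)):
            total += ratio * (gamma ** t) * step[2]
    return total / len(episodes)


def wis_estimate(episodes: List[Episode], gamma: float = 1.0) -> float:
    """Step-wise weighted IS: at each step, normalize by the sum of ratios."""
    all_ratios = [cumulative_ratios(ep) for ep in episodes]
    horizon = max(len(ep) for ep in episodes)
    value = 0.0
    for t in range(horizon):
        numerator, denominator = 0.0, 0.0
        for ratios, episode in zip(all_ratios, episodes):
            if t < len(episode):  # only episodes that reach step t
                numerator += ratios[t] * episode[t][2]
                denominator += ratios[t]
        if denominator > 0.0:
            value += (gamma ** t) * numerator / denominator
    return value
```

Both functions only need logged action probabilities and rewards, so they can be applied to whichever batches they are handed, training or held-out. WIS trades a small bias for lower variance by normalizing the weights.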

We should probably unify the input_evaluation settings with the evaluation_... settings or make this more clear in the config comments.

steff · May 10, 2022, 5:03am

Hi Sven,

Thanks for your reply.

If I understand this correctly, when config['input_evaluation'] = ['is', 'wis'], the off-policy estimates that Ray Tune computes for each sample are based on the data that was used to train the corresponding policy?

If that’s the case, it seems that we shouldn’t rely on Ray Tune to identify the best hyper-parameter values, since the corresponding policy could be overfit to the training data. Another approach would be to split the offline data into train and test sets, use the train data to train multiple policies using Ray Tune, and then separately evaluate those policies using ray.rllib.offline.is_estimator.ImportanceSamplingEstimator and/or ray.rllib.offline.wis_estimator.WeightedImportanceSamplingEstimator to identify the best policy. What do you think?
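A minimal sketch of the file-level holdout split described above, assuming the offline data is stored as a list of file names and that no episode spans more than one file. The helper name `split_offline_files` and its parameters are hypothetical and not part of Ray.

```python
import random
from typing import List, Sequence, Tuple


def split_offline_files(
    file_names: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the file names reproducibly and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    files = sorted(file_names)  # fixed starting order, so the seed decides
    random.Random(seed).shuffle(files)
    n_test = max(1, round(len(files) * test_fraction))
    if n_test >= len(files):
        raise ValueError("need at least one file left for training")
    return files[n_test:], files[:n_test]
```

The train list would then go into config['input'], and the test list would be kept aside for the separate estimator run.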

In a future release would it be possible for Ray Tune to use k-fold cross-validation to train an RL policy using offline data?
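Until something like that exists, the folds could be built by hand. Below is a minimal sketch, assuming the offline data is a list of file names and that each file holds complete episodes. The generator name `k_fold_file_splits` is hypothetical and not a Ray Tune feature.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_file_splits(
    file_names: Sequence[str], k: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield k (train_files, validation_files) pairs over the file list."""
    files = list(file_names)
    if not 2 <= k <= len(files):
        raise ValueError("k must be between 2 and the number of files")
    # Round-robin assignment keeps fold sizes within one file of each other.
    folds = [files[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [f for j, fold in enumerate(folds) if j != i for f in fold]
        yield train, validation
```

Each pair would correspond to one training run, and the held-out OPE estimates would be averaged across the k runs.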

Regards,
Stefan

rapotdar · June 3, 2022, 12:42am

Hi Stefan, sorry for the late response!

You might be able to do the train-test split right now as follows:

config["input"] = training_input_files
config["input_evaluation"] = []
config["evaluation_config"] = {"input": validation_input_files, "input_evaluation": ["is", "wis"]}

Basically, you can use a separate eval dataset to come up with your OPE estimates on your eval worker through “evaluation_config”.

steff · July 20, 2022, 3:22pm

Hi Rohan,

Thank you for this suggestion, it works great.

I noticed that a doubly robust OPE class was recently added to master (ray/doubly_robust.py in ray-project/ray on GitHub) and that you are one of the contributors. Do you know when this class will be released?

Sven mentioned in another discussion that the RLlib team is also working on other SOTA off-policy estimation methods. Do you know which ones are being worked on, and when they’ll be released?

Thanks,
Stefan
