Offline data and off-policy estimation

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I’m using Ray Tune to train an RL policy with offline data and specifying weighted importance sampling as the tune metric. Evaluation duration is set to 1,000 and evaluation duration unit is set to episodes. Does that mean that for each sample Ray Tune will use 1,000 randomly selected episodes from the offline data for validation, and use the rest for training?

part of config file

config = {}
config['env'] = None
config['explore'] = False
config['input'] = offline_data_file_names
config['input_evaluation'] = ['is', 'wis']
config['evaluation_duration'] = 1000
config['evaluation_duration_unit'] = 'episodes'

part of tune.run invocation

from ray import tune
from ray.rllib.agents.dqn import DQNTrainer

analysis = tune.run(
    DQNTrainer,
    num_samples=10,
    config=config,
    metric='wis',
    mode='max',
    checkpoint_at_end=True,
)

Thanks,
Stefan

Hey @steff007 , thanks for the question!

Yes, I agree, this is confusing. Your input_evaluation setting has nothing to do with the other evaluation_... settings.

RLlib runs a) a training iteration (collect samples from the env or offline file, then update the model) and b) an evaluation step (run n episodes/timesteps using separate workers and report the results).

For the evaluation step (b):
evaluation_duration=1000 and evaluation_duration_unit=episodes mean that each evaluation step runs through 1000 episodes.

Now the OPE (off-policy estimation) setting, input_evaluation, refers to the training step. So the two methods, IS (importance sampling) and WIS (weighted importance sampling), are applied to the batches used for training(!), not to the batches/trajectories that the evaluation workers walk through.
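
For reference, here is a minimal sketch of what the two estimators compute, written in plain NumPy rather than taken from RLlib. It uses the textbook trajectory-wise form; RLlib's own estimators may differ in detail (for example, per-step weighting). The function name and the episode keys (rewards, behavior_probs, target_probs) are assumptions made for illustration only.

```python
import numpy as np


def is_and_wis(episodes, gamma=0.99):
    """Trajectory-wise IS and WIS estimates of the target policy's return.

    Each episode is a dict (hypothetical layout) with per-step lists:
      rewards         rewards logged by the behavior policy
      behavior_probs  probability the behavior policy gave the logged action
      target_probs    probability the target policy gives the same action
    """
    weights, returns = [], []
    for ep in episodes:
        rewards = np.asarray(ep["rewards"], dtype=float)
        ratios = (np.asarray(ep["target_probs"], dtype=float)
                  / np.asarray(ep["behavior_probs"], dtype=float))
        # Importance weight of the whole trajectory: product of per-step ratios.
        weights.append(np.prod(ratios))
        # Discounted return actually observed in the logged episode.
        discounts = gamma ** np.arange(len(rewards))
        returns.append(np.sum(discounts * rewards))

    weights = np.array(weights)
    returns = np.array(returns)
    # IS: plain average of the weighted returns (unbiased, high variance).
    is_estimate = np.mean(weights * returns)
    # WIS: normalize by the sum of the weights (biased, lower variance).
    wis_estimate = np.sum(weights * returns) / np.sum(weights)
    return is_estimate, wis_estimate
```

Whichever episodes are passed in determine what the estimate says: feeding the training batches gives an estimate on the training data, while feeding held-out episodes gives an estimate on data the policy has not been fit to.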

We should probably unify the input_evaluation settings with the evaluation_... settings or make this more clear in the config comments.

Hi Sven,

Thanks for your reply.

If I understand this correctly, when config['input_evaluation'] = ['is', 'wis'], the off-policy estimates that Ray Tune computes for each sample are based on the data that was used to train the corresponding policy?

If that's the case, it seems that we shouldn't rely on Ray Tune to identify the best hyperparameter values, since the corresponding policy could be overfit to the training data. Another approach would be to split the offline data into train and test sets, use the train data to train multiple policies with Ray Tune, and then separately evaluate those policies using ray.rllib.offline.is_estimator.ImportanceSamplingEstimator and/or ray.rllib.offline.wis_estimator.WeightedImportanceSamplingEstimator to identify the best policy. What do you think?
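
The split itself could look roughly like the sketch below. It is only an illustration in plain Python: split_offline_files is a hypothetical helper, not part of Ray, and it assumes that no episode spans more than one file, so that splitting by file keeps episodes intact.

```python
import random


def split_offline_files(file_names, test_fraction=0.2, seed=0):
    """Split a list of offline data files into (train_files, test_files)."""
    files = sorted(file_names)  # sort first so the split is reproducible
    if len(files) < 2:
        raise ValueError("need at least two files to make a train/test split")
    rng = random.Random(seed)
    rng.shuffle(files)
    # Hold out at least one file, but never all of them.
    n_test = min(len(files) - 1, max(1, round(len(files) * test_fraction)))
    return files[n_test:], files[:n_test]
```

The train list would then go into config['input'], and the test list would be kept aside for the separate off-policy evaluation.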

In a future release, would it be possible for Ray Tune to use k-fold cross-validation when training an RL policy on offline data?
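
To make the idea concrete, the folds could be built by hand outside of Tune along these lines. This is a sketch in plain Python, not a Ray feature: k_fold_file_splits is a hypothetical helper, and it again assumes that episodes do not span files.

```python
def k_fold_file_splits(file_names, k=5):
    """Yield (train_files, test_files) pairs, one per fold."""
    files = sorted(file_names)
    if not 2 <= k <= len(files):
        raise ValueError("k must be between 2 and the number of files")
    # Deal the files round-robin into k folds of near-equal size.
    folds = [files[i::k] for i in range(k)]
    for i in range(k):
        test_files = folds[i]
        train_files = [f for j, fold in enumerate(folds) if j != i for f in fold]
        yield train_files, test_files
```

Each pair would drive one training run on train_files followed by an off-policy estimate on test_files, and the estimates would be averaged over the k folds for every hyperparameter setting.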

Regards,
Stefan