When the off-policy evaluation associated with this code executes, it invokes WeightedImportanceSampling.estimate_on_single_episode(…). How can I get it to estimate over the whole dataset with the WeightedImportanceSampling.estimate_on_dataset(…) method that you mentioned?
Ah, my fault. Currently we only call estimate_on_dataset for bandit problems, i.e. when split_batch_by_episode = False.
Otherwise, we sample batches from the evaluation workers and call estimator.estimate(batch), which splits the batch into episodes and calls estimate_on_single_episode on each one. In this case the overall batch size is whatever your evaluation config specifies.
So in the above example, we evaluate on a batch of 10 episodes collected from 1 rollout worker.
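To make the per-episode path concrete, here is a rough, library-free sketch of the flow described above. It is not RLlib's code: the function names only mirror the ones in this thread, the step keys (eps_id, reward, old_prob, new_prob) are assumptions, and it uses plain per-decision importance sampling rather than the weighted variant to keep it short.

```python
from collections import defaultdict


def split_by_episode(batch):
    """Group a flat list of step dicts by 'eps_id' (assumed key), keeping step order."""
    episodes = defaultdict(list)
    for step in batch:
        episodes[step["eps_id"]].append(step)
    return list(episodes.values())


def estimate_on_single_episode(episode, gamma=1.0):
    """Plain per-decision importance sampling for one episode (hypothetical stand-in)."""
    ratio = 1.0       # cumulative product of new_prob / old_prob up to step t
    v_behavior = 0.0  # discounted return actually observed under the behavior policy
    v_target = 0.0    # importance-weighted return, an estimate for the target policy
    for t, step in enumerate(episode):
        ratio *= step["new_prob"] / step["old_prob"]
        v_behavior += gamma ** t * step["reward"]
        v_target += gamma ** t * ratio * step["reward"]
    return {"v_behavior": v_behavior, "v_target": v_target}


def estimate(batch, gamma=1.0):
    """Split the batch into episodes, estimate each one, and average the results."""
    per_episode = [
        estimate_on_single_episode(episode, gamma)
        for episode in split_by_episode(batch)
    ]
    n = len(per_episode)
    return {
        "v_behavior": sum(e["v_behavior"] for e in per_episode) / n,
        "v_target": sum(e["v_target"] for e in per_episode) / n,
        "num_episodes": n,
    }


# Toy batch: two episodes collected by the behavior policy.
batch = [
    {"eps_id": 0, "reward": 1.0, "old_prob": 0.5, "new_prob": 0.5},
    {"eps_id": 0, "reward": 1.0, "old_prob": 0.5, "new_prob": 0.5},
    {"eps_id": 1, "reward": 2.0, "old_prob": 0.25, "new_prob": 0.5},
]
result = estimate(batch)
```

With a batch of 10 episodes, estimate would run the single-episode routine 10 times and average, which is why the evaluation batch size controls how much data each estimate sees.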