I have offline data that contains training patterns with:
recordings of inputs from a physical machine
actions of a behavior policy (=a person) operating the machine
rewards for the performance of that behavior policy for each time step
I have now converted all these (episodes, actions and rewards) into a JSON file via the SampleBatchBuilder.
Now I am wondering - why do I still have to provide an environment, even if I set “explore=false”?
Wouldn’t the offline dataset contain all information needed for offline off-policy training?
Wrapping my data into an environment would be quite artificial, because the step function would only work for actions that the behavior policy (=person) has performed in the recordings.
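To make that limitation concrete, here is a minimal sketch of such a replay-style wrapper in plain Python. The class name `ReplayOnlyEnv` and the tuple layout of the log are hypothetical and chosen only for illustration. It shows that `step` can only answer for the action the operator actually took.

```python
class ReplayOnlyEnv:
    """Hypothetical wrapper that replays one recorded episode.

    `log` is a list of (observation, action, reward) tuples, one per time step.
    """

    def __init__(self, log):
        self.log = log
        self.t = 0

    def reset(self):
        self.t = 0
        return self.log[0][0]

    def step(self, action):
        _, recorded_action, reward = self.log[self.t]
        if action != recorded_action:
            # The recording says nothing about how the machine would have reacted.
            raise ValueError(f"no recorded outcome for action {action!r} at step {self.t}")
        self.t += 1
        done = self.t >= len(self.log)
        # After the last recorded step there is no further observation to return.
        next_obs = None if done else self.log[self.t][0]
        return next_obs, reward, done


log = [(0.1, "open", 1.0), (0.4, "close", 0.5)]
env = ReplayOnlyEnv(log)
first_obs = env.reset()
next_obs, reward, done = env.step("open")  # matches the recording
try:
    env.step("open")  # the recorded action at this step was "close"
    diverged = False
except ValueError:
    diverged = True
```

Any policy that deviates from the recording hits the error branch, so such a wrapper adds nothing beyond the dataset itself.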
Training a realistic surrogate model isn’t very feasible for the given data.
Why are the trainers for offline RL insisting on an environment?
Hey - I’m pretty new to RLlib, but you still need to specify the env because the algorithms query the env object for reasons other than obtaining the data; e.g., the CQL model checks that the environment’s action space is not discrete. I just create a very lightweight env (see below) and make sure the “disable_env_checking” environment option is set to True.
Hi @joshml, thanks for your feedback, and sorry, I forgot to check back on the thread.
Found a viable solution - one can provide the entries “observation_space” and “action_space” in the environment config of the offline algorithm. That way one doesn’t have to create an artificial environment, just to provide the space definitions.
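As a rough sketch of what this can look like with the `AlgorithmConfig`-style API (assuming a Ray 2.x version where `CQLConfig` is available, and gymnasium for the spaces; the exact method names differ between RLlib versions, and the path, shapes and bounds below are placeholders):

```python
import gymnasium as gym
import numpy as np
from ray.rllib.algorithms.cql import CQLConfig

# Illustrative spaces; shapes and bounds must match the recorded data.
obs_space = gym.spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
act_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)

config = (
    CQLConfig()
    # No env is given: only the space definitions are supplied.
    .environment(env=None, observation_space=obs_space, action_space=act_space)
    # Placeholder path to the JSON file written from the recordings.
    .offline_data(input_="/path/to/recordings.json")
)
```

With the older dict-style config, the same information would go into the “observation_space” and “action_space” entries mentioned above.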
Imho having this in the environment sub-config is a little confusing, because in this case it is more a specification of the offline dataset than the environment.
I think this is because in most toy examples (and all online tutorials I could find) the offline data is sampled from an existing model-based environment. That doesn’t make much sense: if I have an environment, and thus a model, I could use much more effective on-policy approaches that take advantage of being able to explore that model.
I think a more realistic tutorial, where the offline data really comes from e.g. a CSV file recorded from the sensors of a physical machine, would be much closer to the situation a typical offline RL user faces.
But long story short - it works that way!
Edit: Regarding the proposal to create a dummy environment: this also works, but one apparently has to create additional artificial step and reset functions that return an observation in the specified format; otherwise the environment checker will complain. So unless one can really implement a meaningful step function (so that pre-recorded data can be combined with freshly sampled exploration data), it seems easier to just set the observation and action spaces in the algorithm config.