Offline RL: passing reward data from .json into the environment

I am trying to build an environment for offline RL that uses custom data. I have followed the guide here (Working With Offline Data — Ray 2.6.3) for creating a JsonWriter to convert external experiences, and I have also been following here (Environments — Ray 2.6.3) for creating a custom environment.

I am working with medical patient data, so the reward is specific to each patient trajectory and based on some medical tests. I am wondering how I can pass these stored rewards from the converted experiences .json into the environment, or whether there is some other method for loading this reward data into the environment and syncing it with the right "eps_id"?

Any help is appreciated. Thank you!

@kris, thanks for the question. Good one! To me it sounds as if you are dealing with a problem that is close to the one handled in Converting external experiences to batch format. Do I understand your problem correctly?

In this case you do not need an environment; instead, you batch the experiences you collected from your medical tests, in a similar (but not identical) form to supervised learning. The example shows how to do this when you have an environment from which the samples come. The reward is simply one part of this batch data, as are observations, next observations, infos, eps_ids, agent_ids, timesteps t, etc.
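For illustration, here is a minimal sketch of how such a batch could be written, along the lines of the docs example, using RLlib's SampleBatchBuilder and JsonWriter. The output directory, the array shapes, and the patient_trajectories structure are made up for the sketch; the point is that the per-patient reward is just one more column next to eps_id:

import numpy as np
from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder
from ray.rllib.offline.json_writer import JsonWriter

# Toy stand-in for the per-patient data; in practice this comes from your records.
patient_trajectories = [
    [
        {"obs": np.zeros(32, np.float32), "action": 0, "reward": 1.0,
         "next_obs": np.zeros(32, np.float32)},
    ],
]

batch_builder = SampleBatchBuilder()
writer = JsonWriter("/tmp/patient-out")  # hypothetical output directory

for eps_id, trajectory in enumerate(patient_trajectories):
    for t, step in enumerate(trajectory):
        batch_builder.add_values(
            t=t,
            eps_id=eps_id,            # ties all steps of one patient together
            agent_index=0,
            obs=step["obs"],
            actions=step["action"],
            action_prob=1.0,          # behavior policy probability, if known
            action_logp=0.0,          # its log; needed later for off-policy estimation
            rewards=step["reward"],   # the per-patient reward from the medical tests
            dones=t == len(trajectory) - 1,
            infos={},
            new_obs=step["next_obs"],
        )
    writer.write(batch_builder.build_and_reset())

Every step of one patient shares the same eps_id, which is what later lets RLlib group the steps back into episodes.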

Please clarify if this is indeed your problem.

@Lars_Simon_Zehnder Yes, thank you. I found a solution in another thread. I wish the docs for converting external experiences to batch format had been a little more explicit that envs are not strictly necessary, and maybe provided an example like the one above in the thread.

My only remaining question is: how do I specify the offline data for both the training input and the evaluation input in the config?

config = (
    DQNConfig()
    .framework("tf2")
    .offline_data(
        input_config={
            "paths": ["/root/DRL/reward1/0/train/output-2023-09-10_19-16-56_worker-0_0"],
            "format": "json",
            "input": "dataset",
            "explore": False,
        },
    )
    .environment(
        observation_space=Dict({
            "obs": Box(low=-10000, high=100000, shape=(32,), dtype=np.float32)
        }),
        action_space=Discrete(2),
    )
    .debugging(log_level="INFO")
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=10,
        evaluation_num_workers=1,
        evaluation_duration_unit="episodes",
        evaluation_config={
            "paths": ["/root/DRL/reward1/0/test/output-2023-09-10_19-16-56_worker-0_0"],
            "format": "json",
            "explore": False,
            "input": "dataset",
        },
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
        },
    )
)

I ask this because the off-policy evaluation methods need the evaluation data to be a dataset, but this setup provides it as a sampler input.

@kris, great that you found some examples of how to proceed. As a rule of thumb, the configuration for the evaluation workers is identical to the one used in training (only in_evaluation is set to True, and the evaluation worker settings differ).
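As an untested sketch against the Ray 2.6 AlgorithmConfig API (the paths are simply the ones from your post, and I have moved input/explore out of input_config to the places they are normally read from), the training input would go through offline_data(input_=..., input_config=...) and the evaluation input through evaluation_config:

import numpy as np
from gymnasium.spaces import Box, Dict, Discrete
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.offline.estimators import ImportanceSampling, WeightedImportanceSampling

config = (
    DQNConfig()
    .framework("tf2")
    .environment(
        observation_space=Dict(
            {"obs": Box(low=-10000, high=100000, shape=(32,), dtype=np.float32)}
        ),
        action_space=Discrete(2),
    )
    # Training input: read the converted experiences as a Ray dataset.
    .offline_data(
        input_="dataset",
        input_config={
            "format": "json",
            "paths": ["/root/DRL/reward1/0/train/output-2023-09-10_19-16-56_worker-0_0"],
        },
    )
    .exploration(explore=False)
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=10,
        evaluation_num_workers=1,
        evaluation_duration_unit="episodes",
        # Evaluation input: same structure, pointing at the held-out file(s).
        evaluation_config={
            "explore": False,
            "input": "dataset",
            "input_config": {
                "format": "json",
                "paths": ["/root/DRL/reward1/0/test/output-2023-09-10_19-16-56_worker-0_0"],
            },
        },
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
        },
    )
)

With "input": "dataset" set inside evaluation_config, the IS/WIS estimators should read the held-out data as a dataset rather than from a sampler input.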

Regarding evaluation, there has been another issue on this board here. Usually you need an environment to roll out the policy online; in that thread SAC was suggested due to its similar setup.

If instead you want to estimate the policy's performance on an offline dataset, you need to provide action_logp keys in the dataset, as mentioned here.
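As a small sanity check (the path is the hypothetical output directory from the writer sketch above), you can read one batch back and verify that the behavior-policy probability columns the estimators need are actually present:

from ray.rllib.offline.json_reader import JsonReader

# Point this at one of the directories/files produced by the JsonWriter.
reader = JsonReader("/tmp/patient-out")
batch = reader.next()
assert "action_prob" in batch and "action_logp" in batch, (
    "Off-policy estimation (IS/WIS) needs behavior-policy action_prob/action_logp."
)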