RLlib not working with Tune when using sample batch input

My issue is that RLlib is not properly logging metrics back to Tune when training a model from sample batches. My config has no env set; it only uses input sample batches. When metrics come back to Tune, the episode reward mean is reported as “nan”, so a best trial cannot be found when searching hyperparameters.

A similar issue was posted here, but that fix only solved the problem for the PolicyClient/PolicyServerInput usage of RLlib. I am not using those objects here to train the DQN model; I am training the model directly on the cluster without using the server.

Can a similar fix be integrated for the use case below, where I do not use the PolicyClient objects directly? There seems to be an issue with the trainer object not having an env for Tune to interact with.

ENV_CONFIG = {
    'env': None,
    'observation_space': gym.spaces.Box(-np.inf, np.inf, shape=(STATE_DIM,)),
    'action_space': gym.spaces.Discrete(ACTION_DIM),
}
analysis = tune.run(dqn.DQNTrainer,
                    name='dqn-agent-model',
                    config={"framework": "torch",
                            "num_workers": 1,
                            "num_gpus": 0 if DEBUG else 1,
                            'batch_mode': 'complete_episodes',
                            "input": os.path.join(os.getcwd(), sample_batch_path),
                            **DQN_TRAIN_CONFIG,
                            **ENV_CONFIG,
                            },
                    stop={"training_iteration": 3},
                    keep_checkpoints_num=3,
                    local_dir=checkpoint_path,
                    sync_config=tune.SyncConfig(syncer=None)
                    )
best_trial = analysis.get_best_trial(metric='episode_reward_mean', mode='max', scope='all')
print(best_trial)
best = analysis.get_best_checkpoint(best_trial, metric='episode_reward_mean', mode='max')
print(best)
return best

@arturn are you able to help with this?

Hi @Jason_Weinberg ,

There will be no episode rewards if you do not step through an environment.
You can use OPE methods to get an estimate of how well your policy is doing.

Have a look at how we test OPEs to get an impression! 🙂

Would it not be possible to just integrate the dummy metric collector into the algorithm code itself? In a real-world scenario we don’t have environments, yet we do still have episode rewards. This seems to be a remnant of attachment to the idea of simulator environments.

Sorry, I’m unable to follow. Could you elaborate on…

In a real world scenario we don’t have environments and we do still have episode rewards.

?

If you, for example, deploy your trained policy as a controller in an industrial system, you might not have access to certain sensor values needed to compute rewards (though you could possibly gain access and compute them).
In offline learning you do not even step through the environment, so where would this data ever come from? It can only come from OPE, and that’s what we offer.

I’m sorry if there is a misunderstanding.

In my use case, for each episode we receive/record every state/action in real time as the agent interacts with our consumer application. We then query and attach rewards to those logs 7 days later, because we don’t know the rewards in real time; we just pick them up later using a query. We then take that logged data as offline batches and train with it. So we do have state/action/reward per step in episode format. We just want to train on that data later and use Ray Tune to help with hyperparameter search.

Thanks for the insight.
Let me rephrase my question:
Where do the rewards come from that you would like to see collected?
I assume they come from the deployment of your policy but are not collected during the training.

So they only reflect how good the last version of your trained policy was, not how good the policy is that you are training further from that data. Incidentally, that’s the same before your first training iteration every time, but that’s not generally the case. And if you want to know how your policy performed, I think it would be better to extract this from the data you collect yourself directly and have RLlib use OPE.

Yeah, I was thinking about this a little bit. I think OPE is the correct metric, but how would we tune hyperparameters from it using Ray Tune? Is this a use case that has not come up before? I can obviously train a model just fine as is with predetermined hyperparameters using OPE or any other metric. I really just wanted to integrate Ray Tune.

Maybe my use case is making this opaque to me. Can we not use a tuning algo then, since essentially Tune would need access to my “real” environment to test out its parameters, and we obviously can’t let that happen?

Is it fair to say that any chance of me using Tune would require going to a model-based approach and creating an env around that?

Maybe my use case is making this opaque to me. Can we not use a tuning algo then, since essentially Tune would need access to my “real” environment to test out its parameters, and we obviously can’t let that happen?

You can tune for a metric other than the mean episode reward, for example the OPE results!
You can even write your own code that returns just a number reflecting how well your Tune Trainable is doing and have Tune minimize/maximize that number. For example, similar to this:

dqn = DQN(...)
policy = dqn.workers.local_worker().get_policy()
[...]
def training_loop():
    # We don't care about these results since we compute our own estimate;
    # this function is just a black box to tune.
    results = dqn.train()
    sample_batch = synchronous_parallel_sample(worker_set=dqn.workers)
    evaluator = FeatureImportance(policy=policy, gamma=0.0, repeat=repeat)
    estimate = evaluator.estimate(sample_batch)
    return estimate
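
For instance, here is a generic sketch of that pattern, independent of RLlib: the dummy score only stands in for an OPE estimate or any other number you compute yourself, and “custom_score” is an arbitrary name.

from ray import tune
from ray.air import session

def my_trainable(config):
    # A real setup would call algo.train() and compute an OPE estimate here;
    # this dummy score only stands in for "a number Tune can optimize".
    score = -(config["lr"] - 0.001) ** 2
    session.report({"custom_score": score})

tuner = tune.Tuner(
    my_trainable,
    param_space={"lr": tune.grid_search([1e-4, 1e-3, 1e-2])},
    tune_config=tune.TuneConfig(metric="custom_score", mode="max"),
)
results = tuner.fit()
print(results.get_best_result().config)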

But if you switch on RLlib’s OPE, the results will be included in the results dictionary returned to Tune on every train step, under …["metrics"]["off_policy_estimator"]["<ope_name>"], and you can let Tune know what to optimize for with the metric argument.
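
Concretely, tying that back to the tune.run snippet at the top of this thread, the best trial could then be selected by the nested OPE key. The exact nesting varies between Ray versions (the path below matches the result dict shown further down in the thread), so check one result dict first:

best_trial = analysis.get_best_trial(
    metric="evaluation/off_policy_estimator/is/v_gain",
    mode="max",
    scope="all",
)

With the newer Tuner API, the same key can be passed via tune.TuneConfig(metric=..., mode=...).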

There are other OPEs than feature importance, but from the feature importance docstrings:

This implementation is also known as permutation importance that is defined to be the variation of the model’s prediction when a single feature value is randomly shuffled. In RLlib it is implemented as a custom OffPolicyEstimator which is used to evaluate RLlib policies without performing environment interactions.
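
As a rough, generic illustration of that idea (this is not RLlib’s estimator; predict_actions is just a stand-in for whatever maps a batch of observations to the policy’s discrete actions):

import numpy as np

def permutation_importance(predict_actions, observations, repeat=3, seed=0):
    # Score each observation feature by how often shuffling it changes the
    # policy's chosen (discrete) actions.
    rng = np.random.default_rng(seed)
    base_actions = predict_actions(observations)
    importances = []
    for feature_idx in range(observations.shape[1]):
        deltas = []
        for _ in range(repeat):
            shuffled = observations.copy()
            rng.shuffle(shuffled[:, feature_idx])  # shuffle one column in place
            deltas.append(float(np.mean(predict_actions(shuffled) != base_actions)))
        importances.append(float(np.mean(deltas)))
    return importances

# Toy check: only feature 0 influences this stand-in "policy",
# so it should get the largest score.
obs = np.random.default_rng(1).normal(size=(256, 4))
policy_fn = lambda o: (o[:, 0] > 0).astype(int)
print(permutation_importance(policy_fn, obs))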

Ok this is great. I found these docs to be helpful. Hoping I can use this example and expand on it. Thank you for the insights, especially into the training loop of tune!

from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.offline.estimators import (
    ImportanceSampling,
    WeightedImportanceSampling,
    DirectMethod,
    DoublyRobust,
)
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel

config = (
    DQNConfig()
    .environment(env="CartPole-v0")
    .framework("torch")
    .offline_data(input_="/tmp/cartpole-out")
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=10,
        evaluation_num_workers=1,
        evaluation_duration_unit="episodes",
        evaluation_config={"input": "/tmp/cartpole-eval"},
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
            "dm_fqe": {
                "type": DirectMethod,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
            "dr_fqe": {
                "type": DoublyRobust,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
        },
    )
)

algo = config.build()
for _ in range(100):
    algo.train()
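
If the OPE methods are switched on as above, the estimates should show up in the train results. A small sketch of pulling them out, assuming the nesting that appears in the result dict later in this thread (the keys may differ in other Ray versions):

results = algo.train()
ope_results = results["evaluation"]["off_policy_estimator"]
print(ope_results["is"]["v_gain"], ope_results["wis"]["v_gain"])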

Awesome! Looking forward to hearing from you in terms of how this turns out 🙂

Hey @arturn I tried to turn on the evaluation OPE metrics on the DQN model. Tune is still not picking up those metrics.

@arturn Do you have an example of using tune at a lower level with the training loop? I can’t find any good documentation on that.

Thanks for the picture! The arrow you drew points to what RLlib reports; Tune has no say in that.
Since RLlib is not reporting anything, the “issue” is not with Tune.

What version of Ray are you using?

Can you try the following modified version of your snippet on master:

from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.offline.estimators import (
    ImportanceSampling,
    WeightedImportanceSampling,
    DirectMethod,
    DoublyRobust,
)
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel

config = (
    DQNConfig()
    .environment(env="CartPole-v0")
    .framework("torch")
    .offline_data(input_="/Users/artur/code/ray/python/ray/rllib/tests/data/cartpole/large.json")
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=1000,
        evaluation_num_workers=1,
        evaluation_parallel_to_training=True,
        evaluation_duration_unit="episodes",
        evaluation_config={"input": "sampler"},
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
            "dm_fqe": {
                "type": DirectMethod,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
            "dr_fqe": {
                "type": DoublyRobust,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
        },
    )
)

from ray.tune import Tuner
from ray import air
stop = {"training_iteration": 1}

t = Tuner(
    "DQN", param_space=config.to_dict(), run_config=air.RunConfig(stop=stop, verbose=2))
results = t.fit()
print(results.get_best_result().metrics)

I’m getting the OPE results you are looking for on my end, so we should find out what the difference is between our setups. v_behavior should be the discounted return averaged over episodes!
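
For reference, a toy sketch of what that description corresponds to; the episodes and gamma below are made up:

import numpy as np

def discounted_return(rewards, gamma):
    # Sum of gamma**t * r_t over one episode.
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Two made-up logged episodes with reward 1.0 per step (CartPole-style);
# v_behavior would then be the mean of their discounted returns.
episodes = [[1.0] * 20, [1.0] * 50]
gamma = 0.99
print(float(np.mean([discounted_return(ep, gamma) for ep in episodes])))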

Hey @arturn, thank you for the example code. That looks about identical to what I was doing. The only difference is the .environment parameter; mine is:

ENV_CONFIG = {
    'env': None,
    'observation_space': gym.spaces.Box(-np.inf, np.inf, shape=(STATE_DIM,)),
    'action_space': gym.spaces.Discrete(ACTION_DIM),
}

Afaics, that’s not part of your reproduction script. But it should not matter.
Anyways: Does my code work on our nightly wheels on your end? And does your own code work on our nightly wheels?

My script is at the very top (it has env=None etc. in it). I linked to an example about offline learning from the Ray GitHub docs above; maybe that is what you are referring to?

To answer your question, yes, I do use the nightlies. I was hoping to catch an update that fixes this issue at some point. I also enjoy the new Ray API syntax; it is much better imo. (https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl)

I was able to copy and run your example with no errors. This is what I got back from your print statement. Interesting that the dr/dm methods return scores but is/wis do not.

'off_policy_estimation': {'is': {}, 'wis': {}, 'dm_fqe': {'loss': 0.9721615669772105}, 'dr_fqe': {'loss': 0.964671689641566}}}}

I ran your code with no environment and got the missing scores as below

# rllib train \
#     --run=PG \
#     --env=CartPole-v0 \
#     --config='{"framework": "torch", "output": "/Users/jweinbe3/Documents/ray_example", "output_max_file_size": 5000000}' \
#     --stop='{"timesteps_total": 100000}'

from gym import spaces
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.offline.estimators import (
    ImportanceSampling,
    WeightedImportanceSampling,
    DirectMethod,
    DoublyRobust,
)
import numpy as np
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel


DEBUG = True

config = (
    DQNConfig()
    .resources(num_gpus=0 if DEBUG else 1)
    .debugging(seed=42 if DEBUG else None)
    .environment(env=None, action_space=spaces.Discrete(2),
                 observation_space=spaces.Box(np.array([-4.8000002e+00, -3.4028235e+38, -4.1887903e-01, -3.4028235e+38]),
                                              np.array([4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38]), (4,), np.float32))
    .framework("torch")
    .offline_data(input_="/Users/jweinbe3/Documents/ray_example")
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=1000,
        evaluation_num_workers=1,
        evaluation_parallel_to_training=True,
        evaluation_duration_unit="episodes",
        evaluation_config={"input": "/Users/jweinbe3/Documents/ray_example"},
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
            "dm_fqe": {
                "type": DirectMethod,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
            "dr_fqe": {
                "type": DoublyRobust,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
        },
    )
)

from ray.tune import Tuner
from ray import air
stop = {"training_iteration": 1}

t = Tuner(
    "DQN", param_space=config.to_dict(), run_config=air.RunConfig(stop=stop))
results = t.fit()
print(results.get_best_result().metrics)

Output

{'evaluation': {'episode_reward_max': nan, 'episode_reward_min': nan, 'episode_reward_mean': nan, 'episode_len_mean': nan, 'episode_media': {}, 'episodes_this_iter': 0, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [], 'episode_lengths': []}, 'sampler_perf': {}, 'num_faulty_episodes': 0, 'num_agent_steps_sampled_this_iter': 200000, 'num_env_steps_sampled_this_iter': 200000, 'timesteps_this_iter': 200000, 'off_policy_estimator': {'is': {'v_behavior': 71.56536521593185, 'v_target': 18.059243003719995, 'v_gain': 0.31237773674484337, 'v_behavior_std': 20.91077007684791, 'v_target_std': 46.671140611645654, 'v_gain_std': 0.7329956392103542}, 'wis': {'v_behavior': 71.56536521593185, 'v_target': 68.42401369219736, 'v_gain': 1.0540700766850335, 'v_behavior_std': 20.91077007684791, 'v_target_std': 286.03094352690056, 'v_gain_std': 4.112121540270352}, 'dm_fqe': {'v_behavior': 71.56536521593185, 'v_target': 0.01854718171184442, 'v_gain': 0.00035035406752660864, 'v_behavior_std': 20.91077007684791, 'v_target_std': 0.000986783183631341, 'v_gain_std': 0.0005032335443440929}, 'dr_fqe': {'v_behavior': 71.56536521593186, 'v_target': 18.090232510308883, 'v_gain': 0.3128705622792621, 'v_behavior_std': 20.910770076847918, 'v_target_std': 46.70760504597745, 'v_gain_std': 0.7333744750405439}}, 'num_healthy_workers': 1, 'num_recreated_workers': 0}, 'custom_metrics': {}, 'episode_media': {}, 'num_recreated_workers': 0, 'info': {'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.26329100131988525, 'mean_q': 3.2651255130767822, 'min_q': 0.17295058071613312, 'max_q': 9.400323867797852, 'cur_lr': 0.0005}, 'td_error': array([ 0.49190593, -2.372336  ,  0.15597558,  0.46284115,  0.27084064,
       -0.8244903 , -0.16397142,  0.42400658, -0.5302119 , -0.58858585,
       -2.811394  , -2.8838887 , -0.4350536 , -1.3666189 , -0.24729967,
       -1.5492    , -1.6928301 , -1.4913654 ,  0.17717671, -1.9149044 ,
       -1.1952596 , -2.4955556 , -2.0318499 , -0.850914  ,  0.9374056 ,
        1.0244606 , -0.9451004 , -1.6365018 , -0.7415261 , -2.1879308 ,
       -0.53181267,  0.5695535 ], dtype=float32), 'mean_td_error': -0.8429511189460754, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'off_policy_estimation': {'is': {}, 'wis': {}, 'dm_fqe': {'loss': 0.9612238324300886}, 'dr_fqe': {'loss': 0.9596758003567933}}}}

@arturn I actually take that last post back. I just updated to the most recent nightly wheel (mine was like a week or so old) and it is now reporting the metrics!

{'is': {'v_behavior': 54.574182327979905, 'v_behavior_std': 16.582854133983922, 'v_target': 15.009755936128146, 'v_target_std': 10.516786201028937, 'v_gain': 0.2947487260962527, 'v_delta': -39.56442639185177}, 'wis': {'v_behavior': 54.574182327979905, 'v_behavior_std': 16.582854133983922, 'v_target': 54.57418232797991, 'v_target_std': 30.0197917089694, 'v_gain': 1.0, 'v_delta': 7.787548383930699e-15}, 'dm_fqe': {'v_behavior': 54.574182327979905, 'v_behavior_std': 16.582854133983922, 'v_target': -0.00041203076, 'v_target_std': 0.00033796096, 'v_gain': -7.881986724245273e-06, 'v_delta': -54.57459435874042}, 'dr_fqe': {'v_behavior': 54.57418232797991, 'v_behavior_std': 16.582854133983925, 'v_target': 15.011100837622504, 'v_target_std': 10.517293632581179, 'v_gain': 0.2947735725546284, 'v_delta': -39.563081490357405}}