I ran your code without an environment and the "missing" scores do show up; see the output below.
# The offline dataset used below was generated with:
# rllib train \
# --run=PG \
# --env=CartPole-v0 \
# --config='{"framework": "torch", "output": "/Users/jweinbe3/Documents/ray_example", "output_max_file_size": 5000000}' \
# --stop='{"timesteps_total": 100000}'
from gym import spaces
import numpy as np

from ray import air
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.offline.estimators import (
    ImportanceSampling,
    WeightedImportanceSampling,
    DirectMethod,
    DoublyRobust,
)
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel
from ray.tune import Tuner

DEBUG = True

config = (
    DQNConfig()
    .resources(num_gpus=0 if DEBUG else 1)
    .debugging(seed=42 if DEBUG else None)
    # No live env: training is purely offline, so the action/observation
    # spaces (here CartPole-v0's) must be given explicitly.
    .environment(
        env=None,
        action_space=spaces.Discrete(2),
        observation_space=spaces.Box(
            np.array([-4.8000002e+00, -3.4028235e+38, -4.1887903e-01, -3.4028235e+38]),
            np.array([4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38]),
            (4,),
            np.float32,
        ),
    )
    .framework("torch")
    .offline_data(input_="/Users/jweinbe3/Documents/ray_example")
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=1000,
        evaluation_num_workers=1,
        evaluation_parallel_to_training=True,
        evaluation_duration_unit="episodes",
        # The evaluation workers also read from the offline dataset.
        evaluation_config={"input": "/Users/jweinbe3/Documents/ray_example"},
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
            "dm_fqe": {
                "type": DirectMethod,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
            "dr_fqe": {
                "type": DoublyRobust,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
        },
    )
)

stop = {"training_iteration": 1}

t = Tuner("DQN", param_space=config.to_dict(), run_config=air.RunConfig(stop=stop))
results = t.fit()
print(results.get_best_result().metrics)
Output (truncated):
{'evaluation': {'episode_reward_max': nan, 'episode_reward_min': nan, 'episode_reward_mean': nan,
                'episode_len_mean': nan, 'episode_media': {}, 'episodes_this_iter': 0,
                'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {},
                'custom_metrics': {}, 'hist_stats': {'episode_reward': [], 'episode_lengths': []},
                'sampler_perf': {}, 'num_faulty_episodes': 0,
                'num_agent_steps_sampled_this_iter': 200000, 'num_env_steps_sampled_this_iter': 200000,
                'timesteps_this_iter': 200000,
                'off_policy_estimator': {
                    'is': {'v_behavior': 71.56536521593185, 'v_target': 18.059243003719995,
                           'v_gain': 0.31237773674484337, 'v_behavior_std': 20.91077007684791,
                           'v_target_std': 46.671140611645654, 'v_gain_std': 0.7329956392103542},
                    'wis': {'v_behavior': 71.56536521593185, 'v_target': 68.42401369219736,
                            'v_gain': 1.0540700766850335, 'v_behavior_std': 20.91077007684791,
                            'v_target_std': 286.03094352690056, 'v_gain_std': 4.112121540270352},
                    'dm_fqe': {'v_behavior': 71.56536521593185, 'v_target': 0.01854718171184442,
                               'v_gain': 0.00035035406752660864, 'v_behavior_std': 20.91077007684791,
                               'v_target_std': 0.000986783183631341, 'v_gain_std': 0.0005032335443440929},
                    'dr_fqe': {'v_behavior': 71.56536521593186, 'v_target': 18.090232510308883,
                               'v_gain': 0.3128705622792621, 'v_behavior_std': 20.910770076847918,
                               'v_target_std': 46.70760504597745, 'v_gain_std': 0.7333744750405439}},
                'num_healthy_workers': 1, 'num_recreated_workers': 0},
 'custom_metrics': {}, 'episode_media': {}, 'num_recreated_workers': 0,
 'info': {'learner': {'default_policy': {
     'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.26329100131988525,
                       'mean_q': 3.2651255130767822, 'min_q': 0.17295058071613312,
                       'max_q': 9.400323867797852, 'cur_lr': 0.0005},
     'td_error': array([ 0.49190593, -2.372336  ,  0.15597558,  0.46284115,  0.27084064,
                        -0.8244903 , -0.16397142,  0.42400658, -0.5302119 , -0.58858585,
                        -2.811394  , -2.8838887 , -0.4350536 , -1.3666189 , -0.24729967,
                        -1.5492    , -1.6928301 , -1.4913654 ,  0.17717671, -1.9149044 ,
                        -1.1952596 , -2.4955556 , -2.0318499 , -0.850914  ,  0.9374056 ,
                         1.0244606 , -0.9451004 , -1.6365018 , -0.7415261 , -2.1879308 ,
                        -0.53181267,  0.5695535 ], dtype=float32),
     'mean_td_error': -0.8429511189460754, 'model': {}, 'custom_metrics': {},
     'num_agent_steps_trained': 32.0,
     'off_policy_estimation': {'is': {}, 'wis': {},
                               'dm_fqe': {'loss': 0.9612238324300886},
                               'dr_fqe': {'loss': 0.9596758003567933}}}}
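So the estimates aren't missing: they're under evaluation -> off_policy_estimator. If you only want those scores rather than the full result dict, here's a minimal sketch that continues from the script above (the key paths are taken straight from the printed output):

# Pull just the OPE scores out of the Tune result from the script above.
metrics = results.get_best_result().metrics
ope_scores = metrics["evaluation"]["off_policy_estimator"]
for name, scores in ope_scores.items():
    print(
        f"{name}: v_behavior={scores['v_behavior']:.2f}, "
        f"v_target={scores['v_target']:.2f}, v_gain={scores['v_gain']:.3f}"
    )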