Roll out CQL policy

Hello!

I am working with offline RL, specifically CQL. I have trained a policy offline on my preferred data. The policy is stored in a checkpoint that I would like to restore. Since I want to evaluate the policy online in my environment, I make some small adjustments to the configuration. The code looks something like this:

config['env'] = my_env
config['input'] = 'sampler'

trainer = CQLTrainer(config=config)
trainer.restore(checkpoint_path)

Running this I get the error:

ValueError: Unknown offline input! config['input'] must either be list of offline files (json) or a D4RL-specific InputReader specifier (e.g. 'd4rl.hopper-medium-v0').

This does not make sense to me. How can I evaluate a policy created by CQL online if it cannot use the sampler input?

I am using Ray 1.4.0.

Hi @fksvensson ,

and welcome to the board. First of all, I would like to ask: do you need this specific version of Ray because your algorithm depends on some settings that are now deprecated, or because you are using an external environment that requires it? Otherwise I would suggest installing v1.8.0, or at least v1.7.1.

Then, the input hyperparameter defines an input for the Offline API when you want to train on already collected data. In your case you want to evaluate your algorithm online by collecting data live, so the env hyperparameter should suffice. Try commenting out config['input'] = 'sampler' and see if that makes it run.
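Roughly like this (untested sketch on my side; MyEnv and checkpoint_path are just placeholders for your environment class and checkpoint path):

from ray.rllib.agents.cql import CQLTrainer

config = {
    "env": MyEnv,  # placeholder: your live environment class
    # no "input" key here, so nothing from the Offline API is requested explicitly
}

trainer = CQLTrainer(config=config)
trainer.restore(checkpoint_path)  # placeholder: path to your CQL checkpoint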

Hope this helps

Hello Lars and thank you for your fast answer.

I started the project with Ray 1.4.0 and I am afraid I will have compatibility issues with the rest of my code if I switch versions. Do you know if there are any major changes in the offline RL API that would motivate the change?

Even when I do not set a specific input and only declare the env in the config, I get the same issue, since 'sampler' is the default input.

 Unknown offline input! config['input'] must either be list of offline files (json) or a D4RL-specific InputReader specifier (e.g. 'd4rl.hopper-medium-v0').

Do you know if there is any other trick around this?

Best,
Frida

Hi @fksvensson ,

in version 1.5.0 there was an enhancement of the input API:

Added new "input API" for customizing offline datasets (shoutout to Julius F.). (#16957)

See [rllib] Enhancements to Input API for customizing offline datasets by juliusfrost · Pull Request #16957 · ray-project/ray · GitHub for the PR that was included.

You can still create a virtual environment and test out the newest Ray version.

Regarding your issue, it's hard to make remote guesses. Can you share your config here?

Best,
Simon

Hi again!

I am using Pipenv and it seems like I cannot install anything >1.5.2. I did try that version, and it is incompatible with the input reader that I have created for my data.

So when I first asked the question, I loaded the config that I trained with, but changed the input to 'sampler'. This is the config:

{'Q_model': {'fcnet_activation': 'relu', 'fcnet_hiddens': [256, 256]},
 'bc_iters': 0,
 'clip_actions': True,
 'env': <class '__main__.OTADummy'>,
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_workers': 0,
 'framework': 'tfe',
 'horizon': 200,
 'input': 'sampler',
 'input_evaluation': [],
 'learning_starts': 256,
 'metrics_smoothing_episodes': 5,
 'n_step': 3,
 'no_done_at_end': True,
 'normalize_actions': True,
 'num_gpus': 0,
 'num_workers': 0,
 'optimization': {'actor_learning_rate': 0.0003,
                  'critic_learning_rate': 0.0003,
                  'entropy_learning_rate': 0.0003},
 'policy_model': {'fcnet_activation': 'relu', 'fcnet_hiddens': [256, 256]},
 'prioritized_replay': False,
 'rollout_fragment_length': 1,
 'soft_horizon': True,
 'target_entropy': 'auto',
 'target_network_update_freq': 1,
 'tau': 0.005,
 'timesteps_per_iteration': 1000,
 'train_batch_size': 256}

When you suggested commenting out 'sampler', I skipped loading the params entirely, started with an empty config and only set the env, resulting in:

{'env': <class '__main__.OTADummy'>}

Both of these get the mentioned error.

Best,
Frida

Hi @fksvensson ,

I guess the Python version might be the problem. I would use pyenv (if you are on macOS you can install it with brew). Pull Python 3.9 and then use:

pyenv install 3.9.4
mkdir ray-1.8.0 
cd ray-1.8.0
pyenv local 3.9.4
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install numpy tensorflow
python -m pip install "ray[default]" 

and you should be good to go. That's the setup I use, so I can choose Python versions and install the packages into a virtual env with that version.

Regarding your problem:

  1. Have you registered your input reader via register_input("custom_input", input_creator) before requesting it in your config? (See the sketch after this list.)
  2. Have you tested your input reader? Does it really read in the samples and return a SampleBatch as needed?
  3. How did you write your outputs? Did you use a specific output writer you coded yourself or did you use the default?
  4. Do you actually need the input at all, or can you deploy your policy by running the environment in the background?
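
For 1., a rough sketch of what I mean (MyInputReader and "custom_input" are placeholders; this assumes a Ray version that has the input registry, i.e. >= 1.5.0):

from ray.tune.registry import register_input

def input_creator(ioctx):
    # ioctx is the IOContext RLlib hands in; return your own InputReader subclass here
    return MyInputReader(ioctx)

# register the reader under a name ...
register_input("custom_input", input_creator)

# ... and request it by that name in the trainer config
config["input"] = "custom_input"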

I actually do not understand why you include all the training parameters when you want to do evaluation. Evaluation can be done quite easily by running

rllib rollout \
    ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 \
    --run DQN --env CartPole-v0 --steps 10000

So you pass your checkpoint, then the policy (DQN here), then your env and the number of steps you want to evaluate. The workers sample experiences from the environment using the trained policy, so there is no need for an input sampler. You can also do custom evaluation as shown in this example here.
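
If you would rather do it in a script than via the CLI, a rough sketch with the same DQN/CartPole placeholders as above (swap in your own trainer class, config and checkpoint path) could look like this:

import os
import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

# restore the trained policy from the checkpoint
trainer = DQNTrainer(config={"env": "CartPole-v0", "num_workers": 0})
trainer.restore(os.path.expanduser(
    "~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1"))

# step the live environment with the restored policy
env = gym.make("CartPole-v0")
obs = env.reset()
done, episode_reward = False, 0.0
while not done:
    action = trainer.compute_action(obs, explore=False)  # greedy evaluation
    obs, reward, done, _ = env.step(action)
    episode_reward += reward
print("episode reward:", episode_reward)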

If you still need something else, you can post your code here and we can take a look at it.

Best, Simon

Hi @fksvensson,

CQL was not designed to run with sampled data, only offline datasets. An easy way to do inference with it is to switch to SAC; the models for both should be the same. If you would rather hack CQL to accept a sampler, then have a look at this issue:


@mannyv ,

interesting insight. So there is actually no way to evaluate the already trained CQL policy other than by using another trainer? From the viewpoint of a practitioner I cannot completely follow this decision: usually I collect data from an environment with a behavioral policy, then train offline, and once my policy is trained I also want to evaluate it inside the environment (at least for some interesting hours :grinning_face_with_smiling_eyes:).

However, as far as I understood, it is only the replay buffer that does not get filled? Can we not simply turn it off, since we want to do online evaluation anyway, and have the trainer simply pull samples from the rollout workers?
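
For example, would something along these lines work (untested sketch on my side; the offline path is just a placeholder)? It keeps the offline data for training and lets separate evaluation workers sample from the live env:

config.update({
    "input": ["/path/to/offline/data.json"],  # placeholder: the offline dataset
    # run evaluation episodes against the live env in between
    "evaluation_interval": 1,
    "evaluation_num_workers": 1,
    "evaluation_config": {
        "input": "sampler",  # the eval workers collect experiences online
        "explore": False,
    },
})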

Thank you both for your comments, I'm really glad to see that offline RL is an engaging subject.

According to https://pypi.org/pypi/ray/1.8.0/json it seems like it should be compatible with Python 3.7, but it looks like it does not support my current macOS:

curl -sL https://pypi.org/pypi/ray/1.8.0/json | jq '.releases["1.8.0"][].filename' | grep -i macos
"ray-1.8.0-cp36-cp36m-macosx_10_15_intel.whl"
"ray-1.8.0-cp37-cp37m-macosx_10_15_intel.whl"
"ray-1.8.0-cp38-cp38-macosx_10_15_x86_64.whl"
"ray-1.8.0-cp38-cp38-macosx_11_0_arm64.whl"
"ray-1.8.0-cp39-cp39-macosx_10_15_x86_64.whl"
"ray-1.8.0-cp39-cp39-macosx_11_0_arm64.whl"

I am working on 10_14 right now. Even if I were to update, I would update to 11_0 on an Intel machine, which is not supported. If I do need Ray 1.8.0, I would use a container, but it seems like that might not be my problem right now.

@mannyv, your suggestion does sound interesting. I did try

trainer = SACTrainer(config=config)
trainer.restore(checkpoint_path)

with my CQL checkpoint, but it looks like the weights do not match. Do you think I could make them match by adapting hyperparameters, or did I misunderstand you?

I will try the hack and get back to you.

Thanks,
Frida