Roll out CQL policy

Hello!

I am working with offline RL, specifically CQL. I have trained a policy offline on my preferred data. The policy is stored in a checkpoint that I would like to restore. Since I want to evaluate the policy online in my environment, I make some small adjustments to the configuration. The code looks something like this:

config['env'] = my_env
config['input'] = 'sampler'

trainer = CQLTrainer(config=config)
trainer.restore(checkpoint_path)

Running this I get the error:

ValueError: Unknown offline input! config['input'] must either be list of offline files (json) or a D4RL-specific InputReader specifier (e.g. 'd4rl.hopper-medium-v0').

This does not make sense to me. How can I evaluate a policy created by CQL online if it cannot use the sampler input?

I am using Ray 1.4.0.

Hi @fksvensson ,

and welcome to the board. First of all, I would like to ask: do you need this specific version of Ray because your algorithm depends on some settings that are now deprecated, or because you are using an external environment that requires it? Otherwise I would suggest installing v1.8.0, or at least v1.7.1.

Then, the input hyperparameter defines an input for the Offline API when you want to train on already collected data. In your case you want to evaluate your algorithm online by collecting data live, so the env hyperparameter should suffice. Try commenting out config['input'] = 'sampler' and see if that makes it run.
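Roughly like this (untested sketch on my side; MyEnv and checkpoint_path are just placeholders for your environment class and checkpoint path):

from ray.rllib.agents.cql import CQLTrainer

config = {
    "env": MyEnv,  # placeholder: your live environment class
    # no "input" key here, so nothing from the Offline API is requested explicitly
}

trainer = CQLTrainer(config=config)
trainer.restore(checkpoint_path)  # placeholder: path to your CQL checkpoint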

Hope this helps

Hello Lars and thank you for your fast answer.

I started the project with Ray 1.4.0 and I am afraid I will have compatibility issues with the rest of my code if I switch versions. Do you know if there are any major changes in the offline RL API that would motivate the change?

Even when I do not set a specific input and only declare the env in the config, I get the same issue, since 'sampler' is the default input.

 Unknown offline input! config['input'] must either be list of offline files (json) or a D4RL-specific InputReader specifier (e.g. 'd4rl.hopper-medium-v0').

Do you know if there is any other trick around this?

Best,
Frida

Hi @fksvensson ,

in version 1.5.0 there was an enhancement of the input API:

Added new "input API" for customizing offline datasets (shoutout to Julius F.). (#16957)

See [rllib] Enhancements to Input API for customizing offline datasets by juliusfrost · Pull Request #16957 · ray-project/ray · GitHub for the PR that was included.

You can still create a virtual environment and test out the newest Ray version.

Regarding your issue, it's hard to make remote guesses. Can you share your config here?

Best,
Simon

Hi again!

I am using Pipenv and it seems like I cannot install anything >1.5.2. I did try that version, and it is incompatible with the input reader that I have created for my data.

So when I first asked the question, I loaded the config that I trained with, but changed the input to 'sampler'. This is the config:

{'Q_model': {'fcnet_activation': 'relu', 'fcnet_hiddens': [256, 256]},
 'bc_iters': 0,
 'clip_actions': True,
 'env': <class '__main__.OTADummy'>,
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_workers': 0,
 'framework': 'tfe',
 'horizon': 200,
 'input': 'sampler',
 'input_evaluation': [],
 'learning_starts': 256,
 'metrics_smoothing_episodes': 5,
 'n_step': 3,
 'no_done_at_end': True,
 'normalize_actions': True,
 'num_gpus': 0,
 'num_workers': 0,
 'optimization': {'actor_learning_rate': 0.0003,
                  'critic_learning_rate': 0.0003,
                  'entropy_learning_rate': 0.0003},
 'policy_model': {'fcnet_activation': 'relu', 'fcnet_hiddens': [256, 256]},
 'prioritized_replay': False,
 'rollout_fragment_length': 1,
 'soft_horizon': True,
 'target_entropy': 'auto',
 'target_network_update_freq': 1,
 'tau': 0.005,
 'timesteps_per_iteration': 1000,
 'train_batch_size': 256}

When you suggested commenting out 'sampler', I skipped loading the params entirely, started with an empty config and only set the env, resulting in:

{'env': <class '__main__.OTADummy'>}

Both of these get the mentioned error.

Best,
Frida

Hi @fksvensson ,

I guess the Python version might be the problem. I would use pyenv (if you are on macOS you can install it with brew). Pull Python 3.9 and then use:

pyenv install 3.9.4
mkdir ray-1.8.0 
cd ray-1.8.0
pyenv local 3.9.4
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install numpy tensorflow
python -m pip install "ray[default]" 

and you should be good to go. That's the setup I use, so I can choose Python versions and install the packages into a virtual env with that version.

Regarding your problem:

  1. Have you registered your input reader via register_input("custom_input", input_creator) before requesting it in your config? (See the sketch after this list.)
  2. Have you tested your input reader? Does it really read in the samples and return a SampleBatch as needed?
  3. How did you write your outputs? Did you use a specific output writer you coded yourself or did you use the default?
  4. Do you actually need the input at all, or can you deploy your policy by running the environment in the background?
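
For 1., a rough sketch of what I mean (MyInputReader and "custom_input" are placeholders; this assumes a Ray version that has the input registry, i.e. >= 1.5.0):

from ray.tune.registry import register_input

def input_creator(ioctx):
    # ioctx is the IOContext RLlib hands in; return your own InputReader subclass here
    return MyInputReader(ioctx)

# register the reader under a name ...
register_input("custom_input", input_creator)

# ... and request it by that name in the trainer config
config["input"] = "custom_input"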

I actually do not understand why you include all the training parameters when you want to do evaluation. Evaluation can be done quite easily by running

rllib rollout \
    ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 \
    --run DQN --env CartPole-v0 --steps 10000

So you pass your checkpoint, then the policy (DQN here), then your env and the number of steps you want to evaluate. The workers sample experiences from the environment using the trained policy, so there is no need for an input sampler. You can also do custom evaluation as shown in this example here.
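
If you would rather do it in a script than via the CLI, a rough sketch with the same DQN/CartPole placeholders as above (swap in your own trainer class, config and checkpoint path) could look like this:

import os
import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

# restore the trained policy from the checkpoint
trainer = DQNTrainer(config={"env": "CartPole-v0", "num_workers": 0})
trainer.restore(os.path.expanduser(
    "~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1"))

# step the live environment with the restored policy
env = gym.make("CartPole-v0")
obs = env.reset()
done, episode_reward = False, 0.0
while not done:
    action = trainer.compute_action(obs, explore=False)  # greedy evaluation
    obs, reward, done, _ = env.step(action)
    episode_reward += reward
print("episode reward:", episode_reward)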

If you still need something else, you can post your code here and we can take a look at it.

Best, Simon

Hi @fksvensson,

CQL was not designed to run with sampled data, only offline datasets. An easy way to do inference with it is to switch to SAC; the models for both should be the same. If you would rather hack CQL to accept a sampler, then have a look at this issue:


@mannyv ,

interesting insight. So there is actually no way to evaluate the already trained CQL policy other than by using another trainer? From the viewpoint of a practitioner I cannot completely follow this decision: usually I collect data from an environment with a behavioral policy, then train offline, and once my policy is trained I also want to evaluate it inside the environment (at least for some interesting hours :grinning_face_with_smiling_eyes:).

However, as far as I understood, it is only the replay buffer that does not get filled? Can we not simply turn it off, since we want to do online evaluation anyway, and have the trainer simply pull samples from the rollout workers?
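
For example, would something along these lines work (untested sketch on my side; the offline path is just a placeholder)? It keeps the offline data for training and lets separate evaluation workers sample from the live env:

config.update({
    "input": ["/path/to/offline/data.json"],  # placeholder: the offline dataset
    # run evaluation episodes against the live env in between
    "evaluation_interval": 1,
    "evaluation_num_workers": 1,
    "evaluation_config": {
        "input": "sampler",  # the eval workers collect experiences online
        "explore": False,
    },
})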

Thank you both for your comments, I'm really glad to see that offline RL is an engaging subject.

According to https://pypi.org/pypi/ray/1.8.0/json it seems like it should be compatible with Python 3.7, but it looks like it does not support my current macOS:

curl -sL https://pypi.org/pypi/ray/1.8.0/json | jq '.releases["1.8.0"][].filename' | grep -i macos
"ray-1.8.0-cp36-cp36m-macosx_10_15_intel.whl"
"ray-1.8.0-cp37-cp37m-macosx_10_15_intel.whl"
"ray-1.8.0-cp38-cp38-macosx_10_15_x86_64.whl"
"ray-1.8.0-cp38-cp38-macosx_11_0_arm64.whl"
"ray-1.8.0-cp39-cp39-macosx_10_15_x86_64.whl"
"ray-1.8.0-cp39-cp39-macosx_11_0_arm64.whl"

I am working on 10_14 right now. Even if I were to update, I would update to 11_0 on an Intel machine, which is not supported. If I do need Ray 1.8.0, I would use a container, but it seems like that might not be my problem right now.

@mannyv, your suggestion does sound interesting. I did try

trainer = SACTrainer(config=config)
trainer.restore(checkpoint_path)

with my CQL checkpoint, but it looks like the weights do not match. Do you think I could make them match by adapting hyperparameters, or did I misunderstand you?

I will try the hack and get back to you.

Thanks,
Frida