Hello there, I’m somewhat of a newbie here. I am trying to reproduce some of the Atari results with R2D2, but I have not been able to get it to work. I’ve been blocked on this for quite some time, and it would be a great help if anyone could point me in the right direction.
Thank you.
Here is a guide that many people use when approaching these problems. You will have to make your problem more specific: where exactly do the issues lie? Are you tuning hyperparameters? Which ones? What does your config look like?
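If it helps, a quick way to capture the exact config you ended up with is to dump it with RLlib's pretty_print helper. A minimal sketch, assuming the old ray.rllib.agents API you appear to be using:

from ray.tune.logger import pretty_print
from ray.rllib.agents.dqn.r2d2 import R2D2_DEFAULT_CONFIG

# Print the full default R2D2 config so it can be pasted into a post.
print(pretty_print(R2D2_DEFAULT_CONFIG))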
from ray.rllib.agents.dqn.r2d2 import R2D2_DEFAULT_CONFIG, R2D2Trainer
from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
from ray.rllib.agents.trainer import Trainer
print(R2D2Trainer)
R2D2_DEFAULT_CONFIG = Trainer.merge_trainer_configs(
    DQN_DEFAULT_CONFIG,  # See keys in dqn.py, which are also supported.
    {
        # 'env': 'PongNoFrameskip-v4',
        'env': 'PongDeterministic-v4',
        # Learning rate for the Adam optimizer.
        "lr": 1e-4,
        # Discount factor.
        "gamma": 0.997,
        # Train batch size (in number of single timesteps).
        "train_batch_size": 64 * 20,
        # Adam epsilon hyperparameter.
        "adam_epsilon": 1e-3,
        # Run in parallel by default.
        "num_workers": 64,
        # Batch mode must be complete_episodes.
        "batch_mode": "complete_episodes",
        # === Replay buffer ===
        "replay_buffer_config": {
            # For now we don't use the new ReplayBuffer API here.
            "_enable_replay_buffer_api": False,
            "type": "MultiAgentReplayBuffer",
            "prioritized_replay": True,
            "prioritized_replay_alpha": 0.6,
            # Beta parameter for sampling from the prioritized replay buffer.
            "prioritized_replay_beta": 0.4,
            # Epsilon to add to the TD errors when updating priorities.
            "prioritized_replay_eps": 1e-6,
            # Size of the replay buffer (in sequences, not timesteps).
            "capacity": 50000,
            "learning_starts": 10000,
            # Set automatically: The number of contiguous environment steps to
            # replay at once. Will be calculated via model->max_seq_len + burn_in.
            # Do not set this to any valid value!
            "replay_sequence_length": -1,
        },
        'model': {
            'use_lstm': True,
        },
        "rollout_fragment_length": 4,
        # If True, assume a zero-initialized state input (no matter where in
        # the episode the sequence is located).
        # If False, store the initial states along with each SampleBatch, use
        # it (as initial state when running through the network for training),
        # and update that initial state during training (from the internal
        # state outputs of the immediately preceding sequence).
        "zero_init_states": True,
        # If > 0, use the `burn_in` first steps of each replay-sampled sequence
        # (starting either from all 0.0-values if `zero_init_state=True` or
        # from the already stored values) to calculate an even more accurate
        # initial state for the actual sequence (starting after this burn-in
        # window). In the burn-in case, the actual length of the sequence
        # used for loss calculation is `n - burn_in` time steps
        # (n=LSTM's/attention net's max_seq_len).
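        # Example (numbers from the R2D2 paper, not RLlib defaults): the paper
        # replays stored sequences of length 80 with a burn-in of 40, i.e. the
        # first 40 steps only warm up the recurrent state and do not contribute
        # to the loss.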
"burn_in": 0,
# Whether to use the h-function from the paper [1] to scale target
# values in the R2D2-loss function:
# h(x) = sign(x)(|x| + 1 − 1) + εx
"use_h_function": True,
# The epsilon parameter from the R2D2 loss function (only used
# if `use_h_function`=True.
"h_function_epsilon": 1e-3,
        # Update the target network every `target_network_update_freq` steps.
        "target_network_update_freq": 2500,
        # Experimental flag.
        # If True, the execution plan API will not be used. Instead,
        # a Trainer's `training_iteration` method will be called as-is each
        # training iteration.
        "_disable_execution_plan_api": False,
    },
    _allow_unknown_configs=True,
)
# R2D2_DEFAULT_CONFIG['env_config'] = {
#     "parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, (1,))
# }
R2D2_DEFAULT_CONFIG['framework'] = 'torch'

algo = R2D2Trainer(config=R2D2_DEFAULT_CONFIG)
i = 0
while True:
    i += 1
    results = algo.train()
    print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")
Thank you so much for responding; this is the code I am currently running. My issue is essentially that it does not learn: even after playing with different hyperparameters, the mean episode reward stays stuck around -20.
Thanks again.
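For reference, the use_h_function / h_function_epsilon entries in the config above correspond to the invertible value-rescaling function h from the R2D2 paper. Below is a minimal NumPy sketch of h and its inverse, reconstructed from the paper's formula; it is my own illustration of what these settings control, not RLlib's internal implementation.

import numpy as np

def h(x, epsilon=1e-3):
    # Invertible value rescaling: h(x) = sign(x) * (sqrt(|x| + 1) - 1) + epsilon * x
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + epsilon * x

def h_inverse(y, epsilon=1e-3):
    # Exact algebraic inverse of h, applied to target values before the Bellman backup.
    return np.sign(y) * (
        ((np.sqrt(1.0 + 4.0 * epsilon * (np.abs(y) + 1.0 + epsilon)) - 1.0)
         / (2.0 * epsilon)) ** 2
        - 1.0
    )

# Round-trip check: h_inverse(h(x)) should recover x.
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.allclose(h_inverse(h(x)), x))  # True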