PPO + custom torch model causes KeyError: 'seq_lens' In tower 0 on device cpu

Hello!

Currently I am working with PPO and trying to implement a custom torch model. As the environment I use the mountain car, or rather a class inherited from it. Unfortunately I get an error message and cannot identify the problem. I will put the error report as well as my Python implementation below.
I would be very grateful for any hint, since I have been trying to track down the issue for several days!

I use:
Ray 2.2.0
Python 3.10

/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/bin/python /mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/main.py
2023-02-25 13:20:39,656 INFO worker.py:1538 – Started a local Ray instance.
2023-02-25 13:20:41,073 WARNING algorithm_config.py:488 – Cannot create PPOConfig from given config_dict! Property max_seq_len not supported.
2023-02-25 13:20:41,087 INFO algorithm.py:501 – Current log_level is WARN. For more information, set ‘log_level’: ‘INFO’ / ‘DEBUG’ or use the -v and -vv flags.
(pid=1033785)
(RolloutWorker pid=1035681) 2023-02-25 13:20:44,604 WARNING env.py:147 – Your env doesn’t have a .spec.max_episode_steps attribute. This is fine if you have set ‘horizon’ in your config dictionary, or soft_horizon. However, if you haven’t, ‘horizon’ will default to infinity, and your environment will not be reset.
2023-02-25 13:20:44,652 WARNING util.py:66 – Install gputil for GPU system monitoring.
2023-02-25 13:20:44,667 WARNING deprecation.py:47 – DeprecationWarning: _get_slice_indices has been deprecated. This will raise an error in the future!
Traceback (most recent call last):
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py”, line 1128, in _worker
self.loss(model, self.dist_class, sample_batch)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py”, line 92, in loss
B = len(train_batch[SampleBatch.SEQ_LENS])
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/policy/sample_batch.py”, line 818, in __getitem__
value = dict.__getitem__(self, key)
KeyError: ‘seq_lens’

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/main.py”, line 77, in <module>
results = trainer.train()
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py”, line 367, in train
raise skipped from exception_cause(skipped)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py”, line 364, in train
result = self.step()
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py”, line 749, in step
results, train_iter_ctx = self._run_one_training_iteration()
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py”, line 2623, in _run_one_training_iteration
results = self.training_step()
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py”, line 329, in training_step
train_results = train_one_step(self, train_batch)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/execution/train_ops.py”, line 52, in train_one_step
info = do_minibatch_sgd(
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/utils/sgd.py”, line 129, in do_minibatch_sgd
local_worker.learn_on_batch(
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py”, line 1013, in learn_on_batch
info_out[pid] = policy.learn_on_batch(batch)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py”, line 24, in wrapper
return func(self, *a, **k)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py”, line 616, in learn_on_batch
grads, fetches = self.compute_gradients(postprocessed_batch)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py”, line 24, in wrapper
return func(self, *a, **k)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py”, line 816, in compute_gradients
tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py”, line 1213, in _multi_gpu_parallel_grad_calc
raise last_result[0] from last_result[1]
ValueError: seq_lens
tracebackTraceback (most recent call last):
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py”, line 1128, in _worker
self.loss(model, self.dist_class, sample_batch)
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py”, line 92, in loss
B = len(train_batch[SampleBatch.SEQ_LENS])
File “/mnt/hdd/PROJEKTE/PHD/01_work/DQL_Python/LunarLander_DoubleDeepQ_GitLabRepo/deep_q/PPO_Baseline/venv/lib/python3.10/site-packages/ray/rllib/policy/sample_batch.py”, line 818, in __getitem__
value = dict.__getitem__(self, key)
KeyError: ‘seq_lens’

In tower 0 on device cpu

Process finished with exit code 1

#####################################################################
Here starts my implementation
#####################################################################


main.py

import os
import numpy as np
import tensorboard
import gym

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.models import ModelCatalog
from ray.tune.logger import UnifiedLogger
from ray.tune.result import DEFAULT_RESULTS_DIR

from ego_torch_model_v04 import EgoTorchModelV04
from ego_tf_model import EgoTFModel
from ego_environment import EgoEnvironment

config = PPOConfig()

ModelCatalog.register_custom_model("ego_model", EgoTorchModelV04)

ray.init(local_mode=True)

trainer = PPOTrainer(config={
    "env": EgoEnvironment,
    "env_config": {"dummy": 42},
    "num_workers": 1,
    "num_envs_per_worker": 1,
    "framework": "torch",
    "batch_mode": "truncate_episodes",
    "train_batch_size": 14,
    "max_seq_len": 1,
    "horizon": 10,
    "sgd_minibatch_size": 5,
    "num_gpus": 0,
    "simple_optimizer": True,
    "model": {"custom_model": "ego_model"}
})

for n in range(1):
    results = trainer.train()


ego_torch_model_v04.py

import numpy as np
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

torch, nn = try_import_torch()
tf1, tf, tfv = try_import_tf()

class EgoTorchModelV04(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)

        num_inputs = obs_space.shape[0]
        hidden_size = 128

        self.critic = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)
        )

        self.actor = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_outputs),
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        x = input_dict["obs"]
        value = self.critic(x)
        self.value = value
        mu = self.actor(x)

        return mu, list(value)

    @override(ModelV2)
    def value_function(self):
        return torch.reshape(self.value, [-1])

ego_environment.py

import gym
from gym.envs.classic_control.continuous_mountain_car import Continuous_MountainCarEnv

class EgoEnvironment(Continuous_MountainCarEnv):
    def __init__(self, config):
        super().__init__(self)
        self.config = config

    def reset(self):
        return super().reset()

    def step(self, action):
        return super().step(action)

Since I am new here, I hope I have done everything correctly. If you need more information, please let me know.
Many thanks in advance!!

MRMarlies

Hi @MRMarlies ,

Can you use the “Preformatted Text” feature of Discourse to format the code?
Right now it is quite hard for me to parse.
Please also consider opening a GitHub issue with a complete, self-contained reproduction script (not multiple files).

Dear arturn!

First of all many thanks for your reply!!
Sorry for the inconvenience; I am new here and not yet aware of all the features, so your instructions are much appreciated. Thanks!

Today I used the “Preformatted Text” feature and prepared a single file. The versions are still the same: Ray 2.2.0 and Python 3.10. Running the Python script below returns the following error:

KeyError: ‘seq_lens’
In tower 0 on device cpu

import os
from datetime import date
import numpy as np
import tempfile
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.logger import UnifiedLogger, DEFAULT_LOGGERS
from ray.tune.result import DEFAULT_RESULTS_DIR
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.misc import normc_initializer as normc_init_torch
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from gym.envs.classic_control.continuous_mountain_car import Continuous_MountainCarEnv

torch, nn = try_import_torch()
tf1, tf, tfv = try_import_tf()

ray.init(local_mode=True)

class MountainCar(Continuous_MountainCarEnv):
    def __init__(self, config):
        super().__init__(self)
        self.config = config

    def reset(self):
        return super().reset()

    def step(self, action):
        return super().step(action)

class TorchModelV04(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name,  **kwarg):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)

        hidden_size = 256
        num_inputs = obs_space.shape[0]

        self.critic = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)
        )

        self.actor = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_outputs),
            nn.Tanh())

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        x = input_dict["obs"]
        v_values = self.critic(x)
        self.activations_layer_1 = list(self.critic.parameters())[1]
        self.v_values = v_values
        mean_std = self.actor(x)
        return mean_std, list(v_values)

    @override(ModelV2)
    def value_function(self):
        return torch.reshape(self.v_values, [-1])

    @override(ModelV2)
    def metrics(self):
        return {"activations_layer_1": list((self.activations_layer_1).detach().numpy())}


def custom_logger_creator():
    timestr = date.today().strftime("%Y-%m-%d_%H-%M-%S")
    logdir_prefix = "{}_{}_{}".format("PPO", "MountainCar", timestr)
    def logger_creator(config):
        print('DEFAULT_RESULTS_DIR')
        print(DEFAULT_RESULTS_DIR)

        if not os.path.exists(DEFAULT_RESULTS_DIR):
            os.makedirs(DEFAULT_RESULTS_DIR)
        logdir = tempfile.mkdtemp(prefix=logdir_prefix, dir=DEFAULT_RESULTS_DIR)
        loggers = list(DEFAULT_LOGGERS)
        #loggers.append(CustomLogger)
        return UnifiedLogger(config, logdir, loggers=loggers)
    return logger_creator

ppo_config = PPOConfig()
ModelCatalog.register_custom_model("torch_model_v04", TorchModelV04)

trainer = PPOTrainer(config={
        "env": MountainCar,
        "env_config": {"marvin": 42},
        "num_workers": 10,
        "num_envs_per_worker": 1,
        "framework": "torch",
        "batch_mode": "complete_episodes",
        "train_batch_size": 32,
        "horizon": 1000,
        "sgd_minibatch_size": 32,
        "num_gpus": 0,
        "entropy_coeff": 0.0001,
        "simple_optimizer": True,
        "model": {"custom_model": "torch_model_v04", "custom_model_config": {"marvin": 42}}
        }, logger_creator = custom_logger_creator()
)

for n in range(10000000):
    results = trainer.train()
    path_to_checkpoint = trainer.save()
    print(path_to_checkpoint)

I tried to debug the scripts, and I assume the error comes from ppo_torch_policy.py. It seems to me that the following condition causes the error:

        if state:
            B = len(train_batch[SampleBatch.SEQ_LENS])
            max_seq_len = logits.shape[0] // B
            mask = sequence_mask(
                train_batch[SampleBatch.SEQ_LENS],
                max_seq_len,
                time_major=model.is_time_major(),
            )
            mask = torch.reshape(mask, [-1])
            num_valid = torch.sum(mask)

            def reduce_mean_valid(t):
                return torch.sum(t[mask]) / num_valid

        # non-RNN case: No masking.
        else:
            mask = None
            reduce_mean_valid = torch.mean
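
If I read the RLlib source correctly, the "state" checked here is simply the second value returned by my model's forward pass on the train batch, a few lines above the quoted block. Paraphrased from ppo_torch_policy.py (the exact names may differ slightly):

        logits, state = model(train_batch)
        curr_action_dist = dist_class(logits, model)
        # The "if state:" check quoted above then chooses between the RNN
        # code path (which needs "seq_lens" in the batch) and the non-RNN path.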

Since I am not sure what really causes the problem, I kindly ask for your expert knowledge.

Best regards,
MRMarlies

Hi @MRMarlies,

This is your error:

return mean_std, list(v_values) 

The forward method of the model is not supposed to return the values there. The second return value is the new state. Because list(v_values) is non-empty, the PPO loss takes the RNN code path (the "if state:" branch you quoted) and looks for "seq_lens" in the train batch, which is not there for a non-recurrent model. Since you are not using a model with state, you can just pass back the unmodified input state:

return mean_std, state
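
For reference, a minimal sketch of the corrected model (based on the model you posted; the class name is just a placeholder, everything else can stay as in your script):

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_torch

torch, nn = try_import_torch()

class FixedTorchModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)

        num_inputs = obs_space.shape[0]
        hidden_size = 256

        # Critic: maps observations to a single state-value estimate.
        self.critic = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )
        # Actor: maps observations to num_outputs values (for RLlib's default
        # diagonal Gaussian this is 2 * action_dim: means and log-stds).
        self.actor = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_outputs),
            nn.Tanh(),
        )
        self._v_values = None

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        x = input_dict["obs"]
        self._v_values = self.critic(x)  # cache for value_function()
        mean_std = self.actor(x)
        # The fix: return the unmodified (empty) input state instead of the
        # value tensor, so RLlib does not take the RNN code path in the loss.
        return mean_std, state

    @override(ModelV2)
    def value_function(self):
        return torch.reshape(self._v_values, [-1])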

Hello mannyv!

Many thanks for your fast reply and support!! You solved my problem.

Greetings,
MRMarlies