LSTM wrapper giving issue when used with trainer.compute_single_action

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I need to use the LSTM wrapper in a quite complex application, but I’ve been getting issues, so I applied the LSTM wrapper to a simple example to first figure out how it works.

Everything works well until the testing starts and trainer.compute_single_action is called. The example is reproducible from the notebook I put on my GitHub: FinRL/LSTM_easy_example.ipynb at f363c1e2d65496c608df7d22e1dac46f5ea962f2 · NicoleRichards1998/FinRL · GitHub

I can paste it in here if that would help. I think I’m missing something small, but I’m so lost.

Hi @NicoleRichards1998 ,

Sorry that you ran into this blocker. I ran it on my machine with Ray 1.11.0 and got the same error. I then modified a couple of things and still got an error, though a different one; that one has to be fixed in the source code (I will make a PR tomorrow, and below I show you the workaround for now).

Here is the code:

import gym
from ray.rllib.agents.ppo import PPOTrainer
import numpy as np
import pandas as pd


# Define your problem using python and openAI's gym API:
class ParrotEnv(gym.Env):
    """Environment in which an agent must learn to repeat the seen observations.
    Observations are float numbers indicating the to-be-repeated values,
    e.g. -1.0, 5.1, or 3.2.
    The action space is always the same as the observation space.
    Rewards are r=-abs(observation - action), for all steps.
    """

    def __init__(self, config):
        # Make the space (for actions and observations) configurable.
        self.action_space = config.get(
            "parrot_shriek_range", gym.spaces.Box(-1.0, 1.0, shape=(), dtype=np.float32)
        )
        # Since actions should repeat observations, their spaces must be the
        # same.
        self.observation_space = self.action_space
        self.cur_obs = None
        self.episode_len = 0

    def reset(self):
        """Resets the episode and returns the initial observation of the new one."""
        # Reset the episode len.
        self.episode_len = 0
        # Sample a random number from our observation space.
        self.cur_obs = self.observation_space.sample()
        # Return initial observation.
        return self.cur_obs

    def step(self, action):
        """Takes a single step in the episode given `action`
        Returns: New observation, reward, done-flag, info-dict (empty).
        """
        # Set `done` flag after 10 steps.
        self.episode_len += 1
        done = self.episode_len >= 10
        # r = -abs(obs - action)
        ### CHANGED THIS ###
        # `action` is always a single value, and `sum()` needs an
        # iterable, so the original call errored out. We can simply
        # drop the `sum()`.
        reward = -abs(self.cur_obs - action)
        # Set a new observation (random sample).
        self.cur_obs = self.observation_space.sample()
        return self.cur_obs, reward, done, {}


# Create an RLlib Trainer instance to learn how to act in the above
# environment.
trainer = PPOTrainer(
    config={
        # Env class to use (here: our gym.Env sub-class from above).
        "env": ParrotEnv,
        # Config dict to be passed to our custom env's constructor.
        "env_config": {"parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, shape=(), dtype=np.float32)},
        # Parallelize environment rollouts.
        "num_workers": 0,
        "model":{
            "use_lstm": True,
            "lstm_cell_size": int(256),
            "lstm_use_prev_action": True,
            "lstm_use_prev_reward": True,
        }
    }
)

# Train for n iterations and report results (mean episode rewards).
# Since we have to guess 10 times and the optimal reward is 0.0
# (exact match between observation and action value),
# we can expect to reach an optimal episode reward of 0.0.
for i in range(2):
    results = trainer.train()
    print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")

# Perform inference (action computations) based on given env observations.
# Note that we are using a slightly simpler env here (-3.0 to 3.0, instead
# of -5.0 to 5.0!), however, this should still work as the agent has
# (hopefully) learned to "just always repeat the observation!".
env = ParrotEnv({"parrot_shriek_range": gym.spaces.Box(-3.0, 3.0, shape=(), dtype=np.float32)})
# Get the initial observation (some value between -3.0 and 3.0).
obs = env.reset()
state = [np.zeros([256], dtype=np.float64) for _ in range(2)]
prev_a = 0.0
prev_r = 0.0
done = False
total_reward = 0.0
# Play one episode.
while not done:
    # Compute a single action, given the current observation
    # from the environment.
    action, state, _ = trainer.compute_single_action(obs, state, prev_action=prev_a, prev_reward=prev_r)
    # Apply the computed action in the environment.
    obs, reward, done, info = env.step(action)
    prev_a = action
    prev_r = reward
    # Sum up rewards for reporting purposes.
    total_reward += reward
# Report results.
print(f"Played 1 episode; total-reward={total_reward}")

You will get an error when running this because a shape value turns into a float in the source code. Here is a fix for now that you can apply to the tf_action_dist.py file on your system (under python<your-version>/site-packages/ray/rllib/models/tf/). Change the following

@override(ActionDistribution)
def required_model_output_shape(
        action_space: gym.Space,
        model_config: ModelConfigDict) -> Union[int, np.ndarray]:
    return np.prod(action_space.shape) * 2

to

@override(ActionDistribution)
def required_model_output_shape(
        action_space: gym.Space,
        model_config: ModelConfigDict) -> Union[int, np.ndarray]:
    return int(np.prod(action_space.shape)) * 2
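To illustrate why the `int(...)` cast matters, here is a minimal NumPy-only sketch. It assumes the action space has the empty shape `()` used in the example above; the variable names are made up for illustration and are not part of RLlib.

```python
import numpy as np

# A Box space created with shape=() has an empty shape tuple.
empty_shape = ()

# np.prod over an empty sequence returns a NumPy float (1.0), not an int,
# so the "required output shape" becomes a float as well.
raw = np.prod(empty_shape) * 2

# Casting to int first keeps the result a plain Python int.
fixed = int(np.prod(empty_shape)) * 2

print(type(raw).__name__, raw)
print(type(fixed).__name__, fixed)

# For a non-empty shape such as (3,), np.prod already returns an integer
# type, which is probably why the problem only shows up with shape=().
vector_shape = (3,)
print(type(np.prod(vector_shape)).__name__)
```

The cast is harmless for non-empty shapes, so applying it unconditionally covers both cases.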

Hope this helps
Simon

@sven1977 I will PR this tomorrow

Thank you so much for looking into this issue!

Under which class in the tf_action_dist.py file must this change be made? (I’m sorry if this is a dumb question, the backend of these algorithms is very new to me)

The learning curve is sometimes steep with RLlib, as a big focus lies on industry-readiness. I agree on this.

So, the distribution class is the DiagGaussian one. Sorry for not mentioning it in my first answer.

@NicoleRichards1998 ,

I have raised an issue.

PR is out.

@avnishn, @sven1977, @gjoliver can one of you check and merge?

@NicoleRichards1998, the PR is merged into master. You can either use the nightly build or wait for the next release.

@Lars_Simon_Zehnder that is incredibly exciting! Thank you so much!

When will the next release be? I tried to find out how to install a nightly build, but Google had no good answers for me.

@NicoleRichards1998, I cannot tell when the next release comes, but you can take a look here to see how to install the nightly build.

Thank you so much for the resources! It works now, I’m so grateful for all your help

Hi, sorry, I’m back again. I still get an issue when I use the LSTM wrapper in the application I need it for. I made another Colab notebook and tried to keep it as simple as possible so we can focus only on the problem at hand:

Similar to the easy LSTM example we worked with, it works during training but when I call trainer.compute_single_action I get a ValueError: Cannot feed value of shape (1,) for Tensor default_policy/prev_actions:0, which has shape (?, 1).

I did download the nightly, so the tf_action_dist.py file should be correct, and the reward function returns a single value. Am I missing something?
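A note on the shapes in that error message, offered as one possible reading rather than a confirmed diagnosis. The sketch below uses NumPy only. It assumes that the action space has shape (1,) and that a single previous action gets wrapped into a batch of one before it is fed to the `prev_actions` tensor; neither is stated in the thread. `add_batch_dim` is a hypothetical helper for illustration, not an RLlib function.

```python
import numpy as np

# Hypothetical stand-in for an action space of shape (1,).
action_shape = (1,)


def add_batch_dim(value):
    """Wrap a single value into a batch of size one (illustrative helper)."""
    return np.asarray([value], dtype=np.float32)


# A plain scalar as the initial previous action, as in the easy example.
scalar_prev_action = 0.0
# A previous action that matches the action space's shape.
shaped_prev_action = np.zeros(action_shape, dtype=np.float32)

print(add_batch_dim(scalar_prev_action).shape)  # (1,)
print(add_batch_dim(shaped_prev_action).shape)  # (1, 1)
```

Under those assumptions, a scalar previous action ends up with shape (1,), which cannot be fed into a tensor of shape (?, 1), while a value shaped like the action space ends up as (1, 1). If that matches the notebook, checking the shape of the initial `prev_action` against `env.action_space.shape` would be a reasonable first step.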