LSTM wrapper giving issue when used with trainer.compute_single_action

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I need to use the LSTM wrapper in a quite complex application, but I’ve been getting issues, so I applied the LSTM wrapper to a simple example to first figure out how it works.

Everything works well until the testing starts and trainer.compute_single_action is called. The example is reproducible from the file I put on my GitHub: FinRL/LSTM_easy_example.ipynb at f363c1e2d65496c608df7d22e1dac46f5ea962f2 · NicoleRichards1998/FinRL · GitHub

I can paste it in here if that would help. I think I’m missing something small, but I’m so lost.

Hi @NicoleRichards1998 ,

Sorry that you ran into this blocker. I tried to run it on my machine with ray 1.11.0 and got the same error. I then modified a couple of things and still got an error, though a different one that has to be fixed in the source code (I will make a PR tomorrow; the workaround for now is shown below).

Here is the code:

import gym
from ray.rllib.agents.ppo import PPOTrainer
import numpy as np
import pandas as pd


# Define your problem using python and openAI's gym API:
class ParrotEnv(gym.Env):
    """Environment in which an agent must learn to repeat the seen observations.
    Observations are float numbers indicating the to-be-repeated values,
    e.g. -1.0, 5.1, or 3.2.
    The action space is always the same as the observation space.
    Rewards are r=-abs(observation - action), for all steps.
    """

    def __init__(self, config):
        # Make the space (for actions and observations) configurable.
        self.action_space = config.get(
            "parrot_shriek_range", gym.spaces.Box(-1.0, 1.0, shape=(), dtype=np.float32)
        )
        # Since actions should repeat observations, their spaces must be the
        # same.
        self.observation_space = self.action_space
        self.cur_obs = None
        self.episode_len = 0

    def reset(self):
        """Resets the episode and returns the initial observation of the new one."""
        # Reset the episode len.
        self.episode_len = 0
        # Sample a random number from our observation space.
        self.cur_obs = self.observation_space.sample()
        # Return initial observation.
        return self.cur_obs

    def step(self, action):
        """Takes a single step in the episode given `action`
        Returns: New observation, reward, done-flag, info-dict (empty).
        """
        # Set `done` flag after 10 steps.
        self.episode_len += 1
        done = self.episode_len >= 10
        # r = -abs(obs - action)
        ### CHANGED THIS ###
        # `action` is always a single value, and `sum()` needs an iterable,
        # so calling `sum()` here raises an error. We can simply drop it.
        reward = -abs(self.cur_obs - action)
        # Set a new observation (random sample).
        self.cur_obs = self.observation_space.sample()
        return self.cur_obs, reward, done, {}


# Create an RLlib Trainer instance to learn how to act in the above
# environment.
trainer = PPOTrainer(
    config={
        # Env class to use (here: our gym.Env sub-class from above).
        "env": ParrotEnv,
        # Config dict to be passed to our custom env's constructor.
        "env_config": {"parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, shape=(), dtype=np.float32)},
        # Parallelize environment rollouts.
        "num_workers": 0,
        "model":{
            "use_lstm": True,
            "lstm_cell_size": int(256),
            "lstm_use_prev_action": True,
            "lstm_use_prev_reward": True,
        }
    }
)

# Train for n iterations and report results (mean episode rewards).
# Since we have to guess 10 times and the optimal reward is 0.0
# (exact match between observation and action value),
# we can expect to reach an optimal episode reward of 0.0.
for i in range(2):
    results = trainer.train()
    print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")

# Perform inference (action computations) based on given env observations.
# Note that we are using a slightly simpler env here (-3.0 to 3.0, instead
# of -5.0 to 5.0!), however, this should still work as the agent has
# (hopefully) learned to "just always repeat the observation!".
env = ParrotEnv({"parrot_shriek_range": gym.spaces.Box(-3.0, 3.0, shape=(), dtype=np.float32)})
# Get the initial observation (some value between -3.0 and 3.0).
obs = env.reset()
state = [np.zeros([256], dtype=np.float64) for _ in range(2)]
prev_a = 0.0
prev_r = 0.0
done = False
total_reward = 0.0
# Play one episode.
while not done:
    # Compute a single action, given the current observation
    # from the environment.
    action, state, _ = trainer.compute_single_action(obs, state, prev_action=prev_a, prev_reward=prev_r)
    # Apply the computed action in the environment.
    obs, reward, done, info = env.step(action)
    prev_a = action
    prev_r = reward
    # Sum up rewards for reporting purposes.
    total_reward += reward
# Report results.
print(f"Played 1 episode; total-reward={total_reward}")

You will get an error when running this because a shape value becomes a float in the source code. Here is a fix for now that you can apply to the tf_action_dist.py file on your system (under python<your-version>/site-packages/ray/rllib/models/tf/). Change the following

@override(ActionDistribution)
def required_model_output_shape(
        action_space: gym.Space,
        model_config: ModelConfigDict) -> Union[int, np.ndarray]:
    return np.prod(action_space.shape) * 2

to

@override(ActionDistribution)
def required_model_output_shape(
        action_space: gym.Space,
        model_config: ModelConfigDict) -> Union[int, np.ndarray]:
    return int(np.prod(action_space.shape)) * 2
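
To see why the cast matters: for a scalar Box space with shape=(), np.prod over the empty shape returns a NumPy float (1.0), so the original expression yields 2.0 rather than an integer. For a non-empty shape such as (3,), np.prod already returns an integer type, which is presumably why the problem only shows up with shape=(). A minimal sketch using only NumPy (the variable names are just for illustration, this is not RLlib code):

```python
import numpy as np

# Shape of a scalar Box space such as gym.spaces.Box(-1.0, 1.0, shape=()).
action_shape = ()

# The product over an empty shape is the float 1.0, so the result is a float.
unpatched = np.prod(action_shape) * 2

# Casting to int first gives a plain Python integer.
patched = int(np.prod(action_shape)) * 2

print(type(unpatched).__name__, unpatched)
print(type(patched).__name__, patched)
```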

Hope this helps
Simon

@sven1977 I will PR this tomorrow

Thank you so much for looking into this issue!

Under which class in the tf_action_dist.py file must this change be made? (I’m sorry if this is a dumb question, the backend of these algorithms is very new to me)

The learning curve is sometimes steep with RLlib, as a big focus lies on industry-readiness. I agree on this.

So, the distribution class is the DiagGaussian one. Sorry for not mentioning it in my first answer.

@NicoleRichards1998 ,

I have raised an issue.

PR is out.

@avnishn, @sven1977, @gjoliver can one of you check and merge?

@NicoleRichards1998, the PR has been merged into master. You can either use the nightly build or wait for the next release.

@Lars_Simon_Zehnder that is incredibly exciting! Thank you so much!

When will the next release be? I tried to find out how to install a nightly build, but Google had no good answers for me.

@NicoleRichards1998, I cannot tell when the next release comes, but you can take a look here to see how to install the nightly build.

Thank you so much for the resources! It works now, I’m so grateful for all your help

Hi, sorry, I’m back again. I still get an issue when I use the LSTM wrapper in the use case I need it for. I made another Colab notebook and tried to keep it as simple as possible so we can focus only on the problem at hand:

Similar to the easy LSTM example we worked with, it works during training but when I call trainer.compute_single_action I get a ValueError: Cannot feed value of shape (1,) for Tensor default_policy/prev_actions:0, which has shape (?, 1).

I did download the nightly, so the tf_action_dist.py file should be correct, and the reward function returns a single value. Am I missing something?
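
For reference, the shapes in that message can be reproduced with plain NumPy. This is only an illustration of what (1,) versus (?, 1) means. It assumes (not confirmed from the notebook) that the environment's action space has shape (1,) and that a scalar was passed as prev_action; add_batch_dim is a hypothetical helper, not an RLlib function.

```python
import numpy as np


def add_batch_dim(value):
    """Hypothetical helper: prepend a batch axis of size 1 to a single value."""
    return np.expand_dims(np.asarray(value, dtype=np.float32), axis=0)


# A scalar previous action becomes a batch of shape (1,), which does not fit (?, 1).
scalar_prev_action = 0.0
print(add_batch_dim(scalar_prev_action).shape)

# A previous action shaped like a (1,) action space becomes (1, 1), which fits (?, 1).
vector_prev_action = np.zeros((1,), dtype=np.float32)
print(add_batch_dim(vector_prev_action).shape)
```

If that assumption holds, initialising the previous action as an array with the action space's shape, rather than as 0.0, may be worth trying.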