Error when changing the trajectory_view_api example to a continuous action space

The trajectory_view_api example uses a discrete action model. I want to change it to a continuous action model, but I keep getting errors after the change.
obs space:

Tuple((
    Box(-5000, 5000, (18,), dtype=np.float32),
    Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32),
))

action space:
Box(np.array([-1., -1., -1., -1.]), np.array([+1., +1., +1., +1.]), dtype=np.float32)

I have two questions:
1. A continuous action has to be output via a probability distribution rather than per-action probabilities. How should I set this up inside the model? (See the sketch after these questions.)
2. In the forward function, states is used for RNNs. What is seq_lens used for?
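
For context on question 1: with a Box action space, RLlib's default continuous distribution is a diagonal Gaussian, so the model's last layer has to output the distribution parameters (a mean and a log-std per action dimension), i.e. num_outputs == 2 * action_dim; there is no softmax over action probabilities as in the discrete case. For question 2, seq_lens only matters for recurrent models: it holds the true (unpadded) length of each sequence in the sampled batch, and a purely feed-forward model like the one below can ignore it. A minimal sketch (the variable names here are illustrative, not taken from the script below):

import numpy as np
from gym.spaces import Box

# The 4-dim continuous action space from above.
action_space = Box(np.array([-1., -1., -1., -1.]),
                   np.array([+1., +1., +1., +1.]), dtype=np.float32)

# With the default DiagGaussian distribution, the policy head emits
# 2 * action_dim values: the first half are means, the second half log-stds.
action_dim = int(np.prod(action_space.shape))  # 4
num_outputs = 2 * action_dim                   # 8

# At sampling time RLlib roughly does:
#   mean, log_std = split(model_out, 2)
#   action = Normal(mean, exp(log_std)).sample()
# so the model never outputs "probabilities" for continuous actions.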
Here is the model code:

class TorchFrameStackingCartPoleModelL(TorchModelV2, nn.Module):
    """A simple FC model that takes the last n observations as input."""

    def __init__(self,
                 obs_space,
                 action_space,
                 num_outputs,
                 model_config,
                 name,
                 num_frames=3):
        nn.Module.__init__(self)
        super(TorchFrameStackingCartPoleModelL, self).__init__(
            obs_space, action_space, None, model_config, name)

        self.num_frames = num_frames
        self.num_outputs = num_outputs

        # Construct actual (very simple) FC model.
        #assert len(obs_space.shape) == 1
        # Per frame: 21 obs dims (18 + 3 from the Tuple space) + 4 action dims + 1 reward.
        in_size = self.num_frames * (21 + 4 + 1)
        self.layer1 = SlimFC(
            in_size=in_size, out_size=256, activation_fn="relu")
        self.layer2 = SlimFC(in_size=256, out_size=256, activation_fn="relu")
        self.out = SlimFC(
            in_size=256, out_size=self.num_outputs, activation_fn="linear")
        self.values = SlimFC(in_size=256, out_size=1, activation_fn="linear")

        self._last_value = None

        self.view_requirements["prev_n_obs"] = ViewRequirement(
            data_col="obs",
            shift="-{}:0".format(num_frames - 1),
            space=obs_space)
        self.view_requirements["prev_n_rewards"] = ViewRequirement(
            data_col="rewards", shift="-{}:-1".format(self.num_frames))
        self.view_requirements["prev_n_actions"] = ViewRequirement(
            data_col="actions",
            shift="-{}:-1".format(self.num_frames),
            space=self.action_space)

    def forward(self, input_dict, states, seq_lens):
        obs = input_dict["prev_n_obs"]
        obs = torch.reshape(obs,
                            [-1, 21 * self.num_frames])
        rewards = torch.reshape(input_dict["prev_n_rewards"],
                                [-1, self.num_frames])
        actions = input_dict["prev_n_actions"]
        actions = torch.reshape(actions,
                                [-1, self.num_frames * 4])
        input_ = torch.cat([obs, actions, rewards], dim=-1)
        features = self.layer1(input_)
        features = self.layer2(features)
        out = self.out(features)
        self._last_value = self.values(features)
        return out, states

    def value_function(self):
        return torch.squeeze(self._last_value, -1)
File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\policy\torch_policy.py", line 376, in _compute_action_helper
    action_dist = dist_class(dist_inputs, self.model)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\models\torch\torch_action_dist.py", line 186, in __init__
    self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\distributions\normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\distributions\distribution.py", line 53, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter scale has invalid values
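
For context: this ValueError from torch.distributions means the Normal's scale, i.e. torch.exp(log_std), contains invalid values, typically NaNs that propagated from the model output or its inputs. A quick way to narrow it down is to check the policy head's output before returning it from forward(). This is a debugging sketch, not part of the original script:

import torch

def assert_finite(name, tensor):
    # Illustrative debugging helper: NaN values here are what later surface as
    # "ValueError: The parameter scale has invalid values".
    if torch.isnan(tensor).any() or torch.isinf(tensor).any():
        raise ValueError(f"{name} contains NaN/Inf values")

# Inside the model's forward(), just before `return out, states`:
#     assert_finite("model output", out)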

Any chance you can file an issue against RLlib with a minimal reproducible script?
We will take a look. Thanks.

Thank you! @gjoliver
What I’ve done is change the program to use continuous actions. Everything else stays the same.
Note that to change the discrete actions into continuous actions, line 105 of gym.envs.classic_control.CartPoleEnv has to be commented out.
OS: win10
ray: 1.8.0

from gym.spaces import Box
import numpy as np
from gym import spaces, logger
from gym.envs.classic_control import CartPoleEnv


class StatelessCartPole(CartPoleEnv):
    """Partially observable variant of the CartPole gym environment.

    https://github.com/openai/gym/blob/master/gym/envs/classic_control/
    cartpole.py

    We delete the x- and angular velocity components of the state, so that it
    can only be solved by a memory enhanced model (policy).
    """

    def __init__(self, config=None):
        super().__init__()

        # Fix our observation-space (remove 2 velocity components).
        high = np.array(
            [
                self.x_threshold * 2,
                self.theta_threshold_radians * 2,
            ],
            dtype=np.float32)

        self.observation_space = Box(low=-high, high=high, dtype=np.float32)
        self.action_space = spaces.Box(low=-high, high=high, dtype=np.float32)

    def step(self, action):
        next_obs, reward, done, info = super().step(1)
        # next_obs is [x-pos, x-veloc, angle, angle-veloc]
        return np.array([next_obs[0], next_obs[2]]), reward, done, info

    def reset(self):
        init_obs = super().reset()
        # init_obs is [x-pos, x-veloc, angle, angle-veloc]
        return np.array([init_obs[0], init_obs[2]])

from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.view_requirement import ViewRequirement
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.utils.tf_ops import one_hot
from ray.rllib.utils.torch_ops import one_hot as torch_one_hot

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class TorchFrameStackingCartPoleModel(TorchModelV2, nn.Module):
    """A simple FC model that takes the last n observations as input."""

    def __init__(self,
                 obs_space,
                 action_space,
                 num_outputs,
                 model_config,
                 name,
                 num_frames=3):
        nn.Module.__init__(self)
        super(TorchFrameStackingCartPoleModel, self).__init__(
            obs_space, action_space, None, model_config, name)

        self.num_frames = num_frames
        self.num_outputs = num_outputs

        # Construct actual (very simple) FC model.
        #assert len(obs_space.shape) == 1
        # Per frame: 2 obs dims + 2 action dims + 1 reward.
        in_size = self.num_frames * (2 + 2 + 1)
        self.layer1 = SlimFC(
            in_size=in_size, out_size=256, activation_fn="relu")
        self.layer2 = SlimFC(in_size=256, out_size=256, activation_fn="relu")
        self.out = SlimFC(
            in_size=256, out_size=self.num_outputs, activation_fn="linear")
        self.values = SlimFC(in_size=256, out_size=1, activation_fn="linear")

        self._last_value = None

        self.view_requirements["prev_n_obs"] = ViewRequirement(
            data_col="obs",
            shift="-{}:0".format(num_frames - 1),
            space=obs_space)
        self.view_requirements["prev_n_rewards"] = ViewRequirement(
            data_col="rewards", shift="-{}:-1".format(self.num_frames))
        self.view_requirements["prev_n_actions"] = ViewRequirement(
            data_col="actions",
            shift="-{}:-1".format(self.num_frames),
            space=self.action_space)

    def forward(self, input_dict, states, seq_lens):
        obs = input_dict["prev_n_obs"]
        obs = torch.reshape(obs,
                            [-1, 2 * self.num_frames])
        rewards = torch.reshape(input_dict["prev_n_rewards"],
                                [-1, self.num_frames])
        actions = input_dict["prev_n_actions"]
        actions = torch.reshape(actions,
                                [-1, self.num_frames * 2])
        #print(obs, actions, rewards)
        input_ = torch.cat([obs, actions, rewards], dim=-1)
        features = self.layer1(input_)
        features = self.layer2(features)
        out = self.out(features)
        self._last_value = self.values(features)
        return out, states

    def value_function(self):
        return torch.squeeze(self._last_value, -1)


import argparse
import numpy as np

import ray
from ray.rllib.agents.ppo import PPOTrainer

from ray.rllib.models.catalog import ModelCatalog
from ray.rllib.utils.framework import try_import_tf
from ray.rllib.utils.test_utils import check_learning_achieved
from ray import tune

tf1, tf, tfv = try_import_tf()

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run",
    type=str,
    default="PPO",
    help="The RLlib-registered algorithm to use.")
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="torch",
    help="The DL framework specifier.")
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    default=50,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    default=200000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=150.0,
    help="Reward at which we stop training.")

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init(num_cpus=3)

    num_frames = 3

    ModelCatalog.register_custom_model(
        "frame_stack_model", FrameStackingCartPoleModel
        if args.framework != "torch" else TorchFrameStackingCartPoleModel)

    config = {
        "env": StatelessCartPole,
        "model": {
            "vf_share_layers": True,
            "custom_model": "frame_stack_model",
            "custom_model_config": {
                "num_frames": num_frames,
            },

            # To compare against a simple LSTM:
            # "use_lstm": True,
            # "lstm_use_prev_action": True,
            # "lstm_use_prev_reward": True,

            # To compare against a simple attention net:
            # "use_attention": True,
            # "attention_use_n_prev_actions": 1,
            # "attention_use_n_prev_rewards": 1,
        },
        "num_sgd_iter": 5,
        "vf_loss_coeff": 0.0001,
        "framework": args.framework,
    }

    stop = {
        "training_iteration": args.stop_iters,
        "timesteps_total": args.stop_timesteps,
        "episode_reward_mean": args.stop_reward,
    }
    results = tune.run(
        args.run, config=config, stop=stop, verbose=2, checkpoint_at_end=True)

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    checkpoints = results.get_trial_checkpoints_paths(
        trial=results.get_best_trial("episode_reward_mean", mode="max"),
        metric="episode_reward_mean")

    checkpoint_path = checkpoints[0][0]
    trainer = PPOTrainer(config)
    trainer.restore(checkpoint_path)

    # Inference loop.
    env = StatelessCartPole()

    # Run manual inference loop for n episodes.
    for _ in range(10):
        episode_reward = 0.0
        reward = 0.0
        action = 0
        done = False
        obs = env.reset()
        while not done:
            # Create a dummy action using the same observation n times,
            # as well as dummy prev-n-actions and prev-n-rewards.
            action, state, logits = trainer.compute_single_action(
                input_dict={
                    "obs": obs,
                    "prev_n_obs": np.stack([obs for _ in range(num_frames)]),
                    "prev_n_actions": np.stack([0 for _ in range(num_frames)]),
                    "prev_n_rewards": np.stack(
                        [1.0 for _ in range(num_frames)]),
                },
                full_fetch=True)
            obs, reward, done, info = env.step(action)
            episode_reward += reward

        print(f"Episode reward={episode_reward}")

    ray.shutdown()

Running this script produces the following error:

File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\evaluation\sampler.py", line 103, in next
    batches = [self.get_data()]
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\evaluation\sampler.py", line 233, in get_data
    item = next(self._env_runner)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\evaluation\sampler.py", line 622, in _env_runner
    eval_results = _do_policy_eval(
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\evaluation\sampler.py", line 1036, in _do_policy_eval
    policy.compute_actions_from_input_dict(
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\policy\torch_policy.py", line 302, in compute_actions_from_input_dict
    return self._compute_action_helper(input_dict, state_batches,
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\utils\threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\policy\torch_policy.py", line 376, in _compute_action_helper
    action_dist = dist_class(dist_inputs, self.model)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\rllib\models\torch\torch_action_dist.py", line 186, in __init__
    self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\distributions\normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\distributions\distribution.py", line 53, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values

Hi @gjoliver, can you take a look for me? I guess the error is in the model’s forward(). Thank you very much.

I don’t really understand what you are trying to do, because the underlying CartPoleEnv doesn’t really understand a continuous action.
That’s probably why you have:

next_obs, reward, done, info = super().step(1)

So the whole thing is not gonna learn anything.
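
One way to make the environment actually respond to the continuous action (a sketch only, not something from the original script) would be to threshold the action inside StatelessCartPole.step instead of hard-coding 1:

    def step(self, action):
        # Sketch: map the continuous action to CartPole's two discrete
        # force directions by thresholding its first component.
        discrete_action = 1 if float(np.asarray(action).flat[0]) > 0.0 else 0
        next_obs, reward, done, info = super().step(discrete_action)
        # next_obs is [x-pos, x-veloc, angle, angle-veloc]
        return np.array([next_obs[0], next_obs[2]]), reward, done, info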

When I tried to run your script, I also ran into issues with prev_n_actions. I had to change

"prev_n_actions": np.stack([0 for _ in range(num_frames)]),

into

"prev_n_actions": np.stack([0 for _ in range(num_frames * 2)]),

before it would run. Otherwise, you get an exception when you try to reshape actions in the model’s forward().
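
A shape-consistent alternative (a sketch, assuming the 2-dim Box action space defined in the script above) is to stack one zero action vector per frame, which reshapes cleanly to [-1, num_frames * 2]:

"prev_n_actions": np.zeros((num_frames, 2), dtype=np.float32),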

But again, the prev_n_obs seems to just be current obs duplicated num_frames times, and prev_n_actions and prev_n_rewards are all hardcoded numbers, so I don’t know what it’s doing.
But at least the training runs after I made the fix.

Thank you very much for your help. Because my real environment is complicated, I rewrote it as this minimal reproducible script. I am sorry for not explaining it clearly.

Please open an issue against RLlib with your repro script at https://github.com/ray-project/ray/issues.

Thanks