Frame Stacking W/ Policy_Server + Policy_Client

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi all,

I am experimenting with an AI using the LSTM and attention-net wrappers. However, I want to compare the results against a simpler frame-stacking approach.

I have read through https://github.com/ray-project/ray/blob/8e680c483ce326cefc62e44f68ab1a6948b1c3d2/doc/source/rllib/rllib-sample-collection.rst
and
https://docs.ray.io/en/latest/rllib/rllib-sample-collection.html#trajectory-view-api
but I still don't quite understand how to tell the PPO config to simply stack the last x observations.

My policy_server:

import ray
from ray.rllib.env import PolicyServerInput
from ray.rllib.algorithms.ppo import PPOConfig

import numpy as np
import argparse
from gymnasium.spaces import MultiDiscrete, Box

ppo_config = PPOConfig()

parser = argparse.ArgumentParser(description='Optional app description')
parser.add_argument('-ip', type=str, help='IP of this device')

parser.add_argument('-checkpoint', type=str, help='location of checkpoint to restore from')

args = parser.parse_args()

def _input(ioctx):
    return PolicyServerInput(
        ioctx,
        args.ip,
        55556,
    )


x = 320
y = 240


# kl_coeff -> default 0.2
# vf_loss_coeff used to be 0.01??
# "entropy_coeff": 0.00005,
# "clip_param": 0.1,
ppo_config.gamma = 0.998  # default 0.99
ppo_config.lambda_ = 0.99  # default 1.0???
ppo_config.kl_target = 0.01  # default 0.01
ppo_config.rollout_fragment_length = 512
ppo_config.train_batch_size = 10240
ppo_config.sgd_minibatch_size = 256
ppo_config.num_sgd_iter = 2 # default 30???
ppo_config.lr = 3.5e-5  # 5e-5
ppo_config.model = {
    # Share layers for value function. If you set this to True, it's
    # important to tune vf_loss_coeff.
    "vf_share_layers": False,

    #"use_lstm": True,
    #"max_seq_len": 32,
    #"lstm_cell_size": 128,
    #"lstm_use_prev_action": True,

     'use_attention': True,
     "max_seq_len": 64,
     "attention_num_transformer_units": 1,
     "attention_dim": 256,
     "attention_memory_inference": 128,
     "attention_memory_training": 128,
     "attention_num_heads": 8,
     "attention_head_dim": 32,
     "attention_position_wise_mlp_dim": 128,
     "attention_use_n_prev_actions": 0,
     "attention_use_n_prev_rewards": 0,
     "attention_init_gru_gate_bias": 2.0,

    "conv_filters": [],
    #"conv_activation": "relu",
    #"post_fcnet_hiddens": [512],
    #"post_fcnet_activation": "relu"
}
ppo_config.batch_mode = "complete_episodes"
ppo_config.simple_optimizer = True
ppo_config.num_gpus = 1


ppo_config.rollouts(num_rollout_workers=0, enable_connectors=False)

ppo_config.offline_data(input_=_input)

ppo_config.env = None
ppo_config.observation_space = Box(low=0, high=1, shape=(y, x, 1), dtype=np.float32)
ppo_config.action_space = MultiDiscrete(
    [
        2,  # W
        2,  # A
        2,  # S
        2,  # D
        2,  # Space
        2,  # H
        2,  # J
        2,  # K
        2  # L
    ]
)
ppo_config.env_config = {
    "sleep": True,
    'replayOn': False
}
ppo_config.framework_str = 'torch'
ppo_config.log_sys_usage = False
ppo_config.compress_observations = True
ppo_config.shuffle_sequences = False

ray.init(num_cpus=4, num_gpus=1, log_to_driver=False)

from ray import tune

name = "" + args.checkpoint
print(f"Starting: {name}")

tune.run("PPO",
         resume='AUTO',
         config=ppo_config.to_dict(),
         name=name, keep_checkpoints_num=None, checkpoint_score_attr="episode_reward_mean",
         max_failures=1,
         checkpoint_freq=5, checkpoint_at_end=True)

Thanks!

Hi @Denys_Ashikhin ,

if you want to use frame stacking, you need to implement it in your model, i.e. write a custom model that does this via the Trajectory View API, as shown for example here.
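
For illustration, here is a trimmed-down sketch of the pattern from that example, keeping only the observation stacking (the class name, layer sizes, and the num_frames default are placeholders of mine, not the exact example code):

import numpy as np
import torch
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.view_requirement import ViewRequirement


class FrameStackingModel(TorchModelV2, nn.Module):
    """Minimal frame-stacking model: flattens the last num_frames observations."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name, num_frames=4):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.num_frames = num_frames
        self.obs_dim = int(np.prod(obs_space.shape))
        self.trunk = nn.Sequential(
            nn.Linear(self.num_frames * self.obs_dim, 256),
            nn.ReLU(),
        )
        self.logits = nn.Linear(256, num_outputs)
        self.value_head = nn.Linear(256, 1)
        self._value = None
        # The Trajectory View API part: ask RLlib to hand the model the last
        # num_frames observations under the key "prev_n_obs".
        self.view_requirements["prev_n_obs"] = ViewRequirement(
            data_col="obs", shift="-{}:0".format(num_frames - 1), space=obs_space
        )

    def forward(self, input_dict, state, seq_lens):
        # "prev_n_obs" comes in as [B, num_frames, *obs_space.shape].
        obs = input_dict["prev_n_obs"].float()
        obs = torch.reshape(obs, [-1, self.num_frames * self.obs_dim])
        features = self.trunk(obs)
        self._value = self.value_head(features).squeeze(1)
        return self.logits(features), state

    def value_function(self):
        return self._value

You would then register it via ModelCatalog.register_custom_model() and point "custom_model" (plus "custom_model_config": {"num_frames": ...}) at it in your model config. For image observations you would of course replace the Linear trunk with convolutions, which is where the VisionNet comes in.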

I see. If I redefine my CNN/FC model as a custom model, is it possible to have it auto-wrapped by the LSTM / attention net? Or do I need to do that part in the model as well?

Thanks!

@Denys_Ashikhin, that will not be possible together with frame stacking out-of-the-box. The reason is that, in this case, it would mess with the view requirements if you simply wrapped it with the LSTMWrapper. But the latter gives you a good example of how you could introduce recurrence into your model.

That's okay, with frame stacking the RNN properties were more of a curiosity.

I found this: ray/visionnet.py at master · ray-project/ray · GitHub,
which is basically what I would copy-paste as my starter model, since I am using a VisionNet as the basis for the PPO.

However, I cannot quite figure out how to work the num_frames logic from ray/trajectory_view_utilizing_models.py at master · ray-project/ray · GitHub into it.

In the trajectory view example it uses in_size = self.num_frames * (obs_space.shape[0] + action_space.n + 1), but the VisionNet uses:

in_size = [w, h]
for out_channels, kernel, stride in filters[:-1]:
    padding, out_size = same_padding(in_size, kernel, stride)
    layers.append(
        SlimConv2d(
            in_channels,
            out_channels,
            kernel,
            stride,
            padding,
            activation_fn=activation,
        )
    )
    in_channels = out_channels
    in_size = out_size

Which is completely different, and I'm not sure how to tackle that.

Not to mention the forward passes seem completely different between the two.

If possible, could you provide a sample that tweaks the VisionNet to essentially take num_frames for frame stacking, or at least something to get me started (and I can bug you some more as it goes)?

I would be very, very grateful!

@Denys_Ashikhin, what you could do is framestack the observations in the channel dimension: instead of using the shape (w, h, c, f) you go for (w, h, c x f). This way you should be able to simply use the VisionNet as is.
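
As a toy NumPy illustration of what I mean by merging frames into the channel axis (the sizes here are just examples):

import numpy as np

w, h, c, f = 320, 240, 1, 4            # width, height, channels, stacked frames
frames = np.zeros((w, h, c, f), dtype=np.float32)
# Merge channels and frames into one axis -> (w, h, c * f):
stacked = frames.reshape(w, h, c * f)
print(stacked.shape)                   # (320, 240, 4)

With c == 1 this is a plain concatenation of the frames; for c > 1 you would want to check how the reshape interleaves channels and frames.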

To make sure we are on the same page:
w = width
h = height
c = channels (usually 1 or 3, grayscale or RGB; in my case it's all grayscale)
f = filters (conv filters?)

I like the simplicity of your approach but I feel like a few more variables would have to be changed right?

I.e., (w, h, in_channels) = obs_space.shape → would in_channels become num_frames? Would I also need to change the obs_space to have a shape of (width, height, num_frames) in the environment?

Also:

in_size = [w, h]
for out_channels, kernel, stride in filters[:-1]:
    padding, out_size = same_padding(in_size, kernel, stride)

Can I just add a third value to in_size, à la in_size = [w, h, num_frames]?

As for out_channels, kernel, stride = filters[-1]: these values get filled based on the conv_filters passed in. I'm not sure whether this will be affected (along with any values relying on these later down the line), and if yes, how to change them.

Lastly, for the init (not even touching the forward :sweat_smile: ) I would need to add:

self.view_requirements["prev_n_obs"] = ViewRequirement(
    data_col="obs", shift="-{}:0".format(num_frames - 1), space=obs_space
)
self.view_requirements["prev_n_rewards"] = ViewRequirement(
    data_col="rewards", shift="-{}:-1".format(self.num_frames)
)
self.view_requirements["prev_n_actions"] = ViewRequirement(
    data_col="actions",
    shift="-{}:-1".format(self.num_frames),
    space=self.action_space,
)

Do I have to include all three, or can I just add the prev_n_obs line, assuming I don't want my model to train on previous rewards and actions?

Thanks again for sticking with me!!

@Denys_Ashikhin, almost :slight_smile: f = frames, i.e. the number of frames you want to include in your frame stack.

I think I see where you are going: instead of the VisionNet getting 4 separate observations to train on, we stack them up as if they were individual channels (which would typically be used for colour).

In that case, do I still need to edit my obs_space to be (width, height, frames)?

And all I would need to add is:

self.view_requirements["prev_n_obs"] = ViewRequirement(
    data_col="obs", shift="-{}:0".format(num_frames - 1), space=obs_space
)

At the bottom of the init?

If it's clicking correctly now, then I was overthinking it. Basically, changing my obs_space to have num_frames channels would set everything up correctly inside the base VisionNet. Then adding the trajectory view requirement would stack the images into the channels?

Or is this more manual, where I would stack up the images in the environment myself and specify obs_space=(width, height, num_frames) when sampling, without needing to change anything in the VisionNet in the first place?

@Denys_Ashikhin, you are getting there. You have to edit the observations: they will come as (w, h, c, f) or (f, w, h, c) (I am not sure anymore). You have to convert that for your model. And the view requirements are needed.

Of course you can also do the frame stacking in your environment, but the better design choice would be to do it in your algorithm.

What's throwing me off is that you are referencing 4 variables for the obs_space, but I only ever see 2 or 3, and my own looks like (240, 320, 1).

So if I’m understanding correctly, all I need to do is change my obs_space to (240, 320, num_frames). And this is done in my trainer file that instantiates everything.

Then inside the model (which I would essentially copy-paste), all I add is the one line for the trajectory view requirements?

So just those 2 minor changes?

Hi @Denys_Ashikhin, so this is even “nicer”: you have only a single channel (black-and-white), so you can just stack the last 4 frames together into the channel dimension.

Doing this should be possible via the view requirements. Inside the model you then have to transform the possible shape of (w, h, 1, f) into (w, h, f) to “trick” the VisionNet into working on your stacked frames like it would on color channels.

Hi @Lars_Simon_Zehnder,

Theoretically, I think I'm on board with you; practically, I'm still a tad lost. Based on the last message, I don't even have to touch the trainer file or the environment; it can all be done inside the VisionNet model.

Now as for “then to transform the possible shape of (w, h, 1, f) into (w, h, f)”:
I only found two places in the VisionNet that kind of resemble that:

  1. Inside the init
    ray/visionnet.py at 1f7058ee40b32c49b353e8e00806325744122d17 · ray-project/ray · GitHub

and

  2. Inside the forward pass
    ray/visionnet.py at 1f7058ee40b32c49b353e8e00806325744122d17 · ray-project/ray · GitHub

But sadly, once more, I'm not too sure how to do it properly. For 1, I am again assuming I need to change my obs_space from
ppo_config.observation_space = Box(low=0, high=1, shape=(y, x, 1), dtype=np.float32)
to
ppo_config.observation_space = Box(low=0, high=1, shape=(y, x, num_frames), dtype=np.float32)

I am guessing I don’t need to touch 2?

And then I add

self.view_requirements["prev_n_obs"] = ViewRequirement(
    data_col="obs", shift="-{}:0".format(num_frames - 1), space=obs_space
)

To the bottom of the init in visionnet.

Please let me know if that’s correct, or where I’m going wrong :sweat_smile:

@Denys_Ashikhin, so the view requirements look good to me. In regard to the reshaping: you want the observations coming into compute_action to be reshaped before they go into the model. So this should happen inside the model, in the forward() function.
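
Roughly, the reshaping part of such a forward() could look like the sketch below; the assumed [B, num_frames, H, W, 1] layout of "prev_n_obs" is something to verify once with a print:

# Hypothetical sketch of the reshaping inside forward() of a VisionNet-style
# model that has registered the "prev_n_obs" view requirement.
def forward(self, input_dict, state, seq_lens):
    obs = input_dict["prev_n_obs"].float()  # assumed [B, num_frames, H, W, 1]
    obs = obs.squeeze(-1)                   # [B, num_frames, H, W]
    # For grayscale input this is already the [B, C, H, W] layout torch
    # convolutions expect, with the frames acting as channels, so no permute
    # is needed; the conv stack must then be built with
    # in_channels = num_frames instead of 1.
    self._features = obs
    conv_out = self._convs(self._features)
    # ... then continue as the original VisionNet forward() does
    # (flatten / logits / value branch).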

Hi @Lars_Simon_Zehnder,

Looking back on your comment:
they will come as (w, h, c, f) or (f, w, h, c) (I am not sure anymore). You have to convert that for your model.
and
compute_action to be reshaped before they go into the model. So this should happen inside the model, in the forward() function.

Leads me to believe you were talking about this line: ray/visionnet.py at d86624502b4e1f0acb4d68957fca1188947d971f · ray-project/ray · GitHub

# Permuate b/c data comes in as [B, dim, dim, channels]:
self._features = self._features.permute(0, 3, 1, 2)

and after that is the CNN forward pass:
conv_out = self._convs(self._features)

Focusing on: self._features = self._features.permute(0, 3, 1, 2)
we changed the order to [Batches (for us it will be filters?), channels, dimension, dimension].
So I don’t think this is it?

So finally, I am left with self._features = input_dict["obs"].float() → which essentially would be batch_size x (240, 320, 1).
This is where I would have to make it (batch_size / num_frames) x (240, 320, num_frames)?
(This also relates to “then to transform the possible shape of (w, h, 1, f) into (w, h, f)”.)

@Lars_Simon_Zehnder any further advice/hints?

You could do something like this to stack visual observations, using from collections import deque.

That solution is not perfect; you need to swap back to uint8.
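
A rough sketch of that idea on the environment/client side (the names and the episode-start padding are my own choices):

from collections import deque
import numpy as np

num_frames = 4
# Keep only the most recent num_frames frames; maxlen drops the oldest one.
frame_buffer = deque(maxlen=num_frames)

def reset_stack():
    """Call this at the start of every episode."""
    frame_buffer.clear()

def stack(new_frame):
    """new_frame: (H, W, 1) float32 in [0, 1]. Returns (H, W, num_frames)."""
    frame_buffer.append(new_frame)
    # At episode start, pad with copies of the first frame until full.
    while len(frame_buffer) < num_frames:
        frame_buffer.appendleft(new_frame)
    return np.concatenate(list(frame_buffer), axis=-1)

If you go this route, the observation_space on the server would be Box(low=0, high=1, shape=(y, x, num_frames), dtype=np.float32), and the client logs stack(frame) instead of the raw frame.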

Sorry for the late response; some other issues I was narrowing down with Ray kept me busy. But can you provide a bit more example code for the changes necessary to make this work in Ray?