Changing add_time_dimension logic

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

The function add_time_dimension is responsible for creating training batches for recurrent networks in RLlib. It splits the samples according to the max_seq_len parameter. For example, with 100 samples and max_seq_len set to 30, we obtain 4 sequences of length 30 (the last 20 timesteps are padding). Although the current approach is not incorrect, I believe it might be better to build the batch so that every sample gets its own sequence of length max_seq_len. In that case, with 100 samples and max_seq_len set to 30, the batch shape would be [100, 30, f] instead of [4, 30, f].
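To make the shapes concrete, here is a minimal numpy sketch of the two layouts (just an illustration of the shapes, not the actual add_time_dimension code; f is the feature size):

```python
import numpy as np

# 100 timesteps with f features, max_seq_len = 30.
samples, f, max_seq_len = 100, 8, 30
obs = np.random.randn(samples, f)

# Current behavior: pad to a multiple of max_seq_len and chunk -> [4, 30, f].
num_chunks = int(np.ceil(samples / max_seq_len))        # 4
padded = np.zeros((num_chunks * max_seq_len, f))
padded[:samples] = obs                                   # last 20 rows stay as padding
chunked = padded.reshape(num_chunks, max_seq_len, f)     # shape (4, 30, f)

# Proposed behavior: one sliding window per timestep -> [100, 30, f].
history = np.zeros((max_seq_len - 1, f))                 # zero history before t = 0
extended = np.concatenate([history, obs], axis=0)
windows = np.stack(
    [extended[t:t + max_seq_len] for t in range(samples)]
)                                                        # shape (100, 30, f)
print(chunked.shape, windows.shape)
```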

There are trade-offs to consider in this approach. While it may make the training process smoother because all steps are treated equally, similar to non-recurrent nets, it would also increase the computational cost of the model.
I just wanna know your thoughts on this!

How would you implement the RNN in this case? This seems like it would significantly increase the memory and compute required for the model. Do you know if this would improve the learning throughput?

To implement the RNN in this case, you would change the model's view requirements to add the past observations (max_seq_len of them) to the model input, and then reshape all observations to [samples, seq_len, features].
Yes, it significantly increases memory, but my question is: does it make the model learn better? I don't know whether it increases learning throughput, but intuitively it should, because you are passing more sequences to the model (100 vs. 4 in my example), so it should be worth the extra memory.
The very important point for me is that in the RLlib implementation the model learns to act on observation sequences whose lengths range from zero up to max_seq_len. That is like adding a lot of noise to the model. Noise can be beneficial, as when we add noise in image classification, but it is a double-edged sword and can also disrupt learning.
I want to know whether there is any problem, besides memory, with the implementation I proposed, because it makes more sense to me. Thanks.
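Just to illustrate the reshape I mean, here is a rough torch sketch (not RLlib code): every timestep arrives with its own full-length window of past observations.

```python
import torch
import torch.nn as nn

# Sliding-window layout: [samples, seq_len, features].
batch, max_seq_len, f, hidden = 100, 30, 8, 64
windows = torch.randn(batch, max_seq_len, f)

lstm = nn.LSTM(input_size=f, hidden_size=hidden, batch_first=True)
out, _ = lstm(windows)          # out: [100, 30, 64]
logits = out[:, -1, :]          # use only the last step of each window
print(logits.shape)             # torch.Size([100, 64])
```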

Hi @hossein836,

If I understand your suggestion correctly, I think you can already do this by setting your training batch size to 100 * max_seq_len. You can actually end up with a few more than 100 in the batch dimension if some of the episodes are shorter than max_seq_len.
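For example, roughly something like this (a sketch with the plain config dict; the numbers are just the ones from your example):

```python
# Rough sketch: collect ~100 sequences of length max_seq_len per train batch.
config = {
    "model": {
        "use_lstm": True,
        "max_seq_len": 30,
    },
    # 100 sequences x 30 timesteps; episodes shorter than max_seq_len
    # produce extra (padded) sequences, so the batch dim can exceed 100.
    "train_batch_size": 100 * 30,
}
```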

Hi @mannyv
You are correct, but that was not my point.
I mean this:

The second method is what I proposed

“I don’t want to increase the batch size, as it would also increase RAM usage. My method focuses on gaining more knowledge within the same batch size instead of changing it. However, the downside of this approach is that it will increase GPU RAM usage.”

Hi @hossein836,

What you want to do is fair, but it should not be the default behavior of LSTMs in RLlib. You can achieve custom sample collection by setting a custom trajectory view for your custom model and get what you have in mind. I don't know the exact syntax, but I believe it should be possible via this API.
Reference: Sample Collections and Trajectory Views — Ray 2.5.1
Something like:
{"slided_obs": ViewRequirement("obs", shift="-3:-1", used_for_training=True)}

Thanks a lot, dear @kourosh :wink:
I agree with your proposed syntax; that should be fine. However, while I appreciate your hard work, from the perspective of reinforcement learning theory I have these questions:

1- Do you also believe my method can potentially increase learning throughput?
2- Do you also believe my method can potentially decrease the instability of learning?
3- What are the advantages of the RLlib implementation compared to mine? Is it just hardware restrictions, or is there some RL theory that I don't understand (hence your saying "it shouldn't be the default")?

Dear @hossein836,

You are not really increasing throughput this way; you are basically increasing the gradient update intensity. This means that you are using more samples to update the network. Whether that works better in practice really depends on the particular use case, so the short answer is that you have to try it and see. My hypothesis is that it won't really help in general, because on-policy methods like PPO need to move on from their bad initial randomness, and training them with more intensity while the policy is not much better than random can actually cause convergence issues.

The implementation in RLlib is a simple extension of the basic algorithms and does not introduce these possible unwanted effects of increased training intensity.


Excellent, @kourosh!
I understand your point. I was considering using a buffer to address the convergence issue you mentioned, specifically sampling x experiences out of y episodes. As far as I know, there is currently no buffer implementation for PPO in RLlib. However, OpenAI Five, which is a PPO agent with an RNN model, used a buffer. If I don't find any information on how to use a buffer with PPO, I will create a new topic.
Many thanks :slightly_smiling_face: