Delayed Learning Due To Long Episode Lengths

Hi all,

I apologize if the title isn’t the most accurate. I am currently training an agent whose episodes can take anywhere from 1000 to 3000 timesteps. I am using the policy client/server setup with a PPO trainer with "batch_mode": "truncate_episodes" and "train_batch_size": 4000. Going with a higher train_batch_size leads to an OOM crash on the server.

My worry/concern: Having 5 agents play the game means that there will be a complete training cycle without any episodes being completed. This means that it will train, then once the 5 agents are done with their games using the previous model, a new training cycle will begin, except it will be based on the results of the previous iteration, causing weird results.

My questions:

  1. Is my concern valid?
  2. How would batch_mode: truncate_episodes vs complete_episodes affect training?
  3. Suppose the client is on iteration 1 but the server has gone through an iteration and is on iteration 2. If the client finishes its episode and pushes, does the server discard that episode since it is based on the old policy? Does this change if the client is halfway through an episode when it gets the new iteration?

Can you elaborate on your client/server setup?
My understanding is that PPO is an on-policy algorithm. So as long as the sample_async config is set to False (which is the default), you are not going to get any training frames collected from previous iterations.

In terms of the difference between “truncate episodes” and “complete episodes”, I find this image explains things pretty well: https://docs.ray.io/en/latest/_images/rllib-batch-modes.svg

Hi @Denys_Ashikhin,

There is one misunderstanding about training revealed in question 3. If you are using the “truncate_episodes” batch_mode, which I think you should for your training, PPO does not wait until an episode ends to train. Instead it collects “train_batch_size” samples from the environment and then learns on those samples. You do not need to worry about the relationship between episodes and training time except in two cases. Case 1 is that you are using complete_episodes and the length of an episode is much larger than train_batch_size. In that case you will be training on samples taken from a previous version of the policy. Case 2 is that “rollout_fragment_length” * “num_workers” is much larger than train_batch_size. The same issue happens there.
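To make the two cases concrete, here is a minimal sketch that just computes the two ratios. The function name and the returned dictionary keys are made up for illustration and are not part of RLlib; the numbers plugged in are the ones from this thread (an episode of about 1,100 steps, 4 workers, fragments of 200 steps).

```python
def staleness_ratios(batch_mode, episode_len, train_batch_size,
                     rollout_fragment_length, num_workers):
    """Return the ratios behind the two cases; values well above 1 hint at stale samples.

    Hypothetical helper for illustration only, not an RLlib function.
    """
    ratios = {
        # Case 2: all workers' fragments together versus one training batch.
        "fragments_vs_batch": rollout_fragment_length * num_workers / train_batch_size,
    }
    if batch_mode == "complete_episodes":
        # Case 1 only applies when whole episodes are reported at once.
        ratios["episode_vs_batch"] = episode_len / train_batch_size
    return ratios


ratios = staleness_ratios("truncate_episodes", 1100, 4000, 200, 4)
print(ratios)  # {'fragments_vs_batch': 0.2}
```

With these settings neither ratio is anywhere near "much larger than 1", so neither case should apply.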

There is one potential issue in your case since you are using the policy server/client setup with inference_mode="local". In that case, as you know, the client holds its own copy of the model whose weights are updated periodically. The server does not push any information to the client, which means that the client does not know exactly when the model weights have changed. All information from the server has to be requested by the client. It is possible, and likely, that you will sometimes generate some actions using the old version of the weights. In practice, if you have a relatively small update interval, I doubt you will have an issue from this.

You could avoid this by using remote rather than local inference because then the server will always produce the actions based on the current model weights. But if you do that you will now have to deal with potentially high communication costs.

Hope that helps. I know there was a lot there so let me know if any of it is confusing.


Your last point clarifies the root of my issue/concern. If the client does generate training data using old weights and later submits that to the server, the server will do an epoch with that data thinking it was generated with the latest weights, correct?

Moreover, my episodes are about ~1,100 steps for now. However, each episode takes about 20 minutes. So my next concern is this: even if I use remote inference and eat the communication delays, suppose we were on policy 1 for 15 minutes, then the server updates to policy 2, and a few minutes later the episode ends. Would it assume the rewards for that episode to be entirely based on policy 2 even though 75% of it was done using policy 1?

The default update interval for the policy client is 10 seconds. Using that value as an example, there should be at most 10 seconds worth of actions that are generated by the previous version of the policy in the next training update.
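As a rough back-of-the-envelope sketch, that 10-second bound can be turned into a step count. This assumes a steady pace of about 1,100 steps per 20-minute episode (the figures quoted earlier in the thread); the helper name is made up for illustration.

```python
def max_stale_steps(steps_per_episode, episode_minutes, update_interval_s=10.0):
    """Upper bound on steps one client takes between two weight refreshes.

    Assumes steps are spread evenly over the episode (an assumption, not measured).
    """
    steps_per_second = steps_per_episode / (episode_minutes * 60.0)
    return steps_per_second * update_interval_s


stale = max_stale_steps(1100, 20)
print(f"about {stale:.1f} steps per client per weight refresh")
```

Under those assumptions that is roughly 9 steps per client, which is small next to a 4000-step batch.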

My bad, I forgot to add that I have the training batch size set to about ~4000 steps, meaning that about 4 games need to finish before training can occur. I host at most 3-4 machines (policy_clients) running at once. This means that, in theory, I could be on policyA weights for 20-60 minutes before a training cycle starts (which takes ~2 minutes).
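A quick sketch of how long it takes to fill one batch with these numbers. It assumes truncate_episodes, so steps count toward the batch as they arrive rather than when a game finishes, and a steady pace of about 1,100 steps per 20 minutes per client. The helper name is hypothetical.

```python
def minutes_to_fill_batch(train_batch_size, num_clients,
                          steps_per_episode, episode_minutes):
    """Minutes until the server has train_batch_size steps, assuming a steady step rate."""
    steps_per_minute = num_clients * steps_per_episode / episode_minutes
    return train_batch_size / steps_per_minute


for n in (3, 4):
    print(n, "clients:", round(minutes_to_fill_batch(4000, n, 1100, 20), 1), "minutes")
```

With 3 to 4 clients this comes out at roughly 18 to 24 minutes per batch under these assumptions. With complete_episodes the wait would instead depend on when whole games finish.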

And each episode can vary in length by 5-20 minutes as well (depending on how well the AI does). This creates the potential for clientA to be playing for 15 minutes using weightsA. During that time, clientB finishes its episode and submits it to the server. That is enough for the server to start training, after which policyB is collected by all the clients.
Now I have clientA still playing that same episode, but 17 minutes in it changes to policyB and a few minutes later it finishes its episode. My question here is:

  1. Using truncate_episodes: the server gets the last subset of the episode as well as the final rewards from the episode completing (I hand out a lot of rewards at the end based on how many rounds it survived and its final placement). However, will the server assume the entire episode (20 minutes) was played using policyB even though most of it was played using policyA?
  2. Using complete_episodes: the server gets the entire episode, except 3/4 of it was played using policyA and the rest using policyB, BUT would it train as if it was on policyB for the entire episode?
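The two scenarios can be put into numbers with a toy model. This is only a sketch: it assumes fragments of 200 steps aligned to the start of the episode, an episode of 1,100 steps, and a weight switch at step 935 (17 of 20 minutes). The function and its parameters are hypothetical, not anything RLlib exposes.

```python
def old_policy_share(episode_steps, switch_step, batch_mode,
                     rollout_fragment_length=200):
    """Share of old-policy steps in the chunk that contains the end of the episode.

    Toy model: fragments are assumed to start at multiples of
    rollout_fragment_length counted from the start of the episode.
    """
    if batch_mode == "complete_episodes":
        chunk_start = 0  # the whole episode is reported as one chunk
    else:
        # truncate_episodes: only the final fragment carries the terminal reward
        chunk_start = ((episode_steps - 1) // rollout_fragment_length) * rollout_fragment_length
    chunk_len = episode_steps - chunk_start
    old_steps = max(0, min(switch_step, episode_steps) - chunk_start)
    return old_steps / chunk_len


print(old_policy_share(1100, 935, "complete_episodes"))  # 0.85
print(old_policy_share(1100, 935, "truncate_episodes"))  # 0.0
```

In this toy model the final truncated fragment is entirely policyB data, while the complete episode is 85% policyA data. The earlier truncated fragments were already sent to the server before the switch.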

@Denys_Ashikhin,

This is my understanding of how policy_client works from looking at the code.
Let's say you have the following:
4 workers (each a PolicyClient)
PPO rollout_fragment_length: 200
PPO train_batch_size: 4000

Each of the 4 workers will sample in parallel and asynchronously 200 steps and report that information to the policy_server. While it is doing that sampling it will request new weights from the policy server every 10 seconds.

When the policy_server has received 4000 samples (notice these are intermixed from each of your 4 environments) it will start training on that batch of samples.

It does not wait for one episode to end before starting the next episode. The server is getting new samples from all of the environments in 200-step chunks. Given the settings you mentioned, your first update of the weights should have about 4 episodes' worth of data in it.
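Here is a toy simulation of that accumulation, assuming the 4 clients deliver their 200-step fragments in strict round-robin order. Real arrival order would be asynchronous, and `fill_batch` is a made-up helper, not RLlib code.

```python
from collections import Counter


def fill_batch(num_clients=4, rollout_fragment_length=200, train_batch_size=4000):
    """Round-robin toy model: clients take turns delivering one fragment each."""
    batch = []  # entries are (client_id, first_step, last_step_exclusive)
    steps_sent = [0] * num_clients
    collected = 0
    client = 0
    while collected < train_batch_size:
        start = steps_sent[client]
        batch.append((client, start, start + rollout_fragment_length))
        steps_sent[client] += rollout_fragment_length
        collected += rollout_fragment_length
        client = (client + 1) % num_clients
    return batch


batch = fill_batch()
per_client = Counter(client for client, _, _ in batch)
print(len(batch), dict(per_client))  # 20 {0: 5, 1: 5, 2: 5, 3: 5}
```

In this model each client contributes 1,000 steps to the first batch, slightly less than one ~1,100-step episode, so training can start before any single game has finished.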

Here is the issue you are worrying about I think.
While the policy on the server is training, your 4 workers will continue sampling steps from the environment. The time between when learning starts and your policy_client updates its local weights after learning finishes is the time that you will be generating off-policy actions (actions from the previous model that will be used to update the current model). If this time is large then you might have a lot of actions like this but I don’t think that having a few samples like this here and there will matter.
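To get a feel for how large that window is, here is a rough worst-case estimate. It assumes the ~2 minute training time and ~1,100 steps per 20-minute episode mentioned in this thread, the 10-second client update interval, and that every client keeps stepping at a steady rate while the server trains. The helper name is hypothetical.

```python
def stale_fraction(train_seconds, update_interval_s, num_clients,
                   steps_per_second_per_client, train_batch_size):
    """Worst-case share of the next batch that was generated with the old weights."""
    # Old weights stay in use while the server trains, plus up to one
    # update interval before the client fetches the new weights.
    stale_window_s = train_seconds + update_interval_s
    stale_steps = stale_window_s * num_clients * steps_per_second_per_client
    return min(1.0, stale_steps / train_batch_size)


frac = stale_fraction(120, 10, 4, 1100 / (20 * 60), 4000)
print(f"up to {frac:.1%} of the next batch")
```

Under these assumptions roughly 12% of the next 4000-step batch could come from the previous weights. This is an estimate, not a measurement.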

As I think about your environment, I wonder if you really need a client/server setup at all. Could you not use the conventional RLlib way to train, using placement_groups to assign the workers with the environments to the nodes that can run the game, and put the driver that is updating weights but not playing games on the node that you train with?

Ngl, you completely lost me on the last paragraph. The thing I want to stress is that I am running DOTA Underlords in real time for training, so I can only have 1 instance per machine (unless I virtualize using VMs). My AI interfaces with the actual game, hence I assumed ExternalEnv is appropriate.

I feel like my worry is still valid, however. If a policy update is triggered 15 minutes into an episode, then either the final few steps + reward or the entire episode + reward would be reported as on-policy to the server even though it was off-policy for 15 out of the 20 minutes.

I see, you are asking if there is an issue with part of your episode using one model and another part using its updated version. I don’t know. In practice I think this is fairly common and PPO seems to work fine in that case for many environments. Good luck with yours 🙂.

You can ignore my external env comment. I was just trying to say that you could probably write it as a normal environment and avoid the policy client/server setup, but I could be totally wrong.

In that case, what would your gut say? Training time is stupid long, and seeing any meaningful results takes weeks of training for me.

Should I have it run with truncate or full episode reporting? Also, should I have local or remote inference? I would love your thoughts on why as well.