Memory Pressure Issue

ShadowDash · January 18, 2023, 7:20pm

Hi,
First - a big disclaimer: I’m only a beginner with RLlib. I did read the documentation and searched for examples online… but I seem to be stuck and would appreciate any help.

I’m using Google Colab (Pro with high RAM) and a custom gym environment to tackle a problem.
I have registered my environment and made a zip with all the required dependencies so it can be created remotely.
The environment’s action space and observation space are rather large (a MultiDiscrete observation vector of size 40,000 and a MultiDiscrete action vector of size 5,000).
I’m trying to run IMPALA with the following setup:

algo = (
    ImpalaConfig()
    .rollouts(num_rollout_workers=1, horizon=5000)
    .resources(num_gpus=0)
    .training(lr=0.0003, train_batch_size=4, replay_buffer_num_slots=2, minibatch_buffer_size=2)
    .environment(env='my_env')
    .build()
)

The task above fails at some point before completing; the worker is killed due to memory pressure with the following error:
(raylet) node_manager.cc:3097: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: eef8b40b7555cdc707e6e197c05e25175bed6decdd593e103a125ff6,…

Since I haven’t even started the training, I thought it might be something related to memory allocation, maybe because of the large spaces. That’s why I set small replay buffers just to test it, but it still runs OOM.

Any help is appreciated. What could have made it go OOM before any training? It was just the setup.

avnishn · January 19, 2023, 6:21pm

Hmm … Are you able to reproduce this issue if you are using any other gym environment?
Can you try using the RLlib random env (random_env.py - ray-project/ray - Sourcegraph) and see if you can use it to repro your issue? You should be able to configure the observation and action spaces via the environment function’s env_config parameter, using your environment’s observation and action spaces.

ShadowDash · January 21, 2023, 10:43pm

Hi,
Thanks for replying!
I have tried the following:
Using RandomEnv as is, with small discrete action and observation spaces, worked.
Then I tried changing just the action and observation spaces to a MultiDiscrete observation vector of size 40,000 and a MultiDiscrete action vector of size 5,000, exactly like in my problem, and the issue is reproduced.
It might be worth mentioning that I’m using Colab Pro with additional memory.
Is there any way for me to handle such large spaces?

Thanks again!

*Update - I tried to isolate the source of the issue further. The large observation space seems to work if the action space has a size of up to around 100, which is far from the 5,000 I need… any suggestions are appreciated.

Huaiwei_Sun · February 9, 2023, 7:44pm

@avnishn @sven1977 @arturn for ideas…

avnishn · February 9, 2023, 8:40pm

Can you share a link to your repro script or paste it here?

If I can reproduce it on my end, then I can potentially get you a fix / workaround.

ShadowDash · February 10, 2023, 5:07pm

@avnishn yes, thank you.
So the following will be enough to reproduce the issue:

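# Imports assumed for this repro (old gym-style API with 4-tuple step returns)
import numpy as np
from gym import Env, spaces
from ray.rllib.algorithms.impala import ImpalaConfig
from ray.tune.registry import register_env

class GymLearnEnv(Env):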

  def __init__(self, config=None):
    self.observation_space = spaces.Box(low=0, high=100, shape=(76032,))
    self.action_space = spaces.MultiDiscrete(np.full((5000), 10))
    self.count = 0

  def step(self, action):
    done=False
    self.count = self.count + 1
    reward = 0
    if self.count==500:
      done=True
    info = {}
    return self.observation_space.sample(), reward, done, info

  def render(self):
    pass

  def reset(self):
    self.count = 0
    return self.observation_space.sample()

class WrapperEnv(Env):

  def __init__(self, config=None):
      self.env = GymLearnEnv()
      self.reset_count = -1
      self.action_space = self.env.action_space
      self.observation_space = self.env.observation_space
  
  def reset(self):
      self.reset_count += 1
      return self.env.reset()
  
  def step(self, action):
      return self.env.step(action=action)

register_env("my_env", GymLearnEnv)

algo = (
    ImpalaConfig()
    .rollouts(num_rollout_workers=1, horizon=500)
    .resources(num_gpus=0)
    .training(lr=0.0001, replay_buffer_num_slots=5)
    .environment(env='my_env')
    .build()
)

The problem is the action_space size; you could use it in any environment you have and you’ll encounter the same issue.
I tried cutting it in half, but the memory still fills up to nearly the max, which makes training very slow with a single worker, and loading the policy afterwards with Policy.from_checkpoint() takes too long (around 5 minutes).
I’m using Colab Pro with the high-RAM setting, without a GPU for now.
My question is how to handle such a large action space? I’m clearly doing something wrong.
Thanks!
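
To get a feel for why the action_space size matters here, the following is a rough sketch of how large a plain fully connected policy would be for the spaces in the repro. It assumes the policy emits one logit per category of every sub-action (so sum(nvec) outputs in total) and uses a hypothetical hidden width of 256; neither the width nor the layer layout is taken from this thread.

```python
# Rough parameter count for a fully connected policy on the repro's spaces.
# HIDDEN is an assumed, hypothetical layer width, not a value from the thread.

OBS_DIM = 76032        # Box observation shape used in the repro
NVEC = [10] * 5000     # MultiDiscrete(np.full((5000), 10)) from the repro
HIDDEN = 256           # assumption for illustration only


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Assumption: one logit per category of each sub-action.
num_logits = sum(NVEC)
first_layer = dense_params(OBS_DIM, HIDDEN)
output_layer = dense_params(HIDDEN, num_logits)

print(f"logits per step: {num_logits}")
print(f"first layer params: {first_layer}")
print(f"output layer params: {output_layer}")
```

Under these assumptions, every stored timestep carries tens of thousands of logit values on top of the observation, and the first and last layers alone hold tens of millions of parameters.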

avnishn · February 18, 2023, 2:25am

I wasn’t able to get an exact answer for you, but I have a rough idea of what the problem is.

The train_batch_size in IMPALA defaults to 500. Additionally, there is a queue that holds the samples waiting to be trained on by a learner thread. On the back of an envelope, each batch of 500 is about 1 GB of data. That learner queue can easily fill up, since you are on a Google Colab machine where resources are not exactly plentiful, afaict (a quick Google search tells me that you get 25 GB of RAM).

So if this queue alone fills up, the memory usage will already be around 20 GB. Couple that with the size of the Ray object store, and that is probably why you are getting OOMs.
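
The back-of-the-envelope estimate above can be written out as a short calculation. This is only a sketch: the roughly 1 GB per 500-sample batch is the estimate quoted in this post, the queue length of 16 is an assumed default learner_queue_size, and linear scaling of batch memory with train_batch_size is an added simplifying assumption.

```python
# Rough estimate of how much RAM a full IMPALA learner queue can hold.
# All figures are approximate numbers from this thread, not measurements.

GB_PER_DEFAULT_BATCH = 1.0   # ~1 GB per batch of 500 samples (thread's estimate)
DEFAULT_BATCH_SIZE = 500
DEFAULT_QUEUE_SIZE = 16      # assumed default learner_queue_size


def learner_queue_gb(train_batch_size, learner_queue_size):
    """Approximate GB held by a completely full learner queue.

    Assumes batch memory scales linearly with train_batch_size.
    """
    gb_per_batch = GB_PER_DEFAULT_BATCH * train_batch_size / DEFAULT_BATCH_SIZE
    return gb_per_batch * learner_queue_size


default_gb = learner_queue_gb(DEFAULT_BATCH_SIZE, DEFAULT_QUEUE_SIZE)
halved_gb = learner_queue_gb(250, 8)

print(f"defaults: ~{default_gb:.0f} GB, halved: ~{halved_gb:.0f} GB")
```

With these numbers a full queue takes roughly 16 GB at the defaults, in the same ballpark as the figure quoted above, and about 4 GB once both the batch size and the queue length are halved.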

I would suggest that you start by halving the train_batch_size (the default is 500), halving your learner_queue_size to 8, and decreasing your rollout_fragment_length so that the size of the samples in flight is small and you don’t get an OOM in your Ray object store (the default here is 50; turn it down to some factor of your train_batch_size).

If that doesn’t work, keep tuning these down by scales of 2 until it does.
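
The tuning loop described above can be sketched as a small helper that produces the successive settings to try. The function names are hypothetical, the starting values are the defaults mentioned in this post (with 16 assumed for learner_queue_size), and picking the largest divisor is just one way to keep rollout_fragment_length a factor of train_batch_size.

```python
# Sketch: generate successively halved settings to retry until the OOM goes away.
# Helper names are hypothetical; starting values are the defaults from this thread.

def largest_divisor_at_most(n, cap):
    """Largest divisor of n that is <= cap (never below 1)."""
    for d in range(max(1, min(cap, n)), 0, -1):
        if n % d == 0:
            return d
    return 1


def halving_schedule(train_batch_size=500, learner_queue_size=16,
                     rollout_fragment_length=50, steps=3):
    """Yield one dict of settings per retry, each roughly half the previous."""
    for _ in range(steps):
        train_batch_size = max(1, train_batch_size // 2)
        learner_queue_size = max(1, learner_queue_size // 2)
        # Keep the fragment length a factor of the batch size.
        rollout_fragment_length = largest_divisor_at_most(
            train_batch_size, max(1, rollout_fragment_length // 2)
        )
        yield {
            "train_batch_size": train_batch_size,
            "learner_queue_size": learner_queue_size,
            "rollout_fragment_length": rollout_fragment_length,
        }


schedule = list(halving_schedule())
for settings in schedule:
    print(settings)
```

Each dict could then be applied with the builder style used earlier in this thread, for example train_batch_size and learner_queue_size via .training(...) and rollout_fragment_length via .rollouts(...); where each setting lives may differ between RLlib versions, so check the config class you are using.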

ShadowDash · February 22, 2023, 12:36am

@avnishn Thank you, it makes perfect sense.
I have played around with the parameters that you suggested and I can get my model up and running now… thanks.
A new problem that arises from the same scenario of a complex action space is that now, during training, the mean_inference_ms is very high (around 1000).
Is there any tip you can think of that might help tackle that difficulty?

avnishn · February 22, 2023, 1:28am

For actions, you can reduce your space by using action masking.

There’s also the matter of your observations that are very large.

I guess the question here is, can you afford in some way to discretize your action and observation spaces further?

I doubt that at this size you’ll be able to train a policy to get any meaningful output without any additional tricks on your end to reduce the dimensions of your problem.

Can I ask, what is the problem that you are trying to phrase as an RL problem? It’s likely that using some heuristic we can decrease the dimensions.

ShadowDash · February 22, 2023, 5:23pm

@avnishn Thanks for the input!
Unfortunately I can’t share more details about the specific use case since it’s not a personal project.
I was just reading about action masking following the explanation here: ray/action_mask_model.py at master · ray-project/ray · GitHub
I’m still unsure how to use action masking since I’m a novice, but from what I’ve read so far, it doesn’t reduce the action space… it’s supposed to help with effective learning. Will it actually reduce inference time for the agent?
It would be a great solution if it speeds up inference, since around 90% of the actions would be masked, so there’s great potential.
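
For reference, here is a minimal sketch of the logit-masking idea used in the linked action_mask_model.py example, written in plain Python rather than with the real model classes. The helper names are hypothetical. It shows that the model still produces one logit per action; the mask only pushes the logits of invalid actions to a very large negative value so they end up with zero probability.

```python
import math

# Very large negative value standing in for "minus infinity" (float32 minimum).
FLOAT_MIN = -3.4e38


def mask_logits(logits, action_mask):
    """Add a huge negative penalty to the logit of every disallowed action."""
    return [
        logit + (0.0 if allowed else FLOAT_MIN)
        for logit, allowed in zip(logits, action_mask)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0, 4.0]      # raw model outputs, one per action
action_mask = [1, 0, 1, 0]         # 1 = allowed, 0 = masked out

masked = mask_logits(logits, action_mask)
probs = softmax(masked)

print(len(probs), probs)
```

In this sketch the output vector has the same length before and after masking, so masking by itself changes which actions can be sampled rather than how many values the network has to compute.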
The observation space can be seen as an image with around 30 channels, where each channel is a layer of information. I was thinking of trying a VisionNetwork as the model instead of fc. Do you think it might be effective? Could it speed up the inference?

Thanks again for helping!
