Hi,
First, a big disclaimer: I'm only a beginner with RLlib. I did read the docs and searched for examples online, but I seem to be stuck, and I would appreciate any help.
I'm using Google Colab (Pro, with high RAM) and a custom Gym environment to tackle a problem.
I have registered my environment and packaged all the required dependencies into a zip so the environment can be created remotely on the workers.
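For reference, the registration itself is nothing fancy; it looks roughly like this (my_env_package and MyEnv are stand-in names for my actual module and environment class, which live inside the zipped dependencies):

from ray.tune.registry import register_env

# Placeholder names: the real module and gym.Env subclass are in the zip.
from my_env_package import MyEnv

# "my_env" is the string I later pass to .environment(env=...)
register_env("my_env", lambda env_config: MyEnv(env_config))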
The environment's action and observation spaces are rather large: a MultiDiscrete observation vector of size 40,000 and a MultiDiscrete action vector of size 5,000.
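For context, this is roughly how the spaces are declared in Gym; the per-dimension cardinalities below are placeholders just to illustrate the shapes, my real environment uses different values per index:

import numpy as np
from gym import spaces

# Placeholder: each dimension is assumed here to take 10 possible values,
# only to show the sizes involved (40,000 obs dims, 5,000 action dims).
observation_space = spaces.MultiDiscrete(np.full(40_000, 10, dtype=np.int64))
action_space = spaces.MultiDiscrete(np.full(5_000, 10, dtype=np.int64))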
I'm trying to run IMPALA with the following setup:
from ray.rllib.algorithms.impala import ImpalaConfig

algo = (
    ImpalaConfig()
    .rollouts(num_rollout_workers=1, horizon=5000)  # a single rollout worker
    .resources(num_gpus=0)                          # CPU only
    .training(lr=0.0003, train_batch_size=4, replay_buffer_num_slots=2, minibatch_buffer_size=2)  # kept tiny on purpose while debugging
    .environment(env="my_env")
    .build()
)
The build above fails before completing: the worker is killed due to memory pressure with the following error:
(raylet) node_manager.cc:3097: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: eef8b40b7555cdc707e6e197c05e25175bed6decdd593e103a125ff6,…
Since I haven't even started training, I thought it might be related to memory allocation, maybe because of the large spaces. That's why I set such small buffer values, just to test it, but it still runs OOM.
Any help is appreciated. What could have made it go OOM before any training? It was just the setup.