Memory explosion with TuneSearchCV

I’m trying to run TuneSearchCV with a modest number of trials, but the server’s memory is quickly exhausted and a crash follows. The server is quite large, so I wouldn’t have expected this to be an issue. Is there a way to limit how much memory TuneSearchCV can use?

Can you provide some context about the issue that you’re seeing?

Is Ray killing the server via its OOM detection? How big is your data? Can you perhaps limit the amount of parallelism?

Sure! Here you go.

TuneSearchCV params: {'n_jobs': -1, 'n_trials': 10, 'search_optimization': 'bayesian', 'cv': 5}

In the past I have also received 10% and 5% memory warnings from Ray before the crashes with an OOM error, but in my experience the warning and the subsequent error aren’t guaranteed to appear. The training data is not particularly large, perhaps 100 MB at most.

2021-02-19 10:19:04,974 ERROR -- Trial _Trainable_7e1b3676: Error processing event.
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node dlaiapp03 is used (119.83 / 125.82 GB). The top 10 memory consumers are:

9035    6.62GiB /home/leb/anaconda3/envs/new_editing/bin/python /home/leb/anaconda3/envs/new_editing/bin/editing tra
9158    1.2GiB  ray::_Trainable.train_buffered()
9160    1.2GiB  ray::_Trainable.train_buffered()
9157    1.17GiB ray::_Trainable
9221    1.17GiB ray::_Trainable.train_buffered()
9212    1.17GiB ray::_Trainable
4564    1.16GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4631    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4512    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4627    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --

In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store.
--- Tip: Use the `ray memory` command to list active objects in the cluster.
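
For reading the consumer list above, a minimal Python sketch that parses the ten rows and totals them by process type. The grouping into "ray", "loky" and "other" is an assumption made for illustration, and the log header reports GB while the rows report GiB, so the comparison with the 119.83 GB figure is only approximate.

```python
import re

# The ten rows from the error message, copied verbatim (truncated commands left as-is).
LOG = """\
9035    6.62GiB /home/leb/anaconda3/envs/new_editing/bin/python /home/leb/anaconda3/envs/new_editing/bin/editing tra
9158    1.2GiB  ray::_Trainable.train_buffered()
9160    1.2GiB  ray::_Trainable.train_buffered()
9157    1.17GiB ray::_Trainable
9221    1.17GiB ray::_Trainable.train_buffered()
9212    1.17GiB ray::_Trainable
4564    1.16GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4631    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4512    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4627    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
"""

# One row: PID, size in GiB, command line.
ROW = re.compile(r"^(\d+)\s+([\d.]+)GiB\s+(.*)$")


def parse_consumers(text):
    """Return a list of (pid, gib, command) tuples for every matching row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


def summarize(rows):
    """Total the memory per process type (grouping chosen for illustration)."""
    totals = {"ray": 0.0, "loky": 0.0, "other": 0.0}
    for _pid, gib, command in rows:
        if command.startswith("ray::"):
            totals["ray"] += gib
        elif "loky" in command:
            totals["loky"] += gib
        else:
            totals["other"] += gib
    return totals


rows = parse_consumers(LOG)
totals = summarize(rows)
listed = sum(totals.values())

for kind, gib in totals.items():
    print(f"{kind:>5}: {gib:.2f} GiB")
print(f"listed total: {listed:.2f} GiB")
```

The ten listed processes add up to about 17.1 GiB, far short of the roughly 120 GB reported as used. The rest must be held elsewhere, for example by many similarly sized worker processes that did not make the top 10, or by memory that this list does not attribute to a process.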

Hmm, that seems very odd - do you have a script I can run for reproduction?

What if you try n_jobs=2?

It’s also odd that the joblib loky workers show up in the process list.

It still crashes with n_jobs=2. However, I haven’t been able to reproduce the error with a much simpler script, so the problem seems to be on my side. I load some large text embedding objects into memory before calling TuneSearchCV (although they are not directly used by the model I’m fitting). Does Ray copy the local variables to each worker environment? Perhaps this is why the memory is blowing up?

Hmm, the large embedding object shouldn’t get moved: Ray doesn’t copy local variables to each worker environment.

If you comment out the code that loads the text embedding object, does memory usage still blow up?
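
A related check, sketched under stated assumptions: unreferenced locals are not shipped, but anything reachable from an object that is sent to a worker gets serialized along with it. The snippet below uses the standard-library `pickle` as a rough proxy for Ray's own serialization (an assumption, since Ray uses its own pickling layer), and the `embeddings`, `lean_state` and `heavy_state` names are hypothetical stand-ins rather than anything from the original script.

```python
import pickle


def pickled_mib(obj):
    """Size of obj's pickle in MiB; a rough proxy for what a worker would receive."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)) / 2**20


# Hypothetical stand-in for the large text embedding object (20 MiB of zeros).
embeddings = bytes(20 * 2**20)

# Hypothetical estimator state, modelled as plain dicts:
# one that does not reference the embeddings and one that does.
lean_state = {"alpha": 0.1}
heavy_state = {"alpha": 0.1, "embeddings": embeddings}

lean_mib = pickled_mib(lean_state)
heavy_mib = pickled_mib(heavy_state)

print(f"without reference: {lean_mib:.3f} MiB")
print(f"with reference:    {heavy_mib:.3f} MiB")
```

If the estimator, the training data, or a custom scorer passed to TuneSearchCV holds a reference to the embeddings, its serialized size would grow in the same way, which would be worth ruling out before looking elsewhere.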