Memory explosion with TuneSearchCV

I’m trying to run TuneSearchCV with a modest number of trials, but it quickly exhausts the server’s memory and the run crashes. The server is quite large, so I wouldn’t have expected this to be an issue. Is there a way to limit how much memory TuneSearchCV can use?

Can you provide some context about the issue that you’re seeing?

Is Ray’s OOM detection killing the job? How big is your data? Can you perhaps limit the amount of parallelism?

Sure! Here you go.

TuneSearchCV params: {'n_jobs': -1, 'n_trials': 10, 'search_optimization': 'bayesian', 'cv': 5}
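
For context, the call looks roughly like this (a minimal sketch: the data, estimator, and search space below are placeholders, not my actual pipeline):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

# Placeholder data and estimator; the real inputs are built in pipeline.py.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

clf = TuneSearchCV(
    SGDClassifier(),
    param_distributions={"alpha": (1e-4, 1e-1)},  # placeholder search space
    n_jobs=-1,                       # run trials on all available CPU cores
    n_trials=10,
    search_optimization="bayesian",  # Bayesian search via scikit-optimize
    cv=5,
)
clf.fit(X, y)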

In the past I have also received 10% and 5% memory warnings from Ray before it crashes with an OOM error, though in my experience the warnings and subsequent error don’t always appear. The training data is not particularly large, perhaps 100 MB at most.


2021-02-19 10:19:04,974 ERROR trial_runner.py:613 -- Trial _Trainable_7e1b3676: Error processing event.
Traceback (most recent call last):
  File "/home/leb/anaconda3/envs/new_editing/bin/editing", line 33, in <module>
    sys.exit(load_entry_point('editing', 'console_scripts', 'editing')())
  File "/home/leb/editing/editing/cli.py", line 31, in main
    COMMANDS[args.command]()
  File "/home/leb/editing/editing/pipeline/pipeline.py", line 251, in main
    **config)
  File "/home/leb/editing/editing/pipeline/util.py", line 97, in run
    ret = func(*args, **kwargs)
  File "/home/leb/editing/editing/pipeline/pipeline.py", line 205, in train
    clf.fit(articles_train, train['label'])
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 664, in fit
    result = self._fit(X, y, groups, **fit_params)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 565, in _fit
    analysis = self._tune_run(config, resources_per_trial)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_search.py", line 715, in _tune_run
    analysis = tune.run(trainable, **run_args)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 402, in step
    self._process_events(timeout=timeout)  # blocking
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 560, in _process_events
    self._process_trial(trial)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::_Trainable.train_buffered() (pid=9157, ip=10.10.0.40)
  File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/memory_monitor.py", line 132, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node dlaiapp03 is used (119.83 / 125.82 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
9035    6.62GiB /home/leb/anaconda3/envs/new_editing/bin/python /home/leb/anaconda3/envs/new_editing/bin/editing tra
9158    1.2GiB  ray::_Trainable.train_buffered()
9160    1.2GiB  ray::_Trainable.train_buffered()
9157    1.17GiB ray::_Trainable
9221    1.17GiB ray::_Trainable.train_buffered()
9212    1.17GiB ray::_Trainable
4564    1.16GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4631    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4512    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4627    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --

In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
---

Hmm, that seems very odd. Do you have a script I can run to reproduce this?

What if you try n_jobs=2?

It’s also odd that the joblib/loky worker processes show up in the process list.

It still crashes with n_jobs=2; however, I haven’t been able to reproduce the error with a much simpler script, so it seems to be something on my side. I load some large text-embedding objects into memory before calling TuneSearchCV (although they are not directly used by the model I’m fitting). Does Ray copy local variables to each worker environment? Perhaps that is why the memory is blowing up?

Hmm, the large embedding objects shouldn’t get moved. (Ray doesn’t copy the driver’s local variables to each worker environment.)
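
To illustrate (a standalone sketch, not tune-sklearn internals): a Ray worker only receives what is explicitly passed to a remote function or captured by it, so an object that merely sits in the driver’s memory isn’t shipped anywhere.

import numpy as np
import ray

ray.init(ignore_reinit_error=True)

# A large array that lives only in the driver process.
big = np.zeros((5_000, 5_000))  # roughly 200 MB of float64

@ray.remote
def add_one(x):
    # `big` is not referenced here, so it is never serialized or copied
    # to the worker that runs this task.
    return x + 1

print(ray.get(add_one.remote(41)))  # -> 42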

If you comment out the text-embedding objects, does the memory usage still blow up?
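
Another quick check (just a sketch; pickled_size_mb and my_estimator are placeholder names): measure the serialized size of whatever you pass to TuneSearchCV before fitting. Ray serializes with cloudpickle, so if the embeddings are reachable from the estimator, directly or through a closure, they will show up in that size and get shipped to every trial.

import cloudpickle

def pickled_size_mb(obj) -> float:
    # Rough proxy for how much data travels with `obj` when Ray serializes it.
    return len(cloudpickle.dumps(obj)) / 1e6

# Placeholder usage; substitute the estimator you actually pass to TuneSearchCV:
# print(f"estimator serializes to ~{pickled_size_mb(my_estimator):.1f} MB")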