Memory explosion with TuneSearchCV

I’m trying to run TuneSearchCV with a modest number of trials, but it quickly exhausts the server’s memory and the run crashes. The server is quite large, so I wouldn’t have expected this to be an issue. Is there a way to limit how much memory TuneSearchCV can use?

Can you provide some context about the issue that you’re seeing?

Is Ray’s OOM detection killing the job? How big is your data? Can you perhaps limit the amount of parallelism?

Sure! Here you go.

TuneSearchCV params: {'n_jobs': -1, 'n_trials': 10, 'search_optimization': 'bayesian', 'cv': 5}
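
For context, the call looks roughly like this (a minimal sketch: the data, estimator, and search space below are placeholders, not my actual pipeline):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

# Placeholder data and estimator; the real inputs are built in pipeline.py.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

clf = TuneSearchCV(
    SGDClassifier(),
    param_distributions={"alpha": (1e-4, 1e-1)},  # placeholder search space
    n_jobs=-1,                       # run trials on all available CPU cores
    n_trials=10,
    search_optimization="bayesian",  # Bayesian search via scikit-optimize
    cv=5,
)
clf.fit(X, y)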

In the past I have also received 10% and 5% memory warnings from Ray before it crashes with an OOM error, though in my experience the warnings and subsequent error don’t always appear. The training data is not particularly large, perhaps 100 MB at most.


2021-02-19 10:19:04,974 ERROR trial_runner.py:613 -- Trial _Trainable_7e1b3676: Error processing event.
Traceback (most recent call last):
  File "/home/leb/anaconda3/envs/new_editing/bin/editing", line 33, in <module>
    sys.exit(load_entry_point('editing', 'console_scripts', 'editing')())
  File "/home/leb/editing/editing/cli.py", line 31, in main
    COMMANDS[args.command]()
  File "/home/leb/editing/editing/pipeline/pipeline.py", line 251, in main
    **config)
  File "/home/leb/editing/editing/pipeline/util.py", line 97, in run
    ret = func(*args, **kwargs)
  File "/home/leb/editing/editing/pipeline/pipeline.py", line 205, in train
    clf.fit(articles_train, train['label'])
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 664, in fit
    result = self._fit(X, y, groups, **fit_params)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 565, in _fit
    analysis = self._tune_run(config, resources_per_trial)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_search.py", line 715, in _tune_run
    analysis = tune.run(trainable, **run_args)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 402, in step
    self._process_events(timeout=timeout)  # blocking
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 560, in _process_events
    self._process_trial(trial)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::_Trainable.train_buffered() (pid=9157, ip=10.10.0.40)
  File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
  File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/memory_monitor.py", line 132, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node dlaiapp03 is used (119.83 / 125.82 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
9035    6.62GiB /home/leb/anaconda3/envs/new_editing/bin/python /home/leb/anaconda3/envs/new_editing/bin/editing tra
9158    1.2GiB  ray::_Trainable.train_buffered()
9160    1.2GiB  ray::_Trainable.train_buffered()
9157    1.17GiB ray::_Trainable
9221    1.17GiB ray::_Trainable.train_buffered()
9212    1.17GiB ray::_Trainable
4564    1.16GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4631    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4512    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4627    1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --

In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
---

Hmm, that seems very odd. Do you have a script I can run to reproduce this?

What if you try n_jobs=2?

It’s also odd that the joblib/loky worker processes show up in the process list.

It still crashes with n_jobs=2; however, I haven’t been able to reproduce the error with a much simpler script, so it seems to be something on my side. I load some large text-embedding objects into memory before calling TuneSearchCV (although they are not directly used by the model I’m fitting). Does Ray copy local variables to each worker environment? Perhaps that is why the memory is blowing up?

Hmm, the large embedding objects shouldn’t get moved. (Ray doesn’t copy the driver’s local variables to each worker environment.)
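
To illustrate (a standalone sketch, not tune-sklearn internals): a Ray worker only receives what is explicitly passed to a remote function or captured by it, so an object that merely sits in the driver’s memory isn’t shipped anywhere.

import numpy as np
import ray

ray.init(ignore_reinit_error=True)

# A large array that lives only in the driver process.
big = np.zeros((5_000, 5_000))  # roughly 200 MB of float64

@ray.remote
def add_one(x):
    # `big` is not referenced here, so it is never serialized or copied
    # to the worker that runs this task.
    return x + 1

print(ray.get(add_one.remote(41)))  # -> 42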

If you comment out the text-embedding objects, does the memory usage still blow up?
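
Another quick check (just a sketch; pickled_size_mb and my_estimator are placeholder names): measure the serialized size of whatever you pass to TuneSearchCV before fitting. Ray serializes with cloudpickle, so if the embeddings are reachable from the estimator, directly or through a closure, they will show up in that size and get shipped to every trial.

import cloudpickle

def pickled_size_mb(obj) -> float:
    # Rough proxy for how much data travels with `obj` when Ray serializes it.
    return len(cloudpickle.dumps(obj)) / 1e6

# Placeholder usage; substitute the estimator you actually pass to TuneSearchCV:
# print(f"estimator serializes to ~{pickled_size_mb(my_estimator):.1f} MB")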