I’m trying to run TuneSearchCV with a modest number of trials, but server memory is quickly exhausted and the run eventually crashes. The server is quite large, so I wouldn’t have expected this to be an issue. Is there a way to limit how much memory TuneSearchCV can use?
Can you provide some context about the issue that you’re seeing?
Is Ray killing the run with its OOM detection? How big is your data? Can you perhaps limit the amount of parallelism?
Sure! Here you go.
TuneSearchCV params: {'n_jobs': -1, 'n_trials': 10, 'search_optimization': 'bayesian', 'cv': 5}
In the past I have also received 10% and 5% remaining-memory warnings from Ray before it crashes with the OOM error, though in my experience those warnings don’t always appear before the crash. The training data is not particularly large, perhaps 100 MB at most.
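The call looks roughly like this (the estimator and search space below are simplified stand-ins, not my actual pipeline), and the traceback from the crash follows.

from tune_sklearn import TuneSearchCV
from sklearn.linear_model import SGDClassifier

# Simplified stand-in for the real estimator and search space.
clf = TuneSearchCV(
    SGDClassifier(),
    param_distributions={"alpha": (1e-4, 1e-1)},  # (low, high) range for bayesian search
    n_trials=10,
    n_jobs=-1,                       # use all cores
    search_optimization="bayesian",
    cv=5,
)
clf.fit(articles_train, train["label"])  # articles_train / train come from my pipeline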
2021-02-19 10:19:04,974 ERROR trial_runner.py:613 -- Trial _Trainable_7e1b3676: Error processing event.
Traceback (most recent call last):
File "/home/leb/anaconda3/envs/new_editing/bin/editing", line 33, in <module>
sys.exit(load_entry_point('editing', 'console_scripts', 'editing')())
File "/home/leb/editing/editing/cli.py", line 31, in main
COMMANDS[args.command]()
File "/home/leb/editing/editing/pipeline/pipeline.py", line 251, in main
**config)
File "/home/leb/editing/editing/pipeline/util.py", line 97, in run
ret = func(*args, **kwargs)
File "/home/leb/editing/editing/pipeline/pipeline.py", line 205, in train
clf.fit(articles_train, train['label'])
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 664, in fit
result = self._fit(X, y, groups, **fit_params)
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 565, in _fit
analysis = self._tune_run(config, resources_per_trial)
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/tune_sklearn/tune_search.py", line 715, in _tune_run
analysis = tune.run(trainable, **run_args)
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
runner.step()
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 402, in step
self._process_events(timeout=timeout) # blocking
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 560, in _process_events
self._process_trial(trial)
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::_Trainable.train_buffered() (pid=9157, ip=10.10.0.40)
File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
File "/home/leb/anaconda3/envs/new_editing/lib/python3.7/site-packages/ray/memory_monitor.py", line 132, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node dlaiapp03 is used (119.83 / 125.82 GB). The top 10 memory consumers are:
PID MEM COMMAND
9035 6.62GiB /home/leb/anaconda3/envs/new_editing/bin/python /home/leb/anaconda3/envs/new_editing/bin/editing tra
9158 1.2GiB ray::_Trainable.train_buffered()
9160 1.2GiB ray::_Trainable.train_buffered()
9157 1.17GiB ray::_Trainable
9221 1.17GiB ray::_Trainable.train_buffered()
9212 1.17GiB ray::_Trainable
4564 1.16GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4631 1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4512 1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
4627 1.15GiB /home/leb/anaconda3/envs/new_editing/bin/python -m joblib.externals.loky.backend.popen_loky_posix --
In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
---
Hmm, that seems very odd - do you have a script I can run for reproduction?
What if you try n_jobs=2?
It’s also odd that the joblib/loky worker processes show up in the process list.
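Something like this is what I mean by limiting parallelism (just a sketch; estimator and param_space are placeholders for your own objects):

# Cap concurrency: at most two trials (and therefore two fits) run at once.
clf = TuneSearchCV(
    estimator,
    param_distributions=param_space,
    n_trials=10,
    n_jobs=2,                        # was -1 (all cores)
    search_optimization="bayesian",
    cv=5,
)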
It still crashes with n_jobs=2. However, I haven’t been able to reproduce the error with a much simpler script, so it seems to be something on my side. I load some large text embedding objects into memory before calling TuneSearchCV (although they are not directly used by the model I’m fitting). Does Ray copy local variables to each worker environment? Perhaps that is why the memory is blowing up?
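For context, the training step looks roughly like this (load_embeddings and pipeline are placeholders, not my real code):

# Rough shape of the training step; names are placeholders.
embeddings = load_embeddings()       # large objects, kept in memory for other steps
# ... embeddings are used elsewhere, not by the estimator being tuned ...

clf = TuneSearchCV(pipeline, param_space, n_trials=10, n_jobs=2,
                   search_optimization="bayesian", cv=5)
clf.fit(articles_train, train["label"])  # the OOM happens during this call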
Hmm, it shouldn’t be the case that the large embedding object gets moved. (Ray doesn’t copy local variables to each worker env).
If you comment out the text embedding object, does memory usage still blow up?
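As a rough way to check (untested sketch, assuming psutil is available), you could log the driver process’s resident memory around the fit, once with and once without the embeddings loaded:

import os
import psutil

proc = psutil.Process(os.getpid())
print(f"RSS before fit: {proc.memory_info().rss / 1e9:.2f} GB")

# embeddings = load_embeddings()   # comment this out for the comparison run
clf.fit(articles_train, train["label"])

print(f"RSS after fit: {proc.memory_info().rss / 1e9:.2f} GB")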