Hello,
I’m trying to run a simple Ray Tune pipeline for hyperparameter optimization, all on one node for now. The model itself is simple and CPU-only.
When I run with a small enough number of samples that my node’s RAM can handle all of the trials at once (e.g. 2), everything runs to completion. However, when I increase the number of samples past the point where they can all fit in memory simultaneously, the trials don’t queue like I would expect. Instead they all start at once, most of them crash with “OSError: [Errno 12] Cannot allocate memory”, and the ray status looks something like this:
Here’s what my tune.run() code looks like:
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

# Early stopping based on valid_recall, reported each training iteration
asha_scheduler = ASHAScheduler(time_attr='training_iteration', metric='valid_recall', mode='max')
reporter = CLIReporter(metric_columns=["valid_recall"])

ray.init(dashboard_port=6007)

result = tune.run(
    train,
    local_dir=config['out_directory'],
    resources_per_trial={"cpu": 1, "gpu": 0},  # each trial reserves 1 CPU and no GPU
    config=config,
    num_samples=15,
    progress_reporter=reporter,
    scheduler=asha_scheduler,
    verbose=3,
    queue_trials=True,
)
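(For context, train is a function trainable that reports valid_recall each training iteration. I’ve left out the data loading and model code, so this is only a simplified sketch of its shape; evaluate_model and the fixed epoch count here are just stand-ins for my real code:)

def train(config):
    for epoch in range(10):  # real code uses a configurable epoch count
        # ... fit the CPU-only model for one epoch ...
        recall = evaluate_model(config)  # placeholder for my actual evaluation step
        tune.report(valid_recall=recall)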
config is a dictionary containing both the training parameters that don’t change and the ones for Ray Tune to search over. I’ve tried with and without queue_trials. Sorry if I’m missing something basic here, but is there a way to get all of the trials to run, launching only as many at a time as will fit within the node’s memory?
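For instance, would something like the following be the right approach? I’m not sure whether the "memory" key in resources_per_trial is actually honored as a scheduling constraint (so this is just a guess on my part), but the idea would be to reserve roughly the per-trial RAM so that Ray only schedules as many trials as fit on the node at once:

# Guess: reserve ~4 GB of RAM per trial in addition to 1 CPU, so trials queue
# once the node's memory is fully reserved rather than all starting together.
result = tune.run(
    train,
    local_dir=config['out_directory'],
    resources_per_trial={"cpu": 1, "gpu": 0, "memory": 4 * 1024 ** 3},
    config=config,
    num_samples=15,
    progress_reporter=reporter,
    scheduler=asha_scheduler,
    verbose=3,
    queue_trials=True,
)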
Thank you very much