Most runs immediately failing with "out of memory"

Hello,

I’m trying to run a simple Ray Tune pipeline for hyperparameter optimization, all on one node for now, with a small CPU-only model.

When I run with a small enough number of samples that my node’s RAM can handle all the jobs at once (e.g. 2), everything runs to completion. However, when I increase the number of samples past the point where they can all fit in memory simultaneously, the jobs don’t queue like I would expect. Instead they all start at once and most of them crash with “OSError: [Errno 12] Cannot allocate memory”. (ray status output omitted.)

Here’s what my tune.run() code looks like:

import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

asha_scheduler = ASHAScheduler(time_attr='training_iteration', metric='valid_recall', mode='max')

reporter = CLIReporter(metric_columns=["valid_recall"])

ray.init(dashboard_port=6007)
result = tune.run(
    train,
    local_dir=config['out_directory'],
    resources_per_trial={"cpu": 1, "gpu": 0},
    config=config,
    num_samples=15,
    progress_reporter=reporter,
    scheduler=asha_scheduler,
    verbose=3,
    queue_trials=True,
)

config is a dictionary containing both the fixed training parameters and the ones for Ray Tune to search over. I’ve tried with and without queue_trials. Sorry if I’m missing something basic here, but is there a way to get all trials to run, launching only as many at a time as will stay within the node’s memory?

Thank you very much

Hi @gkreder, can you try using a ConcurrencyLimiter to limit the number of concurrently running trials?

queue_trials is only used to trigger autoscaling behavior on a cluster and will be deprecated soon.
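
For example, a minimal sketch of wrapping a searcher in a ConcurrencyLimiter (using HyperOptSearch purely as an illustration; the metric and mode values match the ASHAScheduler above):

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

# At most two trials run at the same time; the rest wait for a free slot.
search_alg = HyperOptSearch(metric="valid_recall", mode="max")
search_alg = ConcurrencyLimiter(search_alg, max_concurrent=2)

result = tune.run(
    train,
    config=config,
    num_samples=15,
    search_alg=search_alg,
)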


I’m sorry, is there a way I can wrap ConcurrencyLimiter around a basic random searcher?

I’m using tune.sample_from in my configs, so I can’t use HyperOpt for the search_alg. I have this right now:

from ray.tune.suggest import ConcurrencyLimiter

search_alg = tune.create_searcher('random')
search_alg = ConcurrencyLimiter(search_alg, max_concurrent=2)

But I get an error (traceback omitted).

Hm, it seems that’s a bug. I’ll fix this tomorrow. In the meantime, can you try this hack:

from ray.tune.suggest import ConcurrencyLimiter

search_alg = tune.suggest.basic_variant.BasicVariantGenerator()
search_alg.mode = None
search_alg = ConcurrencyLimiter(search_alg, max_concurrent=2)

Alternatively, set cpu in resources_per_trial to a higher value, e.g. 16. This will also limit how many trials run at once, since each trial reserves that many CPUs (the CPUs are only reserved, not actually used, so other tasks still have access to them).
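
For illustration, a minimal sketch of the resource-based approach (this assumes a 32-CPU node, so at most two trials fit at a time; adjust the cpu value to your machine):

from ray import tune

# Each trial reserves 16 CPUs. On a 32-CPU node, Ray Tune can only
# schedule two trials concurrently; the remaining trials wait in the queue.
result = tune.run(
    train,
    config=config,
    num_samples=15,
    resources_per_trial={"cpu": 16, "gpu": 0},
)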


Awesome, thank you very much @kai! In the meantime, both setting cpu=16 per trial and using HyperOpt + ConcurrencyLimiter seem to be working.

Just for reference, this has been fixed here: [tune] add `max_concurrent` option to BasicVariantGenerator by krfricke · Pull Request #15680 · ray-project/ray · GitHub
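
Based on the PR title, a sketch of what this enables once merged (the exact signature is an assumption; see the PR for details):

from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

# With the fix, the default searcher accepts max_concurrent directly,
# so no ConcurrencyLimiter wrapper is needed for a basic random search.
search_alg = BasicVariantGenerator(max_concurrent=2)
result = tune.run(
    train,
    config=config,
    num_samples=15,
    search_alg=search_alg,
)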
