I’m using Ray Tune with BOHB to tune neural networks. With one architecture (a shallow net) the experiments complete fine, but with a ResNet the Slurm scheduler on our cluster keeps killing the experiments because Ray exceeds the job’s (CPU) memory limit. I tried setting the resources in the tuner, but it’s as if Ray ignores them.
I’m running on a node with 2 CPUs and 4 GPUs. I want to run 2 concurrent trials with 1 GPU per trial. I set up my tuner like this:
tuner = tune.Tuner(
    tune.with_resources(train_candidate, {"cpu": 1, "gpu": 1}),
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        scheduler=scheduler,
        num_samples=num_samples,
        search_alg=algo,
    ),
    run_config=ray.air.RunConfig(
        local_dir="ray_results/",
        name=f"{name}",
        log_to_file=True,
        stop={"training_iteration": max_epochs},
    ),
)
How can I set memory limits for Ray so that the job doesn’t get killed by the Slurm scheduler?
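Is something like the sketch below the intended approach? The memory values are placeholders I made up, and I’m not sure whether the "memory" resource is actually enforced per trial or only used for scheduling:

import ray
from ray import tune

# Cap Ray's memory pools when starting the local "cluster"
# (object store size here is a guess, in bytes).
ray.init(
    num_cpus=2,
    num_gpus=4,
    object_store_memory=4 * 1024**3,
)

# Request a per-trial memory reservation alongside CPU/GPU
# (8 GiB per trial is a placeholder value).
trainable = tune.with_resources(
    train_candidate,
    {"cpu": 1, "gpu": 1, "memory": 8 * 1024**3},
)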