Ray Tune exceeding memory -- how to set a limit?

I’m using Ray Tune with BOHB to tune neural networks. With one architecture (a shallow net) the experiments complete fine, but with ResNet the Slurm scheduler on our cluster keeps killing the jobs because Ray exceeds the (CPU) memory limit. I tried setting the resources in the tuner, but Ray seems to ignore them.

I’m running on a node with 2 CPUs and 4 GPUs. I want to run 2 concurrent trials with 1 GPU per trial. I set up my tuner like this:

tuner = tune.Tuner(
    tune.with_resources(train_candidate, {"cpu": 1, "gpu": 1}),
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        scheduler=scheduler,
        num_samples=num_samples,
        search_alg=algo,
    ),
    run_config=ray.air.RunConfig(
        local_dir="ray_results/",
        name=f"{name}",
        log_to_file=True,
        stop={"training_iteration": max_epochs},
    ),
)

How can I set memory limits for Ray so the job doesn’t get killed by the scheduler?

Hi @JuliaWasala

In Ray Tune, you can specify a per-trial memory requirement through

tune.with_resources(train_candidate, {"cpu": 1, "gpu": 1, "memory": <memory_limit_in_bytes>})

However, specifying a memory requirement does NOT impose any limits on memory usage. The requirements are used for admission control during scheduling only (similar to how CPU scheduling works in Ray). It is up to the task itself to not use more memory than it requested. So you should use instances with large RAM or reduce memory usage of your trainable.
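As a concrete sketch based on the tuner from your question (the 16 GiB figure is only a placeholder, not a recommendation), the request goes right where the CPU/GPU resources are already declared:

from ray import tune

# Reserve ~16 GiB of logical memory per trial (value in bytes, placeholder).
# This only affects admission control: Ray will not co-schedule a second
# trial if the combined requests exceed the node's memory, but it will NOT
# kill a trial that ends up using more memory than it requested.
trainable = tune.with_resources(
    train_candidate,
    {"cpu": 1, "gpu": 1, "memory": 16 * 1024**3},
)

With 1 CPU and 16 GiB requested per trial, two trials will only run concurrently if the node has roughly 32 GiB available to Ray.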

More info about Physical Resources and Logical Resources

@JuliaWasala Have you found a solution?
For now I’m going to train the models serially (it will take a while, but I’m limited by memory as well) by setting cpu to the number of cores on my machine.
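Roughly what I have in mind is below. It’s just a sketch (train_model stands in for my own trainable, and max_concurrent_trials requires a Ray version whose TuneConfig supports it), but either claiming all cores or capping concurrency should keep only one trial in memory at a time:

import os

from ray import tune

tuner = tune.Tuner(
    # Claiming every core means Ray cannot admit a second trial...
    tune.with_resources(train_model, {"cpu": os.cpu_count()}),
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        num_samples=num_samples,
        # ...and max_concurrent_trials=1 states the same intent directly.
        max_concurrent_trials=1,
    ),
)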