I’m using Ray Tune with BOHB to tune neural networks. With one architecture (a shallow net) the experiments complete fine, but with a ResNet the Slurm scheduler on our cluster keeps killing the experiments because Ray exceeds the job’s (CPU) memory limit. I tried setting the resources in the tuner, but it’s as if Ray ignores them.
I’m running on a node with 2 CPUs and 4 GPUs. I want to run 2 concurrent trials with 1 GPU per trial. I set up my tuner like this:
tuner = tune.Tuner(
    tune.with_resources(train_candidate, {"cpu": 1, "gpu": 1}),
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        scheduler=scheduler,
        num_samples=num_samples,
        search_alg=algo,
    ),
    run_config=ray.air.RunConfig(
        local_dir="ray_results/",
        name=f"{name}",
        log_to_file=True,
        stop={"training_iteration": max_epochs},
    ),
)
How can I set memory limits for Ray so that the job doesn’t get killed by the Slurm scheduler?
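Is something like the sketch below the intended approach? The memory values are placeholders I made up, and I’m not sure whether the "memory" resource is actually enforced per trial or only used for scheduling:

import ray
from ray import tune

# Cap Ray's memory pools when starting the local "cluster"
# (object store size here is a guess, in bytes).
ray.init(
    num_cpus=2,
    num_gpus=4,
    object_store_memory=4 * 1024**3,
)

# Request a per-trial memory reservation alongside CPU/GPU
# (8 GiB per trial is a placeholder value).
trainable = tune.with_resources(
    train_candidate,
    {"cpu": 1, "gpu": 1, "memory": 8 * 1024**3},
)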