Hello,
I am running some experiments with Tune on a machine with 16 CPU cores and 1 GPU, configured with --cpus-per-trial 5 and --gpus-per-trial 0.5. I am obliged to give each trial half of the GPU because I notice severe CPU RAM overhead: the tune.run() call leaves many trials pending (17).
In the system monitor I indeed see memory growing rapidly, with processes that occupy RAM but are not actually running, since only two trials at a time fit on the GPU.
So basically, I can only run two trials in parallel, and my machine runs out of CPU RAM well before it runs out of GPU RAM. I would like to decrease the number of pending trials so that I can exploit the GPU to its full potential.
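To make the arithmetic explicit (a minimal back-of-the-envelope sketch, not Tune code, assuming Tune simply starts as many trials as fit into the requested CPU/GPU budget):

total_cpus, total_gpus = 16, 1
cpus_per_trial, gpus_per_trial = 5, 0.5

max_by_cpu = total_cpus // cpus_per_trial      # 3 trials fit by CPU
max_by_gpu = int(total_gpus / gpus_per_trial)  # 2 trials fit by GPU
print(min(max_by_cpu, max_by_gpu))             # 2 trials run; the rest sit pending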
I checked the documentation and found that the environment variable TUNE_MAX_PENDING_TRIALS_PG can be set to change this behavior.
So far, I haven’t succeeded, though, and I would greatly appreciate any help.
For example, the naive modification at the top of the script shown below doesn't change anything:
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"
...
result = tune.run(
ray_train_classifier,
name=args.exp_name,
resources_per_trial={"cpu": args.cpus_per_trial, "gpu": args.gpus_per_trial},
config=config,
num_samples=args.num_trials,
progress_reporter=reporter,
local_dir=args.log_dir,
checkpoint_at_end=args.checkpoint_at_end,
resume=args.resume
)
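Could import order matter here? A minimal sketch of what I mean, assuming (I am not sure) that the variable is only picked up if it is defined before Ray Tune is first imported:

import os

# Set the variable before anything from Ray is imported, in case it is
# read at import time (an assumption on my part, not confirmed by the docs).
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"

from ray import tune  # imported only after the variable is set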