Dear Ray community,
I am trying to train a model while tuning hyperparameters such as batch size, layer size, and layer channels, so my trials have varying GPU memory requirements. To speed things up, I plan to start all trials with a logical GPU requirement of 0.4 and monitor the peak memory usage of each run. When reporting to Ray, I also log the memory requirements. I adopted the ResourceChangingScheduler to check whether the memory actually used was lower than what had been allocated.
def resources_allocation_function(tune_controller, trial, result, scheduler):
    """Adjusts the resource allocation for a trial based on its GPU memory utilization.

    Args:
        tune_controller: Trial runner for this Tune run.
        trial: The trial to allocate new resources to.
        result: The latest results of the trial.
        scheduler: The scheduler calling the function.
    """
    if result["training_iteration"] < 1:
        return None

    base_resources = scheduler._base_trial_resources or PlacementGroupFactory(
        [{"CPU": 1, "GPU": 0}]
    )
    current_gpu = base_resources.required_resources.get("GPU", 0)

    peak_gpu_memory = result.get("peak_gpu_mem", 0)
    total_gpu_memory = result.get("total_gpu_mem", 1)
    gpu_utilization = peak_gpu_memory / total_gpu_memory

    if gpu_utilization >= 0.75 * current_gpu:
        print("Not changing GPU resource allocation")
        return None

    new_gpu_allocation = min(round(gpu_utilization * 1.2, 3), 1)
    print(f"Changing GPU allocation from {current_gpu} to {new_gpu_allocation}")
    return PlacementGroupFactory(
        [{"CPU": base_resources.required_resources.get("CPU", 1), "GPU": new_gpu_allocation}]
    )
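For context, the peak_gpu_mem and total_gpu_mem values that this function reads from result are reported from inside my trainable, roughly along these lines (simplified sketch, assuming PyTorch and the Ray 2.x train.report API; the actual training step is omitted):

import torch
from ray import train  # older Ray versions would use tune.report(...) instead


def train_model(config):
    # Simplified sketch of the reporting loop, not the full training code.
    device = torch.device("cuda")
    total_gpu_mem = torch.cuda.get_device_properties(device).total_memory
    torch.cuda.reset_peak_memory_stats(device)

    for epoch in range(config["EPOCHS"]):
        # ... one epoch of training and validation happens here ...
        validation_loss = 0.0  # placeholder for the real validation loss

        # Report the keys that resources_allocation_function reads from `result`.
        train.report({
            "validation/loss": validation_loss,
            "peak_gpu_mem": torch.cuda.max_memory_allocated(device),
            "total_gpu_mem": total_gpu_mem,
        })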
optuna_search = OptunaSearch(
    metric="validation/loss",
    mode="min",
)

asha = ASHAScheduler(
    time_attr="training_iteration",
    max_t=config["EPOCHS"],
    metric="validation/loss",
    mode="min",
    grace_period=config["GRACE_PERIOD"],
    reduction_factor=config["REDUCTION_FACTOR"],
)

scheduler = ResourceChangingScheduler(
    base_scheduler=asha,
    resources_allocation_function=resources_allocation_function,
)

tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(trainable),
        resources={
            "cpu": config["CPUS_PER_TRIAL"],
            "gpu": config["GPUS_PER_TRIAL"],
        },
    ),
    tune_config=tune.TuneConfig(
        max_concurrent_trials=config["MAX_CONCURRENT"],
        search_alg=optuna_search,
        scheduler=scheduler,
        num_samples=config["NUM_EXPERIMENTS"],
        time_budget_s=3600 * 12,  # 12 hours
    ),
    run_config=RunConfig(
        storage_path=config["RAY_RESULTS_DIR"],
        failure_config=FailureConfig(max_failures=2),
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="validation/loss",
            checkpoint_score_order="min",
        ),
    ),
    param_space=config,
)
results = tuner.fit()
The reallocation works, but here comes the first problem:
All trials are now being paused (they have to be checkpointed and restarted, as I understood from the docs). I end up with about 90% of my started trials sitting in the paused state until the last new trial has been started; only then do the paused trials continue to run. I would like to prioritize resuming paused trials over starting new ones. Is there an option for this?
Now to my second question:
I have been using the following code to wrap my trainable in order to prevent OOM errors:
if config["DEVICE"] == "cuda":
    def tune_func(*args, **kwargs):
        tune.utils.wait_for_gpu(
            target_util=1 - config["GPUS_PER_TRIAL"]  # utilization threshold to reach to unblock
        )
        train_model(*args, **kwargs)

    trainable = tune_func
else:
    trainable = train_model
Is it possible to change target_util to something like "1 - trial._required_gpu_resources", i.e. based on the GPU share the trial currently has? Otherwise the GPU would not be fully packed with jobs once allocations shrink.
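Conceptually, what I am after is something like the sketch below. I am not sure whether tune.get_trial_resources() is the right (or supported) way to read the current allocation from inside a function trainable, or whether it already reflects the updated allocation at that point, so please treat this only as an illustration of the idea:

from ray import tune


def tune_func(*args, **kwargs):
    # Sketch only: read the GPU share currently assigned to this trial instead
    # of the static config["GPUS_PER_TRIAL"] value.
    trial_resources = tune.get_trial_resources()
    required_gpu = trial_resources.required_resources.get("GPU", config["GPUS_PER_TRIAL"])

    # Block until enough GPU memory is free for this trial's share.
    tune.utils.wait_for_gpu(target_util=1 - required_gpu)

    train_model(*args, **kwargs)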
Best
Johann