Prioritize paused trials over starting new ones

Dear Ray community,

I am tuning a model over hyperparameters such as batch size, number of layers, layer channels, and so on, so my trials have varying GPU memory requirements. To speed things up, I plan to start all trials with a logical GPU requirement of 0.4 and to monitor the peak memory usage of each run. When reporting to Ray, I also log these memory numbers. I adopted the ResourceChangingScheduler to check whether the memory actually used is lower than what has been allocated, and to shrink the allocation if so.
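For context, inside the trainable I report the memory numbers roughly like this (a simplified sketch; the metric names peak_gpu_mem and total_gpu_mem are the ones my allocation function below reads, and I measure memory via PyTorch's CUDA statistics):

import torch
from ray import train  # older Ray versions use tune.report / ray.air.session.report instead

def train_model(config):
    # ... build model, dataloaders, optimizer here ...
    for epoch in range(config["EPOCHS"]):
        torch.cuda.reset_peak_memory_stats()
        # ... train and validate for one epoch; validation_loss is just a placeholder here ...
        validation_loss = 0.0
        train.report({
            "validation/loss": validation_loss,
            "peak_gpu_mem": torch.cuda.max_memory_allocated(),  # peak bytes allocated by PyTorch in this process
            "total_gpu_mem": torch.cuda.get_device_properties(0).total_memory,  # total bytes on the card
        })

The resources_allocation_function that I pass to the ResourceChangingScheduler looks like this: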

from ray.tune.execution.placement_groups import PlacementGroupFactory


def resources_allocation_function(tune_controller, trial, result, scheduler):
    """
    Adjusts the resource allocation for a trial based on its GPU memory utilization.

    Args:
        tune_controller: Trial runner for this Tune run.
        trial: The trial to allocate new resources to.
        result: The latest result reported by the trial.
        scheduler: The scheduler calling the function.

    Returns:
        A new PlacementGroupFactory, or None to keep the current allocation.
    """
    if result["training_iteration"] < 1:
        return None

    base_resources = scheduler._base_trial_resources or PlacementGroupFactory([{"CPU": 1, "GPU": 0}])
    current_gpu = base_resources.required_resources.get("GPU", 0)
    peak_gpu_memory = result.get("peak_gpu_mem", 0)
    total_gpu_memory = result.get("total_gpu_mem", 1)

    gpu_utilization = peak_gpu_memory / total_gpu_memory
    if gpu_utilization >= 0.75 * current_gpu:
        print("Not changing GPU resource allocation")
        return None

    new_gpu_allocation = min(round(gpu_utilization * 1.2, 3), 1)
    print(f"Changing GPU allocation from {current_gpu} to {new_gpu_allocation}")
    return PlacementGroupFactory([{"CPU": base_resources.required_resources.get("CPU", 1), "GPU": new_gpu_allocation}])

from ray import tune
from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import ASHAScheduler, ResourceChangingScheduler
from ray.train import RunConfig, FailureConfig, CheckpointConfig  # from ray.air in older Ray versions

optuna_search = OptunaSearch(
    metric="validation/loss",
    mode="min"
)

asha = ASHAScheduler(
    time_attr="training_iteration",
    max_t=config["EPOCHS"],
    metric="validation/loss",
    mode="min",
    grace_period=config["GRACE_PERIOD"],
    reduction_factor=config["REDUCTION_FACTOR"]
)

scheduler = ResourceChangingScheduler(
    base_scheduler=asha,
    resources_allocation_function=resources_allocation_function,
)

tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(trainable),
        resources={
            "cpu": config["CPUS_PER_TRIAL"],
            "gpu": config["GPUS_PER_TRIAL"]
        }
    ),
    tune_config=tune.TuneConfig(
        max_concurrent_trials=config["MAX_CONCURRENT"],
        search_alg=optuna_search,
        scheduler=scheduler,
        num_samples=config["NUM_EXPERIMENTS"],
        time_budget_s=3600 * 12  # 12 hours
    ),
    run_config=RunConfig(
        storage_path=config["RAY_RESULTS_DIR"],
        failure_config=FailureConfig(max_failures=2),
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="validation/loss",
            checkpoint_score_order="min",
        )
    ),
    param_space=config,
)
results = tuner.fit()

The reallocation works, but here comes the first problem:
Whenever resources change, the trial is paused (it has to be checkpointed and restarted, as I understand from the docs). As a result, about 90% of my started trials sit paused until the very last trial has been started; only then do the paused trials continue to run. I would like to prioritize resuming paused trials over starting new ones. Is there an option to do this?
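To make it concrete, what I am hoping for is behaviour roughly like this sketch (untested on my side; I am only guessing that overriding choose_trial_to_run is the right extension point, and the class name is mine):

from ray.tune.experiment import Trial
from ray.tune.schedulers import ResourceChangingScheduler

class ResumeFirstScheduler(ResourceChangingScheduler):
    """Hypothetical wrapper: hand out paused trials before starting pending ones."""

    def choose_trial_to_run(self, tune_controller):
        # First try to resume a trial that was paused (e.g. after a resource change) ...
        for trial in tune_controller.get_trials():
            if trial.status == Trial.PAUSED:
                return trial
        # ... and only then fall back to the default choice, which starts new PENDING trials.
        return super().choose_trial_to_run(tune_controller)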

Now to my second question:
I used this code before to wrap my trainable in order to prevent OOM errors.

if config["DEVICE"] == "cuda":
        def tune_func(*args, **kwargs):
            tune.utils.wait_for_gpu(
                target_util=1-config["GPUS_PER_TRIAL"]  # utilization threshold to reach to unblock
            )
            train_model(*args, **kwargs)
        trainable = tune_func
    else:
        trainable=train_model

Is it possible to set target_util to something like "1 - trial._required_gpu_resources", i.e. based on the GPU fraction the trial currently requires rather than the static GPUS_PER_TRIAL value? Otherwise the GPU would not be fully filled with jobs.
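What I had in mind is roughly the following (untested; I am assuming here that tune.get_trial_resources() can be called inside the trainable and returns the trial's current PlacementGroupFactory, including any update made by the ResourceChangingScheduler):

if config["DEVICE"] == "cuda":
    def tune_func(*args, **kwargs):
        # Assumption: this reflects the resources actually granted to this trial.
        trial_resources = tune.get_trial_resources()
        gpu_fraction = trial_resources.required_resources.get("GPU", config["GPUS_PER_TRIAL"])
        tune.utils.wait_for_gpu(target_util=1 - gpu_fraction)
        train_model(*args, **kwargs)
    trainable = tune_func
else:
    trainable = train_model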

Best
Johann