Hi everyone… 
I’m currently using Ray Tune to optimize hyperparameters for a machine learning model, and I’m running into performance issues when scaling up the number of trials. I’d like to know:
- What’s the best way to balance resource allocation across trials?
- Are there specific schedulers or search algorithms that perform better for large-scale runs?
I checked this: https://discuss.ray.io/t/best-practices-to-run-multiple-models-in-multiple-gpus-in-rayll and an online DevOps course, but I have not found a solution. Could anyone guide me on this? I’m working on a cluster setup with limited CPU/GPU resources, so I’m trying to avoid unnecessary overhead. Any tips or real-world examples would be greatly appreciated!
Thanks in advance.
Respected community member! 
I know that situation. First, consider a solid checkpointing rhythm and mechanism: it gives you re-entry points to resume trials at a later point in time.
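For reference, a minimal sketch of that pattern with a Ray Tune function trainable, assuming the Ray 2.x `ray.train` checkpoint API; the state handling and `max_epochs` config key are placeholders for your own training code:

import tempfile

from ray import train


def trainable(config):
    state = {"epoch": 0}                    # placeholder for real model state
    checkpoint = train.get_checkpoint()     # re-entry point after interruption
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            ...                             # restore `state` from ckpt_dir here

    for epoch in range(state["epoch"], config["max_epochs"]):
        loss = 1.0 / (epoch + 1)            # stand-in for a real training step
        state["epoch"] = epoch + 1
        with tempfile.TemporaryDirectory() as tmp:
            ...                             # persist `state` into tmp here
            train.report(
                {"loss": loss},
                checkpoint=train.Checkpoint.from_directory(tmp),
            )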
My own issue was that I kept running into autoscaler overflows: the autoscaler reported that there were not enough resources to allocate for a trial, but digging into the details revealed that the per-trial resource demand was higher than expected.
Especially in RLlib, it is easy to underestimate how many CPUs are required by distributed rollout workers (old stack) or env runners (new stack).
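As a back-of-the-envelope illustration (plain Python with made-up numbers, not an RLlib API), the demand adds up quickly:

# Rough per-trial CPU demand for one PPO trial; example numbers only.
num_env_runners = 4        # rollout workers (old stack) / env runners (new stack)
cpus_per_env_runner = 1
concurrent_trials = 3

# driver + env runners + evaluation worker
cpus_per_trial = 1 + num_env_runners * cpus_per_env_runner + 1
total_demand = concurrent_trials * cpus_per_trial
print(f"{cpus_per_trial} CPUs per trial, {total_demand} CPUs for {concurrent_trials} concurrent trials")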
In the end, I wrote myself the following utility function, which distributes the available resources adequately across trials.
import math


def calculate_resources(args, verbose=False) -> tuple[float, int, int]:
    """
    Calculate the resources for the head node (i.e. the algorithm) and the workers
    (i.e. rollout workers and the evaluation worker).

    The resources are calculated based on the following logic:
    1. GPU resources for the head node (algorithm):
       - `split_gpu_trainer` is the total number of GPUs divided by the number of
         concurrent trials, rounded down to the nearest tenth.
    2. CPU resources for each worker:
       - `num_cpu_per_worker` is the total number of CPUs minus one (reserved for
         the head node), divided by the product of the number of workers and the
         number of concurrent trials.
    3. GPU resources for each worker:
       - If `split_gpu_trainer` is less than 1, `num_gpu_per_worker` is set to 0.
       - Otherwise, `num_gpu_per_worker` is the maximum of the total number of GPUs
         minus one and 1, divided by the number of workers, rounded down to the
         nearest integer.

    Args:
        args: A namespace object containing the following attributes:
            - num_gpus: Total number of GPUs available.
            - num_cpus: Total number of CPUs available.
            - concurrent_trials: Number of concurrent trials to run.
            - num_workers: Number of workers for each trial.
        verbose: Whether to print the calculated resources.

    Prints:
        - Planned number of concurrent trials and workers.
        - Calculated resource demand for GPUs and CPUs.
        - Total resources available (CPUs and GPUs).
    """
    # GPU share reserved for the trainer of each trial, floored to one decimal.
    split_gpu_trainer = math.floor((args.num_gpus / args.concurrent_trials) * 10) / 10
    # CPUs per rollout worker; one CPU stays reserved for the head node.
    num_cpu_per_worker = int(
        (args.num_cpus - 1) / (args.num_workers * args.concurrent_trials)
    )
    if split_gpu_trainer < 1:
        num_gpu_per_worker = 0
    else:
        num_gpu_per_worker = math.floor(max(args.num_gpus - 1, 1) / args.num_workers)
    if verbose:
        print(
            f"Planned to run {args.concurrent_trials} concurrent trials "
            f"with {args.num_workers} workers each."
        )
        print(
            f"Calculated resource demand: {args.concurrent_trials} * "
            f"({split_gpu_trainer} GPUs for PPO + ({args.num_workers} * "
            f"{num_cpu_per_worker} CPUs for rollout workers + 1 CPU for evaluation "
            f"worker) + ({args.num_workers} * {num_gpu_per_worker} GPUs))"
        )
        print(f"Total resources: {args.num_cpus} CPUs and {args.num_gpus} GPUs.\n")
    return split_gpu_trainer, num_cpu_per_worker, num_gpu_per_worker
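And here is a sketch of how the returned values can feed into an RLlib config and a Tune run. The exact `AlgorithmConfig` methods differ between Ray versions, so treat the `.rollouts()` / `.resources()` calls and the CartPole environment below as assumptions for an older 2.x stack rather than the one true API:

import argparse

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

# The attribute names mirror what calculate_resources() expects; numbers are examples.
args = argparse.Namespace(num_gpus=2, num_cpus=33, concurrent_trials=2, num_workers=4)
gpus_trainer, cpus_per_worker, gpus_per_worker = calculate_resources(args, verbose=True)

config = (
    PPOConfig()
    .environment("CartPole-v1")                       # example environment
    .rollouts(num_rollout_workers=args.num_workers)   # new stack: .env_runners(num_env_runners=...)
    .resources(
        num_gpus=gpus_trainer,
        num_cpus_per_worker=cpus_per_worker,
        num_gpus_per_worker=gpus_per_worker,
    )
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(max_concurrent_trials=args.concurrent_trials),
)
tuner.fit()

Capping max_concurrent_trials to what the split actually supports is what keeps the autoscaler from reporting unsatisfiable resource requests in my setup.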