Optimizing Ray Tune for Large-Scale Hyperparameter Search with High Resource Utilization

Hi everyone,

I’ve been experimenting with Ray Tune to perform hyperparameter optimization for a deep learning model. While it works well for small-scale tasks, I’m running into some issues when scaling up to larger workloads that involve many trials and significant resource consumption.

Here’s my current setup (with a simplified sketch of the Tune call after the list):

  • Framework: PyTorch
  • Cluster: 8 nodes, each with 16 cores, 64 GB RAM, and 1 GPU
  • Search Algorithm: Optuna + ASHA scheduler
  • Task: Training a model for ~50,000 trials with varying learning rates, batch sizes, and hidden layer dimensions
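
For concreteness, here's a simplified sketch of the Tune call (Ray 2.x API; the metric name, the search-space bounds, and the dummy training loop are placeholders standing in for my actual PyTorch code):

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch


def train_fn(config):
    # My real trainable builds a PyTorch model from config["hidden_dim"],
    # trains with config["lr"] / config["batch_size"], and reports a
    # validation metric each epoch. Dummy loop here so the sketch runs.
    for epoch in range(10):
        val_loss = config["lr"] / (epoch + 1)
        # Newer Ray exposes tune.report(dict); older releases use
        # ray.train.report(...) or ray.air.session.report(...) instead.
        tune.report({"val_loss": val_loss})


search_space = {
    "lr": tune.loguniform(1e-5, 1e-1),
    "batch_size": tune.choice([32, 64, 128, 256]),
    "hidden_dim": tune.choice([128, 256, 512, 1024]),
}

tuner = tune.Tuner(
    train_fn,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="val_loss",
        mode="min",
        search_alg=OptunaSearch(),
        scheduler=ASHAScheduler(max_t=10, grace_period=1),
        num_samples=50_000,
    ),
)
results = tuner.fit()
```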

Challenges I’m Facing:

  1. Resource Utilization: I expect near-constant high utilization, but CPU/GPU usage fluctuates and nodes sometimes sit idle. I’ve verified that my Ray cluster configuration is correct, yet there still seem to be inefficiencies. (The first sketch after this list shows how I’m currently allocating resources per trial.)
  2. Trial Scheduling Overhead: With a large number of concurrent trials, the scheduling overhead increases significantly. Are there specific configurations or scheduler parameters I should tweak to minimize this?
  3. Checkpointing: I’m saving checkpoints for each trial, but this becomes resource-intensive and sometimes causes delays. Any suggestions on optimizing checkpoint frequency or handling storage efficiently? (The second sketch after this list shows my current checkpoint config.)
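
For challenges 1 and 2, these are the knobs I’ve been experimenting with so far, continuing the sketch above (the fractional GPU share and the concurrency cap are just values I’ve tried, not recommendations):

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

# train_fn is the trainable from the first sketch above.
# Fractional GPU so several trials can pack onto each node's single GPU.
trainable = tune.with_resources(train_fn, {"cpu": 4, "gpu": 0.25})

tune_config = tune.TuneConfig(
    metric="val_loss",
    mode="min",
    search_alg=OptunaSearch(),
    scheduler=ASHAScheduler(max_t=10, grace_period=1),
    num_samples=50_000,
    max_concurrent_trials=64,  # cap in-flight trials to bound scheduler/searcher overhead
    reuse_actors=True,         # I believe this is already the default for function trainables
                               # in recent Ray; avoids per-trial actor startup cost
)

# Used as: tune.Tuner(trainable, param_space=search_space, tune_config=tune_config, ...)
```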
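
For challenge 3, this is roughly my current checkpoint configuration (the storage path is a placeholder; I’m assuming RunConfig/CheckpointConfig live under ray.train, which is where they are in the Ray 2.x releases I’ve used):

```python
from ray import train

run_config = train.RunConfig(
    storage_path="s3://my-bucket/tune-results",  # placeholder shared/cloud storage location
    checkpoint_config=train.CheckpointConfig(
        num_to_keep=2,                           # keep only the 2 best checkpoints per trial
        checkpoint_score_attribute="val_loss",
        checkpoint_score_order="min",
    ),
)

# Passed to the Tuner from the first sketch via:
# tune.Tuner(..., run_config=run_config)
```

Right now the trainable writes a checkpoint every epoch; I’m wondering whether checkpointing every N epochs, or only on metric improvement, is the usual approach at this scale.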

Goals:

I want to ensure:

  • Maximum resource utilization across nodes (CPU/GPU)
  • Efficient scheduling to reduce trial overhead
  • Scalable checkpointing for large-scale experiments

Has anyone faced similar issues or found effective strategies for optimizing Ray Tune under high workloads? I’d love to hear your experiences or advice on best practices, configuration tweaks, or alternative tools that complement Ray Tune.

Thanks in advance for your insights!