Hydra-Ray Launcher on SLURM Ray Cluster

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am working on a project where we are creating thousands of unique ML models, using Ray Tune for hyperparameter optimization and Hydra for configuration and code reuse. The code works perfectly on my machine, but all of my attempts to run it on a SLURM cluster have failed.
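For context, each model's tuning job looks roughly like this (a minimal sketch, not our real code; `train_fn`, the `lr` search space, and `cfg.num_samples` are placeholders, and it assumes a `conf/config.yaml` that defines `num_samples`):

```python
import hydra
from omegaconf import DictConfig
from ray import tune


def train_fn(config):
    # Placeholder objective; the real code trains one ML model.
    return {"loss": config["lr"] ** 2}


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    tuner = tune.Tuner(
        train_fn,
        param_space={"lr": tune.loguniform(1e-4, 1e-1)},
        tune_config=tune.TuneConfig(num_samples=cfg.num_samples),
    )
    tuner.fit()


if __name__ == "__main__":
    main()
```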

Creating a Ray cluster in my sbatch script without the Hydra Ray Launcher ends up running Ray Tune for the same model in parallel, which gives worse results than the same number of iterations run locally. This method also frequently runs out of space in the /tmp directory; we have even tried setting the Ray temp directory elsewhere, with the same result.
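For reference, the temp-dir change we tried looks like this when the cluster is started from Python (a sketch; the `/scratch/$USER/ray` path is an assumption, substitute a node-local directory with space, and with `ray start` in an sbatch script the equivalent flag is `--temp-dir`):

```python
import json
import os

import ray

scratch = os.path.expandvars("/scratch/$USER/ray")  # assumed scratch path

ray.init(
    # Session files (logs, sockets, object store) normally land under /tmp/ray.
    _temp_dir=scratch,
    # Object spilling also defaults under the temp dir; set it explicitly.
    # Note: _system_config only takes effect when this call starts the
    # cluster, not when attaching to one that is already running.
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": scratch}}
        )
    },
)
```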

Using the Hydra Ray Launcher also fills up the /tmp directory. However, it does solve the worse-results issue by running Ray Tune for different models in parallel, so it would be our preferred way forward. Unfortunately, this method fails when an unhandled error occurs during tuning, for a seemingly arbitrary model each time; that error sends a signal that kills the Ray cluster and the Hydra multirun. We do have error handling inside and around the tuning code in an attempt to keep the cluster and job running (sketched below).
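This is roughly the shape of the handling we have in place (a sketch, assuming Ray 2.x; `train_fn` and `cfg.model_name` are placeholders):

```python
import hydra
from omegaconf import DictConfig
from ray import train, tune


def train_fn(config):
    # Placeholder trainable; the real code tunes one model.
    return {"loss": 0.0}


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    tuner = tune.Tuner(
        train_fn,
        run_config=train.RunConfig(
            # Retry crashed trials instead of failing the whole run.
            failure_config=train.FailureConfig(max_failures=3),
        ),
    )
    try:
        results = tuner.fit()
        if results.errors:
            print(f"{cfg.model_name}: some trials failed; continuing")
    except Exception as exc:
        # Swallow the error so the sweep and the cluster stay alive.
        print(f"{cfg.model_name}: tuning failed entirely: {exc}")


if __name__ == "__main__":
    main()
```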

Another potential solution we have looked into is having our sbatch script launch a separate srun job for each combination of Hydra config files instead of using Hydra multirun (see the sketch below), but we haven’t had any luck with this either.
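To make that concrete, the per-combination approach looks something like this from inside the sbatch allocation (a sketch; the config groups, resource flags, and `train.py` entry point are placeholders):

```python
import itertools
import subprocess

models = ["model_a", "model_b"]        # stand-ins for our real config groups
datasets = ["dataset_1", "dataset_2"]

procs = []
for model, dataset in itertools.product(models, datasets):
    cmd = [
        "srun", "--exclusive", "--ntasks=1", "--cpus-per-task=8",
        "python", "train.py", f"model={model}", f"dataset={dataset}",
    ]
    procs.append(subprocess.Popen(cmd))

# Wait for every combination; a nonzero exit in one job does not kill the rest.
for proc in procs:
    proc.wait()
```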

Thanks!

Would you mind opening an issue on GitHub and cc'ing me (kevin85421)? Anyscale only maintains the open-source VM and K8s launchers (KubeRay). The SLURM one is contributed by the community. I can try to ping the contributors on the issue.