- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I am working on a project where we are creating thousands of unique ML models with Ray Tune for hyperparameter optimization and Hydra for configuration and code reuse. The code works perfectly on my machine, but all of my attempts to run it using a SLURM cluster have failed.
Creating a Ray cluster in my sbatch script without using the Hydra Ray Launcher ends up running Ray Tune for the same model in parallel, which produces worse results for the same number of iterations than running locally. This method also frequently runs out of space in the /tmp directory. We have even tried pointing the Ray temp directory elsewhere, with the same result.
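For reference, the temp-dir override we tried looks roughly like this (the scratch path is a placeholder for our cluster's layout):

```python
import ray

# Point Ray's session files at a scratch filesystem instead of /tmp.
# "/scratch/my_user/ray_tmp" is a placeholder for whatever large,
# node-local or shared path the cluster provides.
ray.init(_temp_dir="/scratch/my_user/ray_tmp")
```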
Using the Hydra Ray Launcher also fills up the /tmp directory. On the other hand, it does solve the worse-results issue by running Ray Tune for different models in parallel, so it would be the preferred way forward. However, this method fails when an unhandled error occurs during tuning for a seemingly arbitrary model each time; this sends a signal that kills the Ray cluster and the Hydra multirun. We do have error handling inside and around the tuning code in an attempt to keep the cluster and job running, roughly as sketched below.
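This is a simplified sketch of the guard around the tuning call (the config names, `train_model`, and the search space are placeholders for our actual setup). Despite the try/except, an unhandled error from an arbitrary model still brings down the cluster and the multirun:

```python
import hydra
from omegaconf import DictConfig
from ray import tune


def train_model(trial_config: dict, cfg: DictConfig) -> dict:
    # Placeholder trainable; the real per-model training loop lives here.
    return {"loss": 0.0}


@hydra.main(config_path="conf", config_name="config", version_base=None)
def run_tuning(cfg: DictConfig) -> None:
    # Catch per-model failures so a single bad model does not take down
    # the Ray cluster or the rest of the Hydra multirun.
    try:
        tuner = tune.Tuner(
            tune.with_parameters(train_model, cfg=cfg),
            param_space={"lr": tune.loguniform(1e-4, 1e-1)},
            tune_config=tune.TuneConfig(num_samples=50),
        )
        tuner.fit()
    except Exception as exc:
        print(f"Tuning failed for this model, skipping: {exc}")


if __name__ == "__main__":
    run_tuning()
```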
Another potential solution we have looked into is having our sbatch script launch a separate srun job for each combination of Hydra config files instead of using Hydra multirun (roughly as in the sketch below), but we haven't had any luck with this either.
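The per-combination launcher we experimented with was along these lines (the config groups, resource flags, and script name are placeholders):

```python
import itertools
import subprocess

# Hypothetical config groups we sweep over; the real project has many more.
models = ["resnet", "transformer"]
datasets = ["site_a", "site_b"]

for model, dataset in itertools.product(models, datasets):
    # One srun job per Hydra config combination, instead of one multirun.
    cmd = [
        "srun",
        "--ntasks=1",
        "--cpus-per-task=8",
        "--gres=gpu:1",
        "python", "train.py",
        f"model={model}",
        f"dataset={dataset}",
    ]
    subprocess.Popen(cmd)  # launch without blocking on completion
```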
Thanks!