Have workers quit after one Tune trial or stop accepting new trials after a certain time (workaround for SLURM submission)

TL;DR: Looking for ways to deal with the fixed maximum run time of SLURM batch jobs when running Ray Tune.

Context: I run Ray Tune on a SLURM cluster. I have read the guide in the Ray docs but follow a slightly different strategy: I run the Ray head on my head node and only run the workers on batch nodes (demo repository here).

The problem: While this works great (and is preferable to what is currently recommended in the guide, IMO), I still face the following problem: all batch submissions must have a fixed maximum run time, say 2h, which means my workers get killed 2h after they come up. My current ML trials run for up to 1h, depending on early stopping. As a result, many of the most successful trials get killed partway through because their worker is terminated.

Workarounds: Can I have a worker accept only one trial and then quit? Or, even better, can I have a worker stop accepting new trials and quit after a certain time? In that case, I could tune this “early quitting” so that the last trial a worker accepts still has enough time to finish.
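
To illustrate what I mean by “early quitting”, here is a rough sketch of the condition I would like the worker to apply (purely hypothetical: `WORKER_START_TS` is a name I made up that the sbatch script would have to export itself, and the constants just reflect my 2h walltime / 1h trials):

```python
import os
import time

# Assumption: the sbatch script runs `export WORKER_START_TS=$(date +%s)`
# right before launching the Ray worker, and the job has a 2 h walltime.
WALLTIME_SECONDS = 2 * 60 * 60
MAX_TRIAL_SECONDS = 1 * 60 * 60  # trials need up to ~1 h


def seconds_until_walltime() -> float:
    """Time left before SLURM kills this worker's batch job."""
    start = float(os.environ["WORKER_START_TS"])
    return WALLTIME_SECONDS - (time.time() - start)


def enough_time_for_another_trial() -> bool:
    """The 'early quitting' condition: only start a new trial if it can
    plausibly finish before the walltime is reached."""
    return seconds_until_walltime() > MAX_TRIAL_SECONDS
```

What I am missing is the hook: ideally the worker would stop advertising its resources (or shut down cleanly) once this condition turns false, instead of picking up a trial it cannot finish. That is the part I don't know how to do.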

Or am I thinking in the wrong direction here? In an ideal world, I would like to get very close to the time limit, then checkpoint and have another worker node take over the training later.
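
To make that “ideal world” a bit more concrete, this is roughly the pattern I imagine, written against Tune's class-based Trainable API (a minimal sketch: the toy training loop, `checkpoint_freq=5` and `max_failures=3` are placeholder values, and it presumes the checkpoints end up somewhere reachable from whichever node later picks the trial up, e.g. a shared filesystem):

```python
import json
import os

from ray import tune


class MyTrainable(tune.Trainable):
    """Toy trainable that checkpoints regularly, so a trial whose SLURM
    worker hits the walltime can be resumed on another node."""

    def setup(self, config):
        self.lr = config["lr"]
        self.iteration_count = 0
        self.loss = float("inf")

    def step(self):
        # Placeholder for one unit of real training work.
        self.iteration_count += 1
        self.loss = self.lr / self.iteration_count
        return {"loss": self.loss}

    def save_checkpoint(self, checkpoint_dir):
        # Persist everything needed to resume this trial elsewhere.
        with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
            json.dump({"iteration_count": self.iteration_count, "loss": self.loss}, f)
        return checkpoint_dir

    def load_checkpoint(self, checkpoint_dir):
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            state = json.load(f)
        self.iteration_count = state["iteration_count"]
        self.loss = state["loss"]


analysis = tune.run(
    MyTrainable,
    config={"lr": tune.grid_search([0.01, 0.1])},
    stop={"training_iteration": 100},
    checkpoint_freq=5,  # write a checkpoint every 5 iterations
    max_failures=3,     # retry a trial (from its last checkpoint) if its worker dies
)
```

With `max_failures` set, my understanding is that Tune treats the killed worker as a trial failure and reschedules the trial, restoring it from the last checkpoint. That is basically the hand-over I am after, just with up to `checkpoint_freq` iterations of lost work per kill.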

PS: Is Ray Clusters the correct topic? Or should I move it to Ray AIR?

A different avenue that might be easier to implement (and has other advantages) would be “soft stopping” of experiments: all currently running trials are allowed to finish, but no new trials are enqueued. I’ve opened a separate post for that question.
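
For completeness, the closest existing thing I'm aware of is Tune's global time budget, but as far as I understand it that is a hard stop that also terminates trials which are still running, so it is not the soft stop described above (sketch reuses the trainable from the earlier example):

```python
from ray import tune

analysis = tune.run(
    MyTrainable,             # trainable from the sketch above
    config={"lr": tune.grid_search([0.01, 0.1])},
    time_budget_s=90 * 60,   # stop the whole experiment after 90 minutes
    checkpoint_freq=5,
    max_failures=3,
)
```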