Have workers quit after one Tune trial or stop accepting new trials after a certain time (workaround for SLURM submission)

TL;DR: Looking for ways to deal with the fixed maximum run time of SLURM batch jobs when running Ray Tune.

Context: I run Ray Tune on a SLURM cluster. I have read the guide in the Ray docs but follow a slightly different strategy: I run the Ray head on my head node and only run the workers on batch nodes (demo repository here).

The problem: While this works great (and is preferable to what is currently recommended in the guide, IMO), I still face the following problem: all batch submissions must have a fixed maximum run time, say 2h, which means my workers get killed 2h after they come up. My current ML trials run for up to 1h, depending on early stopping. As a result, many of the most successful trials get killed partway through because their worker is terminated.

Workarounds: Can I have a worker accept only one trial and then quit? Or, even better, can I have a worker stop accepting new trials and quit after a certain time? In that case, I could tune this “early quitting” so that the last trial a worker accepts still has enough time to finish.
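
To illustrate what I mean by “early quitting”, here is a rough sketch of the condition I would like the worker to apply (purely hypothetical: `WORKER_START_TS` is a name I made up that the sbatch script would have to export itself, and the constants just reflect my 2h walltime / 1h trials):

```python
import os
import time

# Assumption: the sbatch script runs `export WORKER_START_TS=$(date +%s)`
# right before launching the Ray worker, and the job has a 2 h walltime.
WALLTIME_SECONDS = 2 * 60 * 60
MAX_TRIAL_SECONDS = 1 * 60 * 60  # trials need up to ~1 h


def seconds_until_walltime() -> float:
    """Time left before SLURM kills this worker's batch job."""
    start = float(os.environ["WORKER_START_TS"])
    return WALLTIME_SECONDS - (time.time() - start)


def enough_time_for_another_trial() -> bool:
    """The 'early quitting' condition: only start a new trial if it can
    plausibly finish before the walltime is reached."""
    return seconds_until_walltime() > MAX_TRIAL_SECONDS
```

What I am missing is the hook: ideally the worker would stop advertising its resources (or shut down cleanly) once this condition turns false, instead of picking up a trial it cannot finish. That is the part I don't know how to do.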

Or am I thinking in the wrong direction here? In an ideal world, I would like to get very close to the time limit, then checkpoint and have another worker node take over the training later.
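
To make that “ideal world” a bit more concrete, this is roughly the pattern I imagine, written against Tune's class-based Trainable API (a minimal sketch: the toy training loop, `checkpoint_freq=5` and `max_failures=3` are placeholder values, and it presumes the checkpoints end up somewhere reachable from whichever node later picks the trial up, e.g. a shared filesystem):

```python
import json
import os

from ray import tune


class MyTrainable(tune.Trainable):
    """Toy trainable that checkpoints regularly, so a trial whose SLURM
    worker hits the walltime can be resumed on another node."""

    def setup(self, config):
        self.lr = config["lr"]
        self.iteration_count = 0
        self.loss = float("inf")

    def step(self):
        # Placeholder for one unit of real training work.
        self.iteration_count += 1
        self.loss = self.lr / self.iteration_count
        return {"loss": self.loss}

    def save_checkpoint(self, checkpoint_dir):
        # Persist everything needed to resume this trial elsewhere.
        with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
            json.dump({"iteration_count": self.iteration_count, "loss": self.loss}, f)
        return checkpoint_dir

    def load_checkpoint(self, checkpoint_dir):
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            state = json.load(f)
        self.iteration_count = state["iteration_count"]
        self.loss = state["loss"]


analysis = tune.run(
    MyTrainable,
    config={"lr": tune.grid_search([0.01, 0.1])},
    stop={"training_iteration": 100},
    checkpoint_freq=5,  # write a checkpoint every 5 iterations
    max_failures=3,     # retry a trial (from its last checkpoint) if its worker dies
)
```

With `max_failures` set, my understanding is that Tune treats the killed worker as a trial failure and reschedules the trial, restoring it from the last checkpoint. That is basically the hand-over I am after, just with up to `checkpoint_freq` iterations of lost work per kill.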

PS: Is Ray Clusters the correct topic? Or should I move it to Ray AIR?

A different avenue that might be easier to implement (and has other advantages) would be “soft stopping” of experiments: all currently running trials are allowed to finish, but no new trials are enqueued. I’ve opened a separate post for that question.
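
For completeness, the closest existing thing I'm aware of is Tune's global time budget, but as far as I understand it that is a hard stop that also terminates trials which are still running, so it is not the soft stop described above (sketch reuses the trainable from the earlier example):

```python
from ray import tune

analysis = tune.run(
    MyTrainable,             # trainable from the sketch above
    config={"lr": tune.grid_search([0.01, 0.1])},
    time_budget_s=90 * 60,   # stop the whole experiment after 90 minutes
    checkpoint_freq=5,
    max_failures=3,
)
```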