The HPC cluster I work on recently introduced per-task time limits.
Previously, we could set up a Ray cluster with a head node and a couple of workers and keep them running for as long as we wanted.
Now that more people are using the cluster, every task is subject to a time limit.
To accommodate this, we thought it would be most efficient to boot up a head node and then queue a bunch of workers such that each would:
- connect to the head node
- run a single trial
- report the results
- stop gracefully, so that the cluster considers the worker job finished (the cluster uses Kubernetes)
Our first, naive approach was to call ray.report to get the results to the head node and then call os.system("ray stop --force") to make that worker stop and disconnect from the head node.
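For reference, here is a rough sketch of that naive flow, not our exact code. The names train_fn, report_fn, and run_cmd are hypothetical placeholders: the trial body and the reporting call are passed in as plain callables so the sketch does not depend on any particular Ray API, and subprocess.run stands in for os.system.

```python
import subprocess
from typing import Any, Callable, Dict

# The CLI command we used to tear down the local Ray processes on the worker.
STOP_COMMAND = ["ray", "stop", "--force"]


def run_trial_then_stop(
    train_fn: Callable[[], Dict[str, Any]],
    report_fn: Callable[[Dict[str, Any]], None],
    run_cmd: Callable[..., Any] = subprocess.run,
) -> Dict[str, Any]:
    """Run one trial, report its metrics, then force-stop Ray on this worker.

    train_fn, report_fn and run_cmd are hypothetical placeholders:
    - train_fn runs the trial and returns a metrics dict
    - report_fn sends the metrics to the head node
    - run_cmd executes the stop command (subprocess.run by default)
    """
    metrics = train_fn()
    report_fn(metrics)
    # Stopping Ray from inside the trial is the step that causes the
    # problems described in this post.
    run_cmd(STOP_COMMAND, check=False)
    return metrics
```

Since the stop command runs while the trial is still executing on that worker, the worker goes away before the trial has returned normally.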
The issues with this approach are the following:
- the Ray search on the head node considers the trial failed and gets stuck
- the worker’s Kubernetes Job errors instead of exiting gracefully
Is there any way to implement this workflow with Ray Tune?
Any advice will be greatly appreciated!