1 trial per worker

wolfdewulf · December 20, 2023, 4:18pm

The HPC cluster I work on recently introduced a time limit per task.
Previously we could set up a Ray cluster with a head and a couple of workers and keep them running for as long as we wanted.
Now that more people are using the cluster, a time limit has been put in place.

To accommodate this, we thought it would be most efficient to boot up a head node and then queue a bunch of workers, each of which would:

connect to the head node
run a single trial
report the results
stop gracefully such that the cluster considers the worker job as finished (the cluster uses Kubernetes)
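As a rough sketch of the first step only, a worker Job's entrypoint could be a small Python wrapper around `ray start`. The helper names and the example address are hypothetical; `--address` and `--block` are existing `ray start` options. How the worker would know that its single trial has finished (steps two to four) is exactly the open question, so the sketch leaves that part out.

```python
import subprocess


def worker_start_command(head_address):
    """Build the argv that joins this node to an existing Ray head.

    --block keeps `ray start` in the foreground, so the container's
    lifetime is tied to the Ray worker processes.
    """
    return ["ray", "start", f"--address={head_address}", "--block"]


def run_worker(head_address):
    """Run the worker and return its exit code (hypothetical helper)."""
    result = subprocess.run(worker_start_command(head_address))
    # Kubernetes only treats the Job's pod as succeeded when the
    # container exits with code 0, so this value is what matters.
    return result.returncode


# Example (not executed here): sys.exit(run_worker("ray-head:6379"))
# "ray-head:6379" is a made-up address; 6379 is Ray's default head port.
```

Returning the exit code explicitly makes it easy to see what the Job controller will observe when the worker goes away.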

Our first, naive approach was to call ray.report to get the results to the head node and then call os.system("ray stop --force") to make that worker stop and disconnect from the head node.
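A minimal sketch of what this approach amounts to, not the exact code. It assumes that "ray.report" refers to `ray.train.report` (the reporting call for function trainables in Ray 2.7 and later), and `evaluate` is a hypothetical stand-in for the real training code.

```python
import os

STOP_COMMAND = "ray stop --force"  # the command quoted above


def evaluate(config):
    """Hypothetical objective standing in for the real training code."""
    return (config["x"] - 3.0) ** 2


def trainable(config):
    # Imported lazily so this module can be loaded without Ray installed.
    from ray import train

    loss = evaluate(config)
    # Assumption: this is the call meant by "ray.report".
    train.report({"loss": loss})
    # Stops the Ray processes on this node. With --force they receive
    # SIGKILL, and the process running this function is one of them,
    # so the function does not return normally.
    os.system(STOP_COMMAND)
```

Since the trial's own process is killed before `trainable` returns, the head never sees the trial complete cleanly, only the report followed by a lost worker.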

The issues with this approach are the following:

the Ray Tune search on the head node considers the trial as failed and gets stuck
the worker’s Kubernetes Job errors instead of exiting gracefully

Is there any way of implementing this workflow with Ray Tune?
Any advice will be greatly appreciated!
