Is there a way to stop or delete the head node once the job is done?

M_S · June 2, 2022, 12:50pm

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

Hello everyone,

my understanding of ray clusters is that once I have started one with ray up cluster.yaml, the only ways to shut it down are to run ray down cluster.yaml or to remove the resources somehow outside of ray.
Is there a way to shut down a cluster once a job is done (in the sense that the VM is deleted/deallocated)?
My use case is that I will occasionally have jobs that run for a very long time. The cluster should only exist for the runtime of the job and then remove itself. One option would be to just remove the head node if it is idle for too long; however, I could only find that option for the workers.

Is that possible in any way? Ideally just with ray, but if there is an option e.g. with Kubernetes that would also be okay.

Thank you!

Alex · June 2, 2022, 5:30pm

If you’re looking for a quick and dirty solution, you can use ray submit --stop to run your job. That being said, the issue is that if the runner of that command dies, it won’t stop the cluster.
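As a rough illustration of the same idea, here is a minimal Python sketch of a wrapper that submits the job and then tears the cluster down. It is a variation rather than the `--stop` flag itself: it calls `ray down -y` in a `finally` block so the teardown is attempted even if the submit step fails. The function name `run_job_then_teardown`, the injectable `runner` parameter, and the file names are assumptions made for the example. It has the same weakness described above: if the wrapper process itself is killed, nothing tears the cluster down.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical type alias: anything that runs a command and returns an exit code.
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command with subprocess and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_job_then_teardown(cluster_yaml: str, script: str, runner: Runner = _run) -> int:
    """Submit `script` to the cluster, then always attempt to tear the cluster down.

    Returns the exit code of the submit step.
    """
    try:
        # Upload and run the script on the already running cluster.
        return runner(["ray", "submit", cluster_yaml, script])
    finally:
        # -y skips the interactive confirmation prompt of `ray down`.
        runner(["ray", "down", "-y", cluster_yaml])
```

Calling `run_job_then_teardown("cluster.yaml", "job.py")` after `ray up cluster.yaml` would run the two CLI commands in order; the `runner` argument only exists so the sequence can be checked without a real cluster.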

The battle-hardened recommendation is to handle this with a workflow engine of your choice (e.g. Airflow).

In the Kubernetes world, you could run your ray job submitter in a k8s job and have an operator tear down the cluster when the k8s job completes successfully.

M_S · June 2, 2022, 6:02pm

Hi @Alex,

thanks for the quick reply.

Is this something that will come at some point?
I would find it really useful if ray internally had some way to fully tear down its cluster at the end of a script.

Thanks!

Alex · June 6, 2022, 5:41pm

In the medium/long term, KubeRay will make it easier, but I don’t know the timeline for a feature that provides truly HA job submission.

@eoakes @shrekris may have a better idea?

shrekris · June 6, 2022, 5:53pm

@M_S, as @Alex mentioned, we’re focusing on making HA easier to accomplish with KubeRay. We haven’t estimated a timeline for out-of-the-box HA, but these soon-to-come KubeRay improvements will be a step towards that.

cade · June 15, 2022, 10:06pm

Hi @M_S, thanks for the question! I’ll mark this question as resolved. Feel free to respond or open a new question if you have more questions.
