Auto Termination feature

  • Low: It annoys or frustrates me for a moment.

Even though it’s marked as Low priority, the overall impact is huge once you look at the costs.

As the title suggests, I am looking for an Auto Termination feature where a Ray Cluster shuts down after ‘N’ minutes of no activity.

Although the cluster autoscaler works, it does not solve these two issues:

  1. The head node continues to run, and it’s a bigger machine (4xlarge or higher) for our workloads.
  2. Some workloads that require GPUs must maintain a fixed set of nodes, which requires min and max to be the same for the instance count. In this scenario, the autoscaler will not scale the nodes down even when the tasks are complete.
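
To make the request concrete, the behavior I have in mind is roughly the following. This is only a minimal sketch, not an existing Ray feature: `IdleWatchdog`, `is_busy`, and `shutdown` are hypothetical names I made up for illustration, and how "activity" is detected and how the cluster is actually torn down are left as assumptions supplied by the caller.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Hypothetical helper: call `shutdown` once `is_busy` has
    reported no activity for `idle_timeout_s` seconds."""

    def __init__(
        self,
        is_busy: Callable[[], bool],      # assumption: caller decides what "activity" means
        shutdown: Callable[[], None],     # assumption: caller decides how to tear the cluster down
        idle_timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._is_busy = is_busy
        self._shutdown = shutdown
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_active = clock()       # treat start-up as the last moment of activity
        self.terminated = False

    def poll(self) -> bool:
        """Check activity once. Returns True if the cluster has been terminated."""
        if self.terminated:
            return True
        now = self._clock()
        if self._is_busy():
            # Any activity resets the idle timer.
            self._last_active = now
        elif now - self._last_active >= self._idle_timeout_s:
            # Idle for at least N seconds: shut everything down, head node included.
            self._shutdown()
            self.terminated = True
        return self.terminated
```

In practice I would expect something like this to be polled periodically, with `is_busy` perhaps comparing `ray.cluster_resources()` against `ray.available_resources()` and `shutdown` invoking something like `ray down` on the cluster config, but that wiring is my guess at a workaround rather than a recommended approach.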

I am curious to know whether this is supported, is on the roadmap, or has a suggested workaround somewhere.

Thanks for reading,