Create a GPU node only for training, then destroy it

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Training requires a lot of compute and needs GPUs, but if the GPU nodes stay alive for long they cost too much. Is there any method in Ray AIR to create a GPU node only for training, and have the node destroyed after training is done? If that is possible, a lot of money can be saved.

Below is my tensorflow trainer code.

Can you explain a bit about your workload? It seems like training is just one component of it. Generally speaking, you can configure your cluster.yaml so the GPU node count automatically increases for training and decreases when training is done (what we call autoscaling). Does this meet your needs?
A word of caution: as you might imagine, autoscaling may take some time to kick in and to scale back down.
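
To make this concrete, here is a minimal cluster.yaml sketch of the pattern described above. The instance types, region, and node-type names are illustrative assumptions, not taken from the original thread; the key idea is `min_workers: 0` on the GPU node type, so GPU nodes only exist while a task or actor is requesting GPUs, and `idle_timeout_minutes`, which controls how quickly idle nodes are terminated afterwards.

```yaml
# Illustrative sketch only: instance types, region, and node-type
# names are assumptions for this example.
cluster_name: training-cluster
max_workers: 4

# How long a node may sit idle before the autoscaler terminates it.
idle_timeout_minutes: 5

provider:
  type: aws
  region: us-west-2

available_node_types:
  cpu_head:
    node_config:
      InstanceType: m5.xlarge
    resources: {"CPU": 4}
    min_workers: 0
    max_workers: 0
  gpu_worker:
    node_config:
      InstanceType: g4dn.xlarge
    resources: {"CPU": 4, "GPU": 1}
    # min_workers: 0 means no GPU node exists until something in the
    # workload requests a GPU; the autoscaler then launches a node and
    # tears it down again after idle_timeout_minutes of inactivity.
    min_workers: 0
    max_workers: 3

head_node_type: cpu_head
```

With a config like this, submitting the trainer with `use_gpu=True` creates the GPU demand that triggers scale-up, and the nodes are reclaimed automatically once training finishes and the idle timeout elapses.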