RayJob enableInTreeAutoscaling crash loop

Scott_Zelenka · June 6, 2023, 8:53pm

I'm attempting to configure a RayJob via its CRD. Without enableInTreeAutoscaling=true the job runs, but the cluster does not scale up based on the workload the way a regular RayCluster does.

The RayJob CRD allows spec.rayClusterSpec.enableInTreeAutoscaling, but when I deploy a RayJob with this set, the operator seems to get stuck in a loop: it detects a change, deletes workers until only 1 replica is left, the autoscaler sidecar then attempts to scale back up, and the cycle repeats.

What is the proper way to configure a RayJob CRD so that the RayCluster it creates is permitted to scale up?

Testing with Ray 2.4.0 on Python 3.9, using the official Docker image, on K8s 1.24.
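
For concreteness, here is a rough sketch of the kind of manifest described above, built as a Python dict and printed as JSON (which kubectl accepts in place of YAML). This only illustrates the setup in question and is not a confirmed fix for the loop. The job name, entrypoint, image tag, group name, replica bounds, and the build_rayjob_manifest helper are all made up for illustration, and the apiVersion is an assumption that depends on the installed KubeRay version.

```python
import json


def pod_template(container_name, image="rayproject/ray:2.4.0-py39"):
    """Minimal pod template; the image tag is an assumed example."""
    return {"spec": {"containers": [{"name": container_name, "image": image}]}}


def build_rayjob_manifest(min_replicas=1, max_replicas=5):
    """Hypothetical helper: a RayJob manifest with in-tree autoscaling enabled."""
    return {
        # Assumption: v1alpha1 CRDs; newer KubeRay releases may use a different version.
        "apiVersion": "ray.io/v1alpha1",
        "kind": "RayJob",
        "metadata": {"name": "example-rayjob"},  # illustrative name
        "spec": {
            "entrypoint": "python /home/ray/job.py",  # illustrative entrypoint
            "rayClusterSpec": {
                "rayVersion": "2.4.0",
                # The field discussed in this thread.
                "enableInTreeAutoscaling": True,
                "headGroupSpec": {
                    "rayStartParams": {"dashboard-host": "0.0.0.0"},
                    "template": pod_template("ray-head"),
                },
                "workerGroupSpecs": [
                    {
                        "groupName": "workers",  # illustrative group name
                        # Start at the lower bound; the autoscaler is expected
                        # to adjust replicas between the two bounds.
                        "replicas": min_replicas,
                        "minReplicas": min_replicas,
                        "maxReplicas": max_replicas,
                        "rayStartParams": {},
                        "template": pod_template("ray-worker"),
                    }
                ],
            },
        },
    }


manifest = build_rayjob_manifest()
print(json.dumps(manifest, indent=2))
```

The output can be saved to a file and applied with kubectl apply -f. With a manifest along these lines, the behaviour I see is the operator resetting the worker count while the autoscaler sidecar tries to raise it.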

eschneeweiss · August 27, 2024, 1:26pm

I am dealing with a similar issue. Is there any answer on this?
