I am deploying a service using Ray Serve in my EKS cluster, and there are six different Ray services running in the cluster.
I am seeing that a Ray Serve head pod sometimes goes into the ContainerStatusUnknown state. It only happens to one or two pods at a time; the other Ray Serve head pods keep running fine.
The other thing worrying me is why KubeRay is not re-creating the pod when the service has been unable to run for a long time. Below are the logs from KubeRay:
{"level":"info","ts":"2024-09-27T05:42:28.322Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"roberta-bert","namespace":"ray-serve"},"reconcileID":"38c2c064-6481-4d9e-8bf2-1e6acb0a4b22","RayCluster name":"roberta-bert-raycluster-mpdpm"}
{"level":"info","ts":"2024-09-27T05:42:28.322Z","logger":"controllers.RayService","msg":"Skipping the update of Serve deployments because the Ray head Pod is not ready.","RayService":{"name":"roberta-bert","namespace":"ray-serve"},"reconcileID":"38c2c064-6481-4d9e-8bf2-1e6acb0a4b22"}
Ray Serve version: 2.35
KubeRay version: 1.1.0
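For reference, this is the kind of inspection I can run when a head pod is in this state (pod name below is a placeholder). ContainerStatusUnknown often means the kubelet lost track of the container, e.g. after a node-level event such as an eviction, so the node events and the pod's last recorded state are what I would expect to be relevant:

```shell
# Placeholder pod name; substitute the affected head pod.
# Show the last recorded container state and termination reason:
kubectl describe pod roberta-bert-raycluster-mpdpm-head -n ray-serve

# Recent events in the namespace (evictions, failed probes, kills):
kubectl get events -n ray-serve --sort-by=.lastTimestamp | tail -n 20

# Check whether the node the pod was scheduled on reported pressure:
kubectl describe node <node-name> | grep -A5 Conditions
```

Happy to attach the output of any of these if it helps diagnose why KubeRay keeps the pod around instead of replacing it.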