Ray head pods are getting stuck in ContainerStatusUnknown state

I am deploying a service using Ray Serve in my EKS cluster, and there are 6 different Ray services running in the cluster.

I am seeing that sometimes a Ray Serve head pod goes into the ContainerStatusUnknown state, and it only happens to one or two pods at a time; the other Ray Serve head pods keep running fine.

Another thing worrying me is why KubeRay is not re-creating the pod when the service has been unable to run for a long time. Below are the logs from KubeRay:
{"level":"info","ts":"2024-09-27T05:42:28.322Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"roberta-bert","namespace":"ray-serve"},"reconcileID":"38c2c064-6481-4d9e-8bf2-1e6acb0a4b22","RayCluster name":"roberta-bert-raycluster-mpdpm"}
{"level":"info","ts":"2024-09-27T05:42:28.322Z","logger":"controllers.RayService","msg":"Skipping the update of Serve deployments because the Ray head Pod is not ready.","RayService":{"name":"roberta-bert","namespace":"ray-serve"},"reconcileID":"38c2c064-6481-4d9e-8bf2-1e6acb0a4b22"}

Ray Serve version: 2.35
KubeRay version: 1.1.0

Hi @Ritesh_K, searching for ContainerStatusUnknown online shows that it is often connected with running out of ephemeral storage or OOM issues. Could you check whether that is the case here by getting the output of kubectl describe pod for the stuck head pod and pasting it here?
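For example, something along these lines should surface the relevant details (the pod name is a placeholder; substitute the actual stuck head pod):

```bash
# Full status, recent events, and last termination reason for the stuck head pod
kubectl describe pod <head-pod-name> -n ray-serve

# The last container state often shows the root cause, e.g. Evicted
# (ephemeral storage) or OOMKilled (memory)
kubectl get pod <head-pod-name> -n ray-serve \
  -o jsonpath='{.status.containerStatuses[*].lastState}'

# Recent namespace events, e.g. evictions due to node disk pressure
kubectl get events -n ray-serve --sort-by=.lastTimestamp
```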