Help debugging blocked serve deployment

I am running Ray Serve on an autoscaling Ray cluster deployed on Kubernetes. I am trying to debug an issue where the deployment suddenly becomes blocked after successfully serving requests for some time.

Here is a simplified snippet of my setup -

        "max_concurrency": 1,
        "memory": 30 * 1024 * 1024 * 1024
        "target_num_ongoing_requests_per_replica": 1,
        "max_replicas": 10,
        "min_replicas": 3
class Model:

    def predict(self):
      # preprocessing, inference and post-processing
      return "y"

app = FastAPI()

results = {}

@serve.deployment
@serve.ingress(app)
class API:

    def __init__(self):
        self._debug = True

    @app.post("/")
    async def predict(self):
        print("request start")
        task_id = str(uuid.uuid4())
        handle = Model.get_handle(sync=False)
        coro = handle.predict.remote()

        async def wait(coro, task_id):
            # resolve the handle call, then the resulting object ref
            ref = await coro
            pred = await ref
            results[task_id] = pred

        async def debug():
            # heartbeat to verify the event loop stays alive
            while True:
                print("event loop alive")
                await asyncio.sleep(10)

        loop = asyncio.get_running_loop()
        if self._debug:
            self._debug = False
            loop.create_task(debug())
        loop.create_task(wait(coro, task_id))
        return task_id

    @app.get("/{task_id}")
    async def get(self, task_id: str):
        return results[task_id]

The Model deployment's computation is quite expensive, taking about one minute per request. So the predict method in the API returns a task ID to the caller immediately and waits for the result from the model in a background task.
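The fire-and-forget pattern can be reduced to a plain-asyncio sketch, independent of Ray (here `slow_predict`, `demo`, and the short sleep timings are illustrative stand-ins, not part of my actual setup):

```python
import asyncio
import uuid

results = {}

async def slow_predict():
    # stand-in for the ~1 minute model call
    await asyncio.sleep(0.01)
    return "y"

async def predict():
    # return a task id to the caller immediately; a background task
    # fills in results[task_id] once the slow call finishes
    task_id = str(uuid.uuid4())

    async def wait(task_id):
        results[task_id] = await slow_predict()

    asyncio.get_running_loop().create_task(wait(task_id))
    return task_id

async def demo():
    task_id = await predict()   # returns before the work is done
    await asyncio.sleep(0.05)   # give the background task time to finish
    return results[task_id]

print(asyncio.run(demo()))  # prints "y"
```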

This works well initially. But under a sustained load of about 3 predict requests per minute, the API stops receiving any requests after roughly 30-60 minutes. Further requests to both the predict and get endpoints of the API hang.

I have verified that the event loop in the API actor is not blocked when this happens: the debug coroutine keeps running. If I redeploy the API deployment at this point, it starts receiving requests again, so the problem is definitely somewhere within Ray. I have taken a look at the logs of the HttpProxyActor and the other Ray core worker logs, but did not find anything relevant. How can I troubleshoot this further?

I have further observed in my trials that if autoscaling of the Model deployment is turned off, or if I lower max_replicas to 4 to limit the autoscaling, the problem does not occur.

I wonder if this is an issue with Serve trying to provision more replicas than possible. Could you try setting num_replicas to 10 and removing autoscaling? Is your setup able to provision all 10 replicas?
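Concretely, the experiment I have in mind is a config fragment like the following (same Model body as in your snippet; this is an untested sketch, not verified against your cluster):

```python
# Fixed replica count instead of autoscaling: if the deployment stays
# stuck with pending replicas, the cluster cannot actually provision
# all 10, which would point at resource provisioning rather than the
# request path.
@serve.deployment(
    num_replicas=10,
    ray_actor_options={"memory": 30 * 1024 * 1024 * 1024},
)
class Model:
    def predict(self):
        # preprocessing, inference and post-processing
        return "y"
```

If all 10 replicas come up fine, that would rule out provisioning and point back at autoscaling itself.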