Help debugging a blocked Serve deployment

I am running Ray Serve on an autoscaling Ray cluster deployed on Kubernetes. I am trying to debug an issue where the deployment suddenly gets blocked after receiving requests successfully for some time.

Here is a simplified snippet of my setup:


import asyncio
import logging
import uuid

from fastapi import FastAPI
from ray import serve

logger = logging.getLogger("ray.serve")


@serve.deployment(
    max_concurrent_queries=2,
    ray_actor_options={
        "max_concurrency": 1,
        "memory": 30 * 1024 * 1024 * 1024,
    },
    _autoscaling_config={
        "target_num_ongoing_requests_per_replica": 1,
        "max_replicas": 10,
        "min_replicas": 3,
    },
    version="0.1",
)
class Model:

    def predict(self):
        # preprocessing, inference and post-processing
        return "y"

app = FastAPI()

results = {}  # in-memory map from task_id to finished prediction, local to this replica

@serve.deployment(route_prefix="/")
@serve.ingress(app)
class API:

    def __init__(self):
        self._debug = True

    @app.post("/")
    async def predict(self):

        logger.info("request start")
        task_id = str(uuid.uuid4())
        handle = Model.get_handle(sync=False)  # async handle to the Model deployment
        # With an async handle, .remote() returns a coroutine that resolves to an
        # ObjectRef once the request has been assigned to a replica.
        coro = handle.predict.remote()

        async def wait(coro, task_id):
            ref = await coro    # wait for assignment to a Model replica
            pred = await ref    # wait for the prediction itself
            results[task_id] = pred

        async def debug():
            # Heartbeat to confirm the event loop of this replica is alive.
            while True:
                logger.info("running")
                await asyncio.sleep(10)

        loop = asyncio.get_running_loop()
        loop.set_debug(True)
        if self._debug:
            loop.create_task(debug())
            self._debug = False
        loop.create_task(wait(coro, task_id))
        return task_id

    @app.get("/{task_id}")
    async def get(self, task_id: str):
        # Raises KeyError (HTTP 500) if the prediction is not finished yet.
        return results[task_id]
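
Not shown above: after connecting to the cluster, the deployments are created in the usual way for this Serve API version, roughly

serve.start(detached=True)
Model.deploy()
API.deploy()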

The Model deployment's computation is quite expensive; a single prediction takes about 1 minute. So the predict method of the API returns a task id to the caller immediately and awaits the result from the model in a background task.
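
For reference, the client side of this flow looks roughly like the following (illustrative only; host and port are the Serve HTTP proxy defaults):

import time
import requests

BASE = "http://localhost:8000"

# Submit a prediction; the API replies with a task id immediately.
task_id = requests.post(f"{BASE}/").json()

# Poll the get endpoint until the background task has stored the result.
while True:
    resp = requests.get(f"{BASE}/{task_id}")
    if resp.status_code == 200:
        print(resp.json())
        break
    time.sleep(5)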

This works well initially. But under a sustained load of about 3 predict requests per minute, the API stops receiving any requests after roughly 30-60 minutes. Further requests to both the predict and the get endpoints of the API hang.

I have verified that the event loop in the API actor is not blocked when this happens - the debug coroutine keeps logging. If I redeploy the API deployment at this point, it starts receiving requests again, so the problem is definitely somewhere within Ray. I have looked at the logs of the HTTPProxyActor and the other Ray core worker logs, but did not find anything relevant. How can I troubleshoot this further?
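
One thing I am planning to try is adding timeouts around the two awaits in the background task, to see whether stuck requests are waiting on replica assignment or on the prediction itself (this is the inner wait() from predict above; the timeout values are arbitrary):

async def wait(coro, task_id):
    try:
        # Hanging here would mean the request is never assigned to a Model replica.
        ref = await asyncio.wait_for(coro, timeout=120)
    except asyncio.TimeoutError:
        logger.error("task %s not assigned to a replica after 120s", task_id)
        return
    try:
        # Hanging here would mean a replica accepted the request but never finished it.
        pred = await asyncio.wait_for(ref, timeout=600)
    except asyncio.TimeoutError:
        logger.error("task %s not finished by the replica after 600s", task_id)
        return
    results[task_id] = pred

I can also dump stack traces of the proxy and replica processes when the hang occurs (e.g. with ray stack or py-spy), if that would help.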

I have further observed in my trials that if autoscaling of the Model deployment is turned off, or if max_replicas is lowered to 4 to limit the autoscaling, the problem does not occur.

I wonder if this is an issue with Serve trying to provision more replicas than the cluster can actually provide. Could you try setting num_replicas to 10 and removing the autoscaling config? Is your setup able to provision all 10 replicas?
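
Something roughly like this, keeping the rest of the options the same:

@serve.deployment(
    num_replicas=10,  # fixed replica count, no autoscaling
    max_concurrent_queries=2,
    ray_actor_options={
        "max_concurrency": 1,
        "memory": 30 * 1024 * 1024 * 1024,
    },
    version="0.1",
)
class Model:
    ...

If the cluster cannot actually bring up all 10 replicas (for example because the Kubernetes autoscaler cannot satisfy the 30 GB memory requests), that would be a useful data point.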