.remote() call occasionally hangs

  • High: It blocks me from completing my task.

Issue: Occasionally calls to .remote() hang much longer than normal.

Hi, before I make an issue on GitHub, I'd like someone to sanity-check my setup.
I have three services in my architecture.
1.) My own FastAPI server.
2.) A Request pre-processing Ray Deployment.
3.) A Compute Ray Deployment.

The FastAPI server is my own custom ingress endpoint, not using Ray’s ingress functionality. It connects to an existing Ray cluster via ray.init() on initialization. When the async request endpoint is hit, it uses serve.get_app_handle("RequestApplication").remote(request) to send the request to the Request deployment.
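For context, the submit path is fire-and-forget. A minimal sketch of that step, with the DeploymentHandle abstracted out so it can be shown in isolation (the endpoint shape and ack payload are illustrative, not my exact code):

```python
# Sketch of the ingress submit step. `handle` stands in for the
# DeploymentHandle returned by serve.get_app_handle("RequestApplication");
# the ack payload is illustrative, not my real response.

def submit_request(handle, request) -> dict:
    """Send `request` to the Request deployment without awaiting a result.

    The Compute deployment returns results to the FastAPI server out of
    band, so the response object from .remote() is intentionally dropped.
    """
    handle.remote(request)  # fire-and-forget; never awaited
    return {"status": "accepted"}
```

In the real server this runs inside the async FastAPI endpoint, after ray.init() has connected to the cluster on startup.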

The Request deployment runs on the Ray head node and, based on the request received, sends it to the correct Compute deployment via serve.get_app_handle(<compute app name>).remote(request). This happens in the deployment’s async __call__ method.
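The routing step inside __call__ is just a lookup from request type to app name. A hypothetical sketch (the request types and app names here are placeholders, not my real names):

```python
# Placeholder routing table; the keys and app names are hypothetical.
ROUTING = {
    "language_model": "LanguageModelApp",
    "embedding": "EmbeddingApp",
}

def compute_app_for(request_type: str) -> str:
    """Return the Serve app name whose handle should receive the request."""
    try:
        return ROUTING[request_type]
    except KeyError:
        raise ValueError(f"no compute app registered for {request_type!r}")

# Inside the Request deployment's async __call__, this feeds straight
# into Serve (again fire-and-forget):
#     handle = serve.get_app_handle(compute_app_for(request.type))
#     handle.remote(request)
```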

The Compute deployment runs on a different Ray node, on a different machine from the FastAPI server and the Ray head node. It also receives the request in its async __call__ method, executes the compute-heavy work, and handles sending the data back to the FastAPI server.
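The shape of the Compute deployment's handling step, with the Ray pieces abstracted as parameters so the flow is visible (`get`, `run_model`, and `send_back` are stand-ins: in the real code `get` is ray.get and `send_back` is whatever out-of-band channel returns results to the FastAPI server; the field names are hypothetical):

```python
# Hypothetical shape of the Compute deployment's __call__ body.
def handle_compute(request: dict, get, run_model, send_back) -> None:
    payload = get(request["payload_ref"])   # resolve the ray.put reference
    result = run_model(payload)             # the compute-heavy step
    send_back(request["reply_to"], result)  # out-of-band, bypasses Ray
```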

Things to note:

  • Neither the FastAPI server nor the Request deployment waits for the response from .remote(request), since the Compute deployment handles returning the result separately from Ray.
  • The request is a Pydantic object that contains a large object as one of its attributes. The FastAPI server calls ray.put on that attribute, and the Compute deployment later calls ray.get on the resulting reference.
  • In the logs for the Request deployment, I consistently see “LongPollClient polling timed out. Retrying.” I'm not sure if this is a problem.
  • In the FastApi server I sometimes see: “WARNING 2024-09-24 09:51:14,270 serve 20 pow_2_scheduler.py:536 - Failed to get queue length from Replica(id=‘was0z9gr’, deployment=‘ModelDeployment’, app=‘Model:nnsight-models-languagemodel-languagemodel-repo-id-eleutherai-gpt-j-6b’) within 1.0s. If this happens repeatedly it’s likely caused by high network latency in the cluster. You can configure the deadline using the RAY_SERVE_QUEUE_LENGTH_RESPONSE_DEADLINE_S environment variable.”
  • This is often followed by: “concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x7f118c190910 state=cancelled>”
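To make the second bullet concrete, here is the large-attribute handoff in isolation, with pure-Python stand-ins so it can be shown end to end (`put`/`get` stand in for ray.put/ray.get, and the field names are placeholders):

```python
# The FastAPI server swaps the large attribute for an object-store
# reference before shipping the request; the Compute deployment resolves
# it later. `put`/`get` stand in for ray.put/ray.get.

def detach_payload(request: dict, put) -> dict:
    """Replace request['payload'] with a reference produced by `put`."""
    out = dict(request)
    out["payload_ref"] = put(out.pop("payload"))  # ray.put in real code
    return out

def resolve_payload(request: dict, get):
    """On the Compute side, fetch the payload back via `get`."""
    return get(request["payload_ref"])            # ray.get in real code
```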

Is there anything I need to do as I’m not waiting for the results of requests to .remote()? Should I be “closing” the DeploymentHandles in some way after I’ve sent data via their .remote() method?

Sorry for the long post. Please let me know if there are any logs or information I can provide.

I don’t immediately see anything wrong with the setup. You shouldn’t need to “close” the handles or anything like that. The long poll client timeouts and the replica response latency warnings are concerning, though, so I would be surprised if they’re not related.

Do you see any errors in the serve controller logs (/tmp/ray/session_latest/logs/controller_<pid>.log)? Also, what does the CPU utilization look like? It could be that the Python processes have very high CPU contention.
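If it helps, here's a quick way to scan those controller logs for error-looking lines (the path is the one above; the glob pattern and error regex are just a starting point, adjust as needed):

```python
# Scan Ray Serve controller logs for lines that look like errors.
import glob
import re

ERROR_RE = re.compile(r"ERROR|Traceback|InvalidStateError")

def find_error_lines(text: str) -> list[str]:
    """Return the lines of a log that match the error pattern."""
    return [line for line in text.splitlines() if ERROR_RE.search(line)]

if __name__ == "__main__":
    for path in glob.glob("/tmp/ray/session_latest/logs/controller_*.log"):
        with open(path) as f:
            for line in find_error_lines(f.read()):
                print(path, line)
```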

I’d suggest that you file a GitHub issue so we can track the issue and discuss it there. Please also provide any additional logs and details about your setup (such as the Ray version).