Unable to request predictions for multiple handles in a for loop

ainatersol · May 8, 2025, 3:32pm

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

Ray version: 2.44
Python version: 3.11
OS: Ubuntu 22.0
Cloud/Infrastructure: AWS
Other libs/tools (if relevant):

3. What happened vs. what you expected:

Expected: I expected to be able to call many different application handles in a loop.
Actual: When I call handles for different applications, the loop hangs.

I have set up several deployments in my cluster and intend to call them through handles. The idea is to run each request and store its result in a database, so the caller does not need to wait for the request to finish. My setup is high-load, so I was stress testing the system by requesting results from multiple handles in a for loop.

The config looks like this:

applications:
  - name: app0
    import_path: RayModelServiceHandlers:tf_app
    deployments: 
      - name: TFRayModelServiceBase
        ray_actor_options:
          num_gpus: 0.05
          num_cpus: 0.5
        autoscaling_config:
          target_ongoing_requests: 1
          min_replicas: 1
          max_replicas: 10
    runtime_env:
      env_vars:
        CODE_VERSION: "1.0.0"
        TF_ENABLE_ONEDNN_OPTS: "0"
        TF_CUDNN_USE_AUTOTUNE: "0"
      pip:
        - tensorflow==2.15.*

  - name: app1
    import_path: RayModelServiceHandlers:tf_app
    deployments: 
      - name: TFRayModelServiceBase
        ray_actor_options:
          num_gpus: 0.05
          num_cpus: 0.5
        autoscaling_config:
          target_ongoing_requests: 1
          min_replicas: 1
          max_replicas: 10
    runtime_env:
      env_vars:
        CODE_VERSION: "1.0.0"
        TF_ENABLE_ONEDNN_OPTS: "0"
        TF_CUDNN_USE_AUTOTUNE: "0"
      pip:
        - tensorflow==2.15.*

The for loop works when I send multiple requests to the same application, but not when I send requests to different ones. With a single handle, the replicas also scale up properly.
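
For illustration, a minimal sketch of the kind of loop described here, not the actual code from this setup. It assumes each application is looked up by name with `serve.get_app_handle` and that the deployment takes a single payload argument. The helper names `fan_out` and `run_against_cluster` and the example payload are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable


def fan_out(
    app_names: Iterable[str],
    payload: Any,
    get_handle: Callable[[str], Any],
    timeout_s: float = 60.0,
) -> Dict[str, Any]:
    """Send one request to each application, then collect the results."""
    # Submit everything first: .remote() hands back a response object without
    # waiting for the result, so requests to different apps overlap.
    pending = {name: get_handle(name).remote(payload) for name in app_names}
    # Only now block on each response. timeout_s is meant to bound each wait
    # so that a stuck application surfaces as an error instead of a hang.
    return {
        name: response.result(timeout_s=timeout_s)
        for name, response in pending.items()
    }


def run_against_cluster() -> Dict[str, Any]:
    """Hypothetical driver entry point; needs a running Serve instance."""
    # Imported lazily so fan_out can be exercised without Ray installed.
    from ray import serve

    # "app0" and "app1" are the application names from the config above;
    # the payload shape is an assumption.
    return fan_out(["app0", "app1"], {"inputs": [1.0, 2.0]}, serve.get_app_handle)
```

This version is written for a driver script outside of Serve. Inside a deployment, the responses would normally be awaited rather than resolved with `.result()`.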

How can I get this to work for multiple applications?
Even in the one-application setup, where at least one replica is already up, why does the first request take so long? It only starts after this error message:

gcs_rpc_client.h:151: Failed to connect to GCS at address 10.0.175.10:6379 within 5 seconds.
[2025-05-08 15:17:14,421 W 3624144 3624144] gcs_client.cc:178: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
