Unable to request predictions for multiple handles in a for loop

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.44
  • Python version: 3.11
  • OS: Ubuntu 22.0
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: to be able to call handles for many different applications in a single loop.
  • Actual: when the loop calls handles for different applications, it hangs.

I have set up several deployments in my cluster and intend to call them through deployment handles. The idea is to fire off each request and store its result in a database, so the caller does not have to wait for the request to finish. Because the setup is meant for high load, I tried to stress-test it by requesting results from multiple handles in a for loop.
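A minimal sketch of the "fire the request, store the result later" idea, assuming the caller runs inside an asyncio event loop (for example inside another Serve deployment or an async web app). Calling `handle.remote(...)` and awaiting the returned `DeploymentResponse` are Ray Serve features; `submit`, `_request_and_store` and the `save` callback are hypothetical names, and `save` stands in for whatever async database write is actually used.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Set, Tuple

# Hypothetical persistence hook: an async function that writes one result to the DB.
SaveFn = Callable[[str, Any], Awaitable[None]]

# Strong references to in-flight tasks, so they are not garbage-collected early.
_background_tasks: Set[asyncio.Task] = set()


async def _request_and_store(request_id: str, handle: Any, payload: Any, save: SaveFn) -> None:
    # With Ray Serve, handle.remote(...) returns a DeploymentResponse,
    # which can be awaited from inside an event loop.
    result = await handle.remote(payload)
    await save(request_id, result)


def submit(jobs: List[Tuple[str, Any, Any]], save: SaveFn) -> List[asyncio.Task]:
    """Schedule one background task per (request_id, handle, payload) job.

    Must be called from inside a running event loop; it returns immediately
    without waiting for any response.
    """
    tasks = []
    for request_id, handle, payload in jobs:
        task = asyncio.create_task(_request_and_store(request_id, handle, payload, save))
        _background_tasks.add(task)
        task.add_done_callback(_background_tasks.discard)
        tasks.append(task)
    return tasks
```

The returned tasks only need to be awaited when the caller wants to know that every write has finished; otherwise they complete in the background.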

The config looks like this:

```yaml
applications:
  - name: app0
    import_path: RayModelServiceHandlers:tf_app
    deployments:
      - name: TFRayModelServiceBase
        ray_actor_options:
          num_gpus: 0.05
          num_cpus: 0.5
        autoscaling_config:
          target_ongoing_requests: 1
          min_replicas: 1
          max_replicas: 10
    runtime_env:
      env_vars:
        CODE_VERSION: "1.0.0"
        TF_ENABLE_ONEDNN_OPTS: "0"
        TF_CUDNN_USE_AUTOTUNE: "0"
      pip:
        - tensorflow==2.15.*

  - name: app1
    import_path: RayModelServiceHandlers:tf_app
    deployments:
      - name: TFRayModelServiceBase
        ray_actor_options:
          num_gpus: 0.05
          num_cpus: 0.5
        autoscaling_config:
          target_ongoing_requests: 1
          min_replicas: 1
          max_replicas: 10
    runtime_env:
      env_vars:
        CODE_VERSION: "1.0.0"
        TF_ENABLE_ONEDNN_OPTS: "0"
        TF_CUDNN_USE_AUTOTUNE: "0"
      pip:
        - tensorflow==2.15.*
```
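Each application in this config autoscales on its own, so the `ray_actor_options` and `autoscaling_config` values together fix how many logical CPUs and GPUs the two apps can request. The short calculation below is plain Python with values copied from the YAML; the dict layout and `replica_demand` are illustrative, not a Ray API. It gives 1 CPU / 0.1 GPU at `min_replicas` and 10 CPUs / 1 GPU at `max_replicas`. Whether that peak fits depends on the cluster's size, which is not stated here; replicas Ray cannot place would stay pending unless the cluster autoscaler can add nodes.

```python
from typing import Dict, List

# The two applications from the config, reduced to the fields that drive resource demand.
APPS: List[dict] = [
    {"name": "app0", "num_gpus": 0.05, "num_cpus": 0.5, "min_replicas": 1, "max_replicas": 10},
    {"name": "app1", "num_gpus": 0.05, "num_cpus": 0.5, "min_replicas": 1, "max_replicas": 10},
]


def replica_demand(apps: List[dict], replicas_key: str) -> Dict[str, float]:
    """Total logical CPUs/GPUs requested when every app runs `replicas_key` replicas."""
    cpus = sum(app["num_cpus"] * app[replicas_key] for app in apps)
    gpus = sum(app["num_gpus"] * app[replicas_key] for app in apps)
    return {"CPU": cpus, "GPU": gpus}


if __name__ == "__main__":
    print("at min_replicas:", replica_demand(APPS, "min_replicas"))
    print("at max_replicas:", replica_demand(APPS, "max_replicas"))
```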

The for loop works when I send multiple requests to the same application, but not when I send requests to different applications. With a single handle, the replicas also scale up as expected.
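For a synchronous driver script, one way to structure a loop over several applications is to issue every request first and only then block on the results, with a timeout so that a stuck application surfaces as an error instead of hanging the whole loop. This is a sketch, not a confirmed fix for the hang: `fan_out` is a hypothetical helper, `get_handle` is expected to behave like `ray.serve.get_app_handle`, and it assumes `DeploymentResponse.result(timeout_s=...)` raises `TimeoutError` when the timeout expires.

```python
from typing import Any, Callable, Dict, Iterable


def fan_out(
    app_names: Iterable[str],
    payload: Any,
    get_handle: Callable[[str], Any],
    timeout_s: float = 30.0,
) -> Dict[str, Any]:
    """Send one request to each application, then collect all responses.

    Each value in the returned dict is either the response or the TimeoutError
    raised for that application.
    """
    # Phase 1: issue every request without blocking; .remote() returns at once.
    responses = {name: get_handle(name).remote(payload) for name in app_names}

    # Phase 2: wait for each response, bounded by timeout_s per application.
    results: Dict[str, Any] = {}
    for name, response in responses.items():
        try:
            results[name] = response.result(timeout_s=timeout_s)
        except TimeoutError as exc:
            results[name] = exc
    return results
```

From a plain Python driver connected to the cluster (for example after `ray.init(address="auto")`), this could be called as `fan_out(["app0", "app1"], payload, serve.get_app_handle)`. The blocking `.result()` call is meant for synchronous code; inside an event loop, awaiting the responses is the intended pattern.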

  1. How can I get this to work for multiple applications?
  2. Even in the one-application setup, where at least one replica is already running from the start, why does the first request take so long? It only starts after this error message:

```
gcs_rpc_client.h:151: Failed to connect to GCS at address 10.0.175.10:6379 within 5 seconds.
[2025-05-08 15:17:14,421 W 3624144 3624144] gcs_client.cc:178: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
```
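These log lines come from a process that is trying to reach the GCS at 10.0.175.10:6379 and gives up after 5 seconds. A quick way to separate a network problem from a Ray problem is to check whether that port is reachable at all from the machine running the loop. The sketch below is a plain standard-library TCP check (`can_reach` is a hypothetical helper, not a Ray API); a successful connection only shows the port is open, not that the GCS itself is healthy.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts
        # (socket.timeout is a subclass of OSError).
        return False
```

Run on the machine that executes the request loop, `can_reach("10.0.175.10", 6379)` returning `False` would point at networking (security groups, routing, wrong address) rather than at Serve.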