1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity, but I can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.44
- Python version: 3.11
- OS: Ubuntu 22.04
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: I can call many different application handles in a loop.
- Actual: when I try to call different applications, the calls hang.

I have set up several deployments in my cluster and intend to call them using handles. The idea is to fire off each request and store the result in a DB so I do not need to wait for the request to finish. I have a high-load setup, so I was stress testing the system by requesting results from multiple handles in a for loop.
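Roughly what my loop looks like (simplified: `handle.remote(...)` goes to the ingress deployment's `__call__`, and `payload` stands in for the real request body):

```python
import ray
from ray import serve

ray.init(address="auto")  # connect to the running cluster

payload = {"data": "example"}  # placeholder for the real model input

# One handle per application (names match the config below).
handles = {name: serve.get_app_handle(name) for name in ["app0", "app1"]}

# Fire off requests without blocking; each .remote() returns a
# DeploymentResponse immediately.
responses = []
for _ in range(100):
    for handle in handles.values():
        responses.append(handle.remote(payload))

# Collect the results later (in the real code these go into the DB).
results = [r.result() for r in responses]
```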
The config looks like this:
```yaml
applications:
  - name: app0
    import_path: RayModelServiceHandlers:tf_app
    deployments:
      - name: TFRayModelServiceBase
        ray_actor_options:
          num_gpus: 0.05
          num_cpus: 0.5
        autoscaling_config:
          target_ongoing_requests: 1
          min_replicas: 1
          max_replicas: 10
    runtime_env:
      env_vars:
        CODE_VERSION: "1.0.0"
        TF_ENABLE_ONEDNN_OPTS: "0"
        TF_CUDNN_USE_AUTOTUNE: "0"
      pip:
        - tensorflow==2.15.*
  - name: app1
    import_path: RayModelServiceHandlers:tf_app
    deployments:
      - name: TFRayModelServiceBase
        ray_actor_options:
          num_gpus: 0.05
          num_cpus: 0.5
        autoscaling_config:
          target_ongoing_requests: 1
          min_replicas: 1
          max_replicas: 10
    runtime_env:
      env_vars:
        CODE_VERSION: "1.0.0"
        TF_ENABLE_ONEDNN_OPTS: "0"
        TF_CUDNN_USE_AUTOTUNE: "0"
      pip:
        - tensorflow==2.15.*
```
The for loop works when I send multiple requests to the same application, but not when I request from different applications. It also spins up the replicas properly when I work with only one handle; that single-application variant is sketched below.
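For contrast, this single-application version completes fine (same placeholders as above):

```python
from ray import serve

payload = {"data": "example"}  # placeholder request body

# Works: many requests against a single application handle.
handle = serve.get_app_handle("app0")
responses = [handle.remote(payload) for _ in range(100)]
results = [r.result() for r in responses]
```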
- How can I get this to work for multiple applications?
- Even in the one-application setup, and given that I already start with at least one replica up, why does the first request take so long? It only starts after this error message:
```
gcs_rpc_client.h:151: Failed to connect to GCS at address 10.0.175.10:6379 within 5 seconds.
[2025-05-08 15:17:14,421 W 3624144 3624144] gcs_client.cc:178: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
```