What happened + What you expected to happen
Predict APIs on one of our endpoints consistently returned errors, causing outages in one of our production workflows. The issue later spread to other API routes, impacting other functionality as well.
Root Cause Analysis
- A particular type of app failed to schedule on the RayServe platform for multiple deployments.
- Pods were consistently stuck in an Unready state with repeated errors in logs: runtime_env_agent_client.cc:369: Create runtime env for job 01000000
- Although a single replica of one of the deployments was initially available, it was eventually terminated as well.
- Due to this, the readinessProbe/livenessProbe checks were intermittently failing on port 8000 for the rayserve-worker pods.
- Additionally, the GCS health check endpoint (/api/gcs_healthz on port 8265) failed for the ray-head container in the RayServe head pod.
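For anyone trying to confirm the two probe failures listed above from outside the cluster, a minimal sketch along these lines may help. It assumes ports 8265 and 8000 have already been forwarded to localhost (for example with kubectl port-forward). The /api/gcs_healthz path comes from this report; the /-/healthz path on port 8000 is an assumption about what the worker probe hits and should be replaced with whatever the pod spec actually uses. The function names are hypothetical.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> tuple[bool, str]:
    """Issue one HTTP GET and return (healthy, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-2xx status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, and similar.
        return False, f"unreachable: {exc}"


def check_endpoints(endpoints: dict[str, str], timeout: float = 2.0) -> dict[str, tuple[bool, str]]:
    """Probe every named endpoint once and collect the results."""
    return {name: probe(url, timeout) for name, url in endpoints.items()}


if __name__ == "__main__":
    endpoints = {
        # Path and port taken from the report (ray-head container).
        "gcs": "http://127.0.0.1:8265/api/gcs_healthz",
        # Port taken from the report; the path is an assumption.
        "serve": "http://127.0.0.1:8000/-/healthz",
    }
    for name, (ok, detail) in check_endpoints(endpoints).items():
        print(f"{name}: {'ok' if ok else 'FAILING'} ({detail})")
```

Running this in a loop while the pods are Unready would show whether the two endpoints fail together or independently, which may help narrow down whether the head pod or the workers degrade first.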
The same log line is printed repeatedly in the CPU-only worker pod, which is Unready:
2025-05-22T07:41:14.392669721Z ray-cpu-worker [2025-05-22 00:41:14,392 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:14.392808161Z ray-cpu-worker [2025-05-22 00:41:14,392 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:14.897660052Z ray-cpu-worker [2025-05-22 00:41:14,897 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:14.897811652Z ray-cpu-worker [2025-05-22 00:41:14,897 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:15.674839026Z ray-cpu-worker [2025-05-22 00:41:15,674 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:15.674978757Z ray-cpu-worker [2025-05-22 00:41:15,674 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:16.401727748Z ray-cpu-worker [2025-05-22 00:41:16,401 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:16.401851197Z ray-cpu-worker [2025-05-22 00:41:16,401 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:17.097810089Z ray-cpu-worker [2025-05-22 00:41:17,097 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:17.097955459Z ray-cpu-worker [2025-05-22 00:41:17,097 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
2025-05-22T07:41:17.801948110Z ray-cpu-worker [2025-05-22 00:41:17,801 I 64 64] (raylet) runtime_env_agent_client.cc:369: Create runtime env for job 01000000
The runtime env setup log (runtime_env_setup-01000000.log) also shows the same entries repeating:
2025-05-22 00:42:29,997 INFO plugin.py:257 -- Runtime env pip pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2025-05-22 00:42:29,997 INFO uri_cache.py:71 -- Marked URI pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf used.
2025-05-22 00:42:30,719 INFO uri_cache.py:84 -- Added URI gs://pixelbin-ml-worker-queue/ray-applications/RayApplicationsCI-deploy.ray.view-detection.v13.1746474693.zip with size 187817
2025-05-22 00:42:30,719 INFO plugin.py:257 -- Runtime env pip pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2025-05-22 00:42:30,719 INFO uri_cache.py:71 -- Marked URI pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf used.
2025-05-22 00:42:31,418 INFO uri_cache.py:84 -- Added URI gs://pixelbin-ml-worker-queue/ray-applications/RayApplicationsCI-deploy.ray.view-detection.v13.1746474693.zip with size 187817
2025-05-22 00:42:31,418 INFO plugin.py:257 -- Runtime env pip pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2025-05-22 00:42:31,418 INFO uri_cache.py:71 -- Marked URI pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf used.
2025-05-22 00:42:32,125 INFO uri_cache.py:84 -- Added URI gs://pixelbin-ml-worker-queue/ray-applications/RayApplicationsCI-deploy.ray.view-detection.v13.1746474693.zip with size 187817
2025-05-22 00:42:32,126 INFO plugin.py:257 -- Runtime env pip pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2025-05-22 00:42:32,126 INFO uri_cache.py:71 -- Marked URI pip://03b0b4b60765962f148fd91e395c8ad4f321e3bf used.
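To quantify the repetition in the two logs above rather than eyeballing it, a small sketch like the following could be used. It strips the leading timestamps (in the two formats that appear in these excerpts) and counts identical messages. The helper names and the min_count threshold are hypothetical, and the regular expressions are only fitted to the lines shown here.

```python
import re
from collections import Counter

# Leading timestamps in the two formats seen in the excerpts:
#   2025-05-22T07:41:14.392669721Z   (container log prefix)
#   2025-05-22 00:42:29,997          (Python logger prefix)
_TS = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[.,]\d+Z?\s+")
# Bracketed raylet prefix such as "[2025-05-22 00:41:14,392 I 64 64]".
_BRACKET = re.compile(r"\[\d{4}-\d{2}-\d{2} [\d:,]+ \w+ \d+ \d+\]\s*")


def normalize(line: str) -> str:
    """Drop timestamp prefixes so identical messages compare equal."""
    line = _TS.sub("", line.strip())
    return _BRACKET.sub("", line)


def repeated_messages(lines, min_count: int = 3) -> list[tuple[str, int]]:
    """Return (message, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return [(msg, n) for msg, n in counts.most_common() if n >= min_count]
```

Feeding it the lines of a raylet log or of runtime_env_setup-01000000.log (for example via open(path).readlines()) gives a per-message count that can be compared between healthy and Unready pods.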
Expected Behavior
Pods should successfully schedule and maintain readiness/liveness for predictable API availability. Runtime environment creation should not block pod readiness.
Actual Behavior
Pods remain Unready while runtime environment creation is retried in a loop, leading to API-level outages.
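When collecting evidence for this state, it can be useful to list exactly which pods are Unready and why. The sketch below assumes its input is the parsed JSON from kubectl get pods -o json and relies only on the standard Ready condition in each pod's status. The function name is hypothetical.

```python
def unready_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, reason) for every pod whose Ready condition is not True."""
    result = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        conditions = pod.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            # Pod has not reported a Ready condition yet (e.g. still pending).
            result.append((name, "no Ready condition reported"))
        elif ready.get("status") != "True":
            result.append((name, ready.get("reason", "unknown")))
    return result
```

One way to use it is to save the kubectl output to a file, load it with json.load, and pass the resulting dict to unready_pods.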
Additional Context
Please let us know if further logs, configs, or metrics would help in identifying the root issue.
/cc @ray-project/serve
Versions / Dependencies
Environment Details
- Ray version: 2.10.0
- Python Version: 3.11.8
- Deployment type: K8s
- Platform: GKE
Reproduction script
We were not able to reproduce the issue reliably in lower environments or test clusters. The failure appears to be intermittent or environment-specific, possibly related to scale, network conditions, or RayServe’s internal runtime environment management.
Issue Severity
High: It blocks me from completing my task.