1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.55.0
- Python version: 3.10
- OS: Linux Rocky9
- Cloud/Infrastructure:
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: (Hoping for) Ability to further customize HAProxy config.
- Actual: Some gaps in HAProxy config that we would like to change.
When HAProxy support was announced, we jumped on it. Excellent! On our first attempt HAProxy wasn't starting up (we were running a nightly), so we enabled HAProxy in Ray Serve to get the per-replica HTTP servers and ran HAProxy ourselves outside of Ray Serve, connecting to ports 30000-30005.
With this configuration we saw the massive performance increase as advertised. We then went further and tuned the HAProxy config so it worked well with the clients hitting our Serve deployments.
The official release works well for us, but its generated HAProxy config leaves some gaps that we can't override with environment variables. What we're hoping for is a way to customize the HAProxy config, such as supplying our own template or injecting overrides somehow (other environment variables?). Here's a claude-code analysis of what we can do with env variables and what the gaps are:
❯ what RAY_SERVE* environment variables do I need to set to achieve the configuration in start-haproxy.sh ?
⏺ Here's the mapping. I'll split into what you can achieve with env vars and what you can't.
Env vars to set (differ from Ray defaults)
RAY_SERVE_HAPROXY_MAXCONN=130000 # yours: 130000, Ray default: 20000
RAY_SERVE_HAPROXY_TIMEOUT_CONNECT_S=5 # yours: 5s, Ray default: None (omitted)
RAY_SERVE_HAPROXY_TIMEOUT_CLIENT_S=60 # yours: 60s, Ray default: 3600
RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S=120 # yours: 120s, Ray default: None (omitted)
RAY_SERVE_HAPROXY_HEALTH_CHECK_FASTINTER=1s # yours: 1s, Ray default: 250ms
Already matching Ray defaults (no change needed)
# RAY_SERVE_HAPROXY_NBTHREAD=4 # matches
# RAY_SERVE_HAPROXY_BALANCE_ALGORITHM=leastconn # matches
# RAY_SERVE_HAPROXY_HEALTH_CHECK_FALL=2 # matches
# RAY_SERVE_HAPROXY_HEALTH_CHECK_RISE=2 # matches
# RAY_SERVE_HAPROXY_HEALTH_CHECK_INTER=5s # matches
# RAY_SERVE_HAPROXY_METRICS_PORT=9101 # matches
# stats port 8404 # matches (hardcoded in HAProxyConfig)
Gaps — no env var available
┌────────────────────────────────────────────┬─────────────────────────────────────────┬────────────────────────────────────────────────────────────────┐
│ Your setting │ Ray behavior │ Impact │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ maxconnrate / maxsessrate (replicas * 512) │ Not supported │ No global rate limiting in Ray's config │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ timeout http-request 30s │ None (omitted) │ No protection against slow request headers │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ timeout queue 60s │ None (omitted) │ No limit on time a request waits in queue │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ option redispatch │ Not included │ Failed request won't retry on another server │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ retries 1 │ Not included │ No retry on connection failure │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ option http-keep-alive │ Controlled by template (likely present) │ Probably fine │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ http-reuse aggressive │ always │ always is actually more aggressive despite the name — no issue │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ Health check path /health │ Hardcoded /-/healthz │ Checks Ray's built-in health, not your app's /health endpoint │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ Per-server maxconn 32 │ Not supported │ No per-replica connection cap for backpressure │
├────────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ downinter absent │ 250ms (always emitted) │ Ray checks downed servers faster than yours would — harmless │
└────────────────────────────────────────────┴─────────────────────────────────────────┴────────────────────────────────────────────────────────────────┘
Bottom line
The 5 env vars above get you most of the way. The meaningful gaps are:
1. maxconnrate/maxsessrate — your rate limiting won't be there
2. timeout http-request and timeout queue — no slow-client or queue-depth protection
3. Health check path — Ray checks /-/healthz (is the proxy alive?) not /health (is your model loaded?)
4. Per-server maxconn 32 — no backpressure per replica; a slow replica can absorb all connections
Items 3 and 4 are the most operationally significant. Without per-server maxconn, you lose the pairing with max_ongoing_requests: 64 that keeps replicas from being overwhelmed. And /-/healthz won't tell HAProxy that a replica hasn't loaded its model yet.
(As claude says, we have max_ongoing_requests set to 64.)
We disagree with Claude's prioritization: maxconnrate and maxsessrate, along with timeout http-request and timeout queue, are also quite important to us.
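To make the request concrete, here is a minimal Python sketch of the kind of fragment we would like to be able to inject or template. This is not Ray's API and not its actual template: the function name `render_gap_config`, the backend name `serve_replicas`, and the `127.0.0.1` addresses are made up for illustration. The directive values are the ones from the gaps table above, and the replica ports are the 30000-30005 range we connect to today.

```python
from string import Template

# Hypothetical template covering only the directives we cannot currently set
# through RAY_SERVE_HAPROXY_* env vars. Not Ray's real template.
GAP_TEMPLATE = Template("""\
global
    maxconnrate $rate
    maxsessrate $rate

defaults
    mode http
    timeout http-request 30s
    timeout queue 60s
    option redispatch
    retries 1

backend serve_replicas
    balance leastconn
    option httpchk GET $health_path
    default-server maxconn $per_server_maxconn
$servers
""")


def render_gap_config(ports, health_path="/health",
                      per_server_maxconn=32, rate_per_replica=512):
    """Render an HAProxy fragment for the given per-replica HTTP ports."""
    ports = list(ports)
    # One server line per replica; 127.0.0.1 is an assumption (same host).
    servers = "\n".join(
        f"    server replica{i} 127.0.0.1:{port} check"
        for i, port in enumerate(ports)
    )
    return GAP_TEMPLATE.substitute(
        rate=len(ports) * rate_per_replica,  # replicas * 512, as in our setup
        health_path=health_path,
        per_server_maxconn=per_server_maxconn,
        servers=servers,
    )


if __name__ == "__main__":
    print(render_gap_config(range(30000, 30006)))
```

Something equivalent on the Ray side (a user-supplied template, or an extra block appended to the generated sections) would cover every gap in the table without needing one env var per directive.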
Looking for discussion / feedback before opening a jira request.
Thanks!