We’re evaluating whether Ray Core is the right fit for our use case and would appreciate guidance from the community.
What we’re building:
A real-time streaming pipeline that consumes messages from a queue, processes them (light compute + database read/write), and runs 24/7 indefinitely. The workload is heavily I/O-bound: most of the time per message is spent on database reads/writes (network I/O), with relatively light Python compute.
Actor model provides natural sequential processing per actor
Built-in actor failure detection and restart
Our setup:
Ray Core only (no Ray Serve, no Ray Data)
~70 actors per node, ~1000 actors per cluster depending on traffic
Fire-and-forget .remote() calls (no ray.get() — ObjectRefs discarded immediately)
Each actor receives ~1000 .remote() calls/sec from multiple producers
Cluster runs continuously for weeks/months — never restarted
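For scale, here is a back-of-the-envelope calculation from the figures above. It assumes every actor sustains the stated ~1000 calls/sec at all times, so it is an upper bound rather than a measurement; the constant and function names are just illustrative.

```python
# Rough call-volume estimate from the setup figures (upper bound, not measured).
ACTORS_PER_CLUSTER = 1_000        # "~1000 actors per cluster"
CALLS_PER_ACTOR_PER_SEC = 1_000   # "~1000 .remote() calls/sec" per actor
SECONDS_PER_DAY = 24 * 60 * 60


def calls_per_second() -> int:
    """Cluster-wide .remote() calls per second."""
    return ACTORS_PER_CLUSTER * CALLS_PER_ACTOR_PER_SEC


def calls_per_day() -> int:
    """Cluster-wide .remote() calls per day."""
    return calls_per_second() * SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{calls_per_second():,} calls/sec per cluster")
    print(f"{calls_per_day():,} calls/day per cluster")
```

So a single cluster would be in the range of a million actor calls per second, or tens of billions per day, if the peak rate held continuously.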
Our concerns:
1. Is Ray the right tool for I/O-heavy streaming? Most Ray use cases we see are compute-heavy (ML training, batch inference). Our workload is predominantly I/O wait. Are there known inefficiencies in using Ray actors as long-running I/O-bound message processors vs alternatives (plain gRPC services, asyncio)?
2. GCS with 1000 actors running indefinitely: is there known degradation over time (memory growth, control loop slowdown) for long-lived actors? Community reports mention GCS issues at 7-10 days, but those involved billions of short-lived tasks. Does the same risk apply to long-lived actors with ongoing .remote() calls?
3. Actor mailbox at ~1000 calls/sec: calls are fire-and-forget (no ray.get()). Is there a threshold where mailbox depth or .remote() call rate impacts GCS or raylet, even when ObjectRefs are immediately discarded?
4. Multi-cluster pattern: we’re considering multiple smaller RayClusters (8-10 clusters, each with ~1000 actors) on one EKS cluster. Is this recommended for long-running workloads? Any known issues with running many persistent RayClusters simultaneously under one KubeRay operator?
5. Long-running maintenance: with clusters running for weeks/months, are there recommended practices (periodic restart, GCS monitoring thresholds, known memory patterns)?
It seems likely that Ray Core can be used for long-lived, I/O-heavy streaming pipelines with actors, but there are important caveats and best practices to consider:
For I/O-bound workloads, async actors are generally recommended over tasks, since they avoid process-level overhead and make better use of resources while waiting on network I/O. Ray actors can serve as long-running pollers and message processors, and async actors are the way to maximize their throughput for I/O-heavy work, as discussed in this thread and here.
Regarding GCS and long-lived actors: most reported GCS degradation issues are related to clusters with massive numbers of short-lived tasks/actors, leading to unbounded metadata growth. For clusters with a stable set of long-lived actors, there is less risk, but you should still monitor GCS memory and dashboard responsiveness, as some users have reported slowdowns in long-running clusters (example, example). There is no explicit evidence that long-lived actors alone cause GCS issues, but best practice is to monitor for memory growth and consider periodic restarts if you observe degradation.
Would you like more detail on any of these points, or on mailbox throughput, multi-cluster patterns, or maintenance recommendations?
1. Async actors — clarification: We are using threading within actors for I/O parallelism (threads release GIL during network calls), but not Ray’s native async actors (async def methods). Would switching to async actors with max_concurrency help reduce mailbox pressure at ~1000 .remote() calls/sec per actor? Is there a recommended max_concurrency value for I/O-bound workloads at this call rate?
2. Actor mailbox throughput — this is the main concern I’d like guidance on. With ~1000 fire-and-forget .remote() calls/sec per actor (no ray.get()), is there a known threshold where the actor mailbox itself becomes a bottleneck — causing backpressure, memory growth, or raylet overhead? Or is the mailbox effectively unbounded for fire-and-forget calls?
3. Multi-cluster pattern — for our scale (8-10 persistent RayClusters on one EKS cluster, each with ~1000 actors, running indefinitely), is this a pattern others have successfully run in production? Any concerns with KubeRay operator managing many simultaneous long-running clusters?
4. GCS monitoring — what specific metrics should we monitor to detect early signs of degradation in a long-running cluster? Are there recommended thresholds for alerting (GCS memory, control loop latency, scheduling delay)?