Is Ray Core suitable for long-running, I/O-heavy streaming with 200+ actors per cluster?

Tanu_Shree · June 22, 2026, 6:31pm

Question:

We’re evaluating whether Ray Core is the right fit for our use case and would appreciate guidance from the community.

What we’re building:

A real-time streaming pipeline that consumes messages from a queue, processes them (light compute + database read/write), and runs 24/7 indefinitely. The workload is heavily I/O-bound — the majority of time per message is spent on database reads/writes (network I/O), with relatively light Python compute.

Why we chose Ray Core:

Needed long-lived stateful Python processes (actors)
Actor model provides natural sequential processing per actor
Built-in actor failure detection and restart
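
The three properties above can be sketched as follows. This is a minimal illustration, not the poster's code: `MessageProcessor`, `handle`, and `start_processor` are hypothetical names, and the restart settings (`max_restarts=-1`, `max_task_retries=-1`) are just one possible choice. Ray is imported lazily so the plain class can be exercised without a cluster.

```python
class MessageProcessor:
    """Hypothetical stateful processor; state lives as long as the actor does."""

    def __init__(self):
        self.processed = 0      # kept across calls, rebuilt if the actor restarts
        self.last_seen = None

    def handle(self, message):
        # With default settings an actor runs one method call at a time,
        # so this read-modify-write needs no lock.
        self.processed += 1
        self.last_seen = message
        return self.processed


def start_processor():
    """Wrap the class as a Ray actor that Ray restarts after a crash."""
    import ray  # imported here so the class above is usable without Ray

    ray.init(ignore_reinit_error=True)
    # Assumed settings: -1 means "restart / retry without limit".
    actor_cls = ray.remote(max_restarts=-1, max_task_retries=-1)(MessageProcessor)
    return actor_cls.remote()
```

With a cluster available, `actor = start_processor()` followed by `actor.handle.remote(msg)` would send one message. A restarted actor re-runs `__init__`, so in-memory state such as `processed` starts over unless it is checkpointed somewhere external.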

Our setup:

Ray Core only (no Ray Serve, no Ray Data)
~70 actors per node, ~1000 actors per cluster depending on traffic
Fire-and-forget .remote() calls (no ray.get() — ObjectRefs discarded immediately)
Each actor receives ~1000 .remote() calls/sec from multiple producers
Cluster runs continuously for weeks/months — never restarted

Our concerns:

1. Is Ray the right tool for I/O-heavy streaming? Most Ray use cases we see are compute-heavy (ML training, batch inference). Our workload is predominantly I/O wait. Are there known inefficiencies in using Ray actors as long-running I/O-bound message processors vs alternatives (plain gRPC services, asyncio)?

2. GCS with 1000 actors running indefinitely — is there known degradation over time (memory growth, control loop slowdown) for long-lived actors? Community reports mention GCS issues at 7-10 days, but those involved billions of short-lived tasks. Does the same risk apply to long-lived actors with ongoing .remote() calls?

3. Actor mailbox at ~1000 calls/sec — calls are fire-and-forget (no ray.get()). Is there a threshold where mailbox depth or .remote() call rate impacts GCS or raylet — even when ObjectRefs are immediately discarded?

4. Multi-cluster pattern — we’re considering multiple smaller RayClusters (8-10 clusters, each with ~1000 actors) on one EKS cluster. Is this recommended for long-running workloads? Any known issues with running many persistent RayClusters simultaneously under one KubeRay operator?

5. Long-running maintenance — with clusters running for weeks/months, are there recommended practices (periodic restart, GCS monitoring thresholds, known memory patterns)?

RunLLM · June 22, 2026, 6:32pm

It seems likely that Ray Core can be used for long-lived, I/O-heavy streaming pipelines with actors, but there are important caveats and best practices to consider:

For I/O-bound workloads, async actors are generally recommended over tasks: they avoid process-level overhead and make better use of resources during network I/O. Ray actors can serve as long-running pollers and message processors, as discussed in earlier threads on this forum.
Regarding GCS and long-lived actors: most reported GCS degradation issues involve clusters with massive numbers of short-lived tasks or actors, which leads to unbounded metadata growth. With a stable set of long-lived actors the risk is lower, but you should still monitor GCS memory and dashboard responsiveness, since some users have reported slowdowns in long-running clusters. There is no explicit evidence that long-lived actors alone cause GCS issues; the best practice is to monitor for memory growth and consider periodic restarts if you observe degradation.
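
To make the async-actor suggestion concrete, here is a minimal sketch. `AsyncProcessor`, `run_locally`, and `start_async_actor` are hypothetical names, `asyncio.sleep` stands in for an async database call, and the `max_concurrency=100` default is an arbitrary placeholder. The local driver uses plain asyncio to show how awaited I/O overlaps; the Ray wrapper is the same class declared as an async actor.

```python
import asyncio
import time


class AsyncProcessor:
    """Hypothetical processor whose method awaits I/O instead of blocking."""

    def __init__(self, io_delay_s=0.05):
        self.io_delay_s = io_delay_s
        self.processed = 0

    async def handle(self, message):
        # Stand-in for an async database call; while it is awaited, the
        # event loop is free to start other handle() calls.
        await asyncio.sleep(self.io_delay_s)
        self.processed += 1
        return message


async def run_locally(n_messages=50):
    """Drive the plain class with asyncio to show the overlap, without Ray."""
    proc = AsyncProcessor()
    start = time.perf_counter()
    results = await asyncio.gather(*(proc.handle(i) for i in range(n_messages)))
    return results, time.perf_counter() - start


def start_async_actor(max_concurrency=100):
    """Same class as a Ray async actor; max_concurrency caps overlapping calls."""
    import ray  # lazy import: only needed when a cluster is available

    ray.init(ignore_reinit_error=True)
    return ray.remote(AsyncProcessor).options(max_concurrency=max_concurrency).remote()
```

Run locally, 50 messages with 50 ms of simulated I/O each finish in roughly the time of one call instead of 2.5 s. The trade-off is that per-actor sequential processing no longer holds: calls interleave at every `await`.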

Would you like more detail on any of these points, or on mailbox throughput, multi-cluster patterns, or maintenance recommendations?

Hint: Mention @RunLLM in the post for followups.

Tanu_Shree · June 23, 2026, 3:25am

Thanks for the response. A few follow-ups:

1. Async actors — clarification: We are using threading within actors for I/O parallelism (threads release GIL during network calls), but not Ray’s native async actors (async def methods). Would switching to async actors with max_concurrency help reduce mailbox pressure at ~1000 .remote() calls/sec per actor? Is there a recommended max_concurrency value for I/O-bound workloads at this call rate?
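
One way to reason about a starting `max_concurrency` is Little's law: the average number of calls in flight equals the arrival rate times the average time each call takes. The sketch below applies it. The 20 ms latency and the headroom factor are illustrative assumptions, not measurements from this thread, and `suggested_max_concurrency` is a hypothetical helper.

```python
import math


def suggested_max_concurrency(calls_per_sec, avg_latency_ms, headroom=2):
    """Little's law: calls in flight = arrival rate x time per call."""
    in_flight = calls_per_sec * avg_latency_ms / 1000
    # Headroom leaves room for latency spikes above the average.
    return math.ceil(in_flight * headroom)


print(suggested_max_concurrency(1000, 20))              # 40
print(suggested_max_concurrency(1000, 20, headroom=1))  # 20
```

At 1000 calls/sec and 20 ms per call, about 20 calls are in flight on average, so a setting below that would let the queue grow. If the result exceeds what the database connection pool allows, the pool is likely the tighter limit.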

2. Actor mailbox throughput — this is the main concern I’d like guidance on. With ~1000 fire-and-forget .remote() calls/sec per actor (no ray.get()), is there a known threshold where the actor mailbox itself becomes a bottleneck — causing backpressure, memory growth, or raylet overhead? Or is the mailbox effectively unbounded for fire-and-forget calls?
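
If the queue does turn out to grow without bound, one common alternative to pure fire-and-forget is producer-side backpressure: keep the returned handles and block once too many are unfinished. The sketch below is an assumption-laden illustration, not the setup described in this thread (which discards ObjectRefs). `submit_with_backpressure`, `demo_with_threads`, and `demo_with_ray` are hypothetical names; the thread-pool demo is a local stand-in for an actor.

```python
import concurrent.futures
import time


def submit_with_backpressure(submit, wait_any, items, max_in_flight):
    """Submit items but never hold more than max_in_flight unfinished calls.

    submit(item) returns a handle; wait_any(handles) blocks until at least
    one finishes and returns (done, pending).
    """
    in_flight, peak = [], 0
    for item in items:
        if len(in_flight) >= max_in_flight:
            _done, pending = wait_any(in_flight)
            in_flight = list(pending)
        in_flight.append(submit(item))
        peak = max(peak, len(in_flight))
    return in_flight, peak


def demo_with_threads(n_items=40, max_in_flight=8):
    """Local stand-in for an actor: a thread pool doing fake I/O."""
    def fake_io(x):
        time.sleep(0.005)
        return x

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        remaining, peak = submit_with_backpressure(
            submit=lambda x: pool.submit(fake_io, x),
            wait_any=lambda futs: concurrent.futures.wait(
                futs, return_when=concurrent.futures.FIRST_COMPLETED),
            items=range(n_items),
            max_in_flight=max_in_flight,
        )
        concurrent.futures.wait(remaining)  # drain the tail
    return peak


def demo_with_ray(actor, items, max_in_flight=1000):
    """Same loop against a Ray actor handle exposing a `handle` method."""
    import ray  # lazy import: only needed when a cluster is available

    return submit_with_backpressure(
        submit=actor.handle.remote,
        wait_any=lambda refs: ray.wait(refs, num_returns=1),
        items=items,
        max_in_flight=max_in_flight,
    )
```

Ray also documents a per-handle `max_pending_calls` actor option that raises `PendingCallsLimitExceeded` once the limit is reached; whether it suits this workload at ~1000 calls/sec has not been tested here.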

3. Multi-cluster pattern — for our scale (8-10 persistent RayClusters on one EKS cluster, each with ~1000 actors, running indefinitely), is this a pattern others have successfully run in production? Any concerns with KubeRay operator managing many simultaneous long-running clusters?

4. GCS monitoring — what specific metrics should we monitor to detect early signs of degradation in a long-running cluster? Are there recommended thresholds for alerting (GCS memory, control loop latency, scheduling delay)?
