Actor scheduling latency and unexpected worker node crashes under heavy distributed mobile testing simulations

Hey everyone,

I’m currently running a large-scale performance testing pipeline distributed across a multi-node Ray cluster, and I’ve run into a serious actor bottleneck that I can’t seem to isolate.

The goal of the setup is to simulate realistic, massive traffic bursts hitting a web application platform to monitor backend response stability. During these test windows, we have automated client nodes configured so that hundreds of virtual workers visit this site simultaneously from specialized mobile sandbox environments. These virtual workers run continuous, high-frequency mobile script execution loops and automation frameworks to mimic complex client behaviors in real time.

The issue is that when these parallel client instances start pushing intense streams of performance log metrics back to our Ray cluster via remote actors, the Ray head node struggles with actor scheduling latency. Within a few minutes of a high-throughput run, we see a cascading failure: several worker nodes suddenly lose heartbeat connectivity, and Ray throws RayActorError exceptions stating that workers died unexpectedly due to out-of-memory (OOM) pressure.

It looks like the Object Store is getting choked by the massive influx of tiny, rapid JSON objects being passed from the mobile script logs before they can be aggregated and downsampled.

Has anyone else ran into scheduling lag or unexpected node drops when using Ray to coordinate high-concurrency client automation or mobile emulation metrics? Should I be looking into tuning the object store memory limits (object_store_memory) on each worker node, or is it better practice to implement a stateless queuing architecture (like Ray Queue) to throttle the incoming data payload size before it hits the actors?

Any insights on how to handle high-frequency, machine-generated mobile data streams across a Ray cluster would be an absolute life-saver!

Your issue is a known Ray scaling pain point: high-frequency, small-object traffic can overwhelm the object store, causing OOM and actor scheduling failures. The Ray object store is optimized for fewer, larger objects; rapid streams of tiny JSONs quickly fill shared memory, triggering OOM and node instability, especially under bursty loads. This is compounded by actor scheduling lag as the head node becomes a bottleneck for actor creation and log ingestion, leading to cascading worker failures and RayActorError exceptions. Tuning object_store_memory can help, but it’s not a complete solution—Ray’s memory monitor may still kill workers if heap usage spikes, and the object store will spill only large objects, not many small ones. See Debugging Memory Issues and OOM Prevention.

Best practice is to aggregate or batch logs before sending them to Ray actors. Implementing a stateless queue or buffer (e.g., using Ray Queue or an external message broker) to batch and throttle incoming metrics will drastically reduce object store pressure and scheduling lag. You can also set per-dataset or per-actor object store memory limits and use Ray Data’s backpressure mechanisms to slow down production when memory is tight (Data Loading & Backpressure). Avoid sending raw, high-frequency logs directly to actors; instead, downsample or compress them client-side, then send larger, less frequent payloads. Would you like more detail on batching strategies or Ray’s memory configuration for this use case?

Sources:

Hint: Mention @RunLLM in the post for followups.