Hey everyone,
I’m currently running a large-scale performance testing pipeline distributed across a multi-node Ray cluster, and I’ve run into a serious actor bottleneck that I can’t seem to isolate.
The goal of the setup is to simulate realistic, massive traffic bursts hitting a web application platform to monitor backend response stability. During these test windows, we have automated client nodes configured so that hundreds of virtual workers visit this site simultaneously from specialized mobile sandbox environments. These virtual workers run continuous, high-frequency mobile script execution loops and automation frameworks to mimic complex client behaviors in real time.
The issue is that when these parallel client instances start pushing intense streams of performance log metrics back to our Ray cluster via remote actors, the Ray head node struggles with actor scheduling latency. Within a few minutes of a high-throughput run, we see a cascading failure: several worker nodes suddenly lose heartbeat connectivity, and Ray throws RayActorError exceptions stating that workers died unexpectedly due to out-of-memory (OOM) pressure.
My working theory is that the object store is getting choked by the massive influx of tiny, rapid-fire JSON objects from the mobile script logs, which arrive faster than they can be aggregated and downsampled.
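For context, one mitigation I'm considering is batching on the client side so that each remote call carries a list of records instead of a single tiny JSON object. Below is a minimal sketch of that idea in plain Python, not what the pipeline currently does. The MetricBatcher class, its parameters, and the send callback are all hypothetical names I made up for illustration.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


class MetricBatcher:
    """Buffers small metric records and hands them off as one list.

    `send` is whatever ships a batch downstream (hypothetical; with Ray it
    could wrap a remote actor call).
    """

    def __init__(self, send: Callable[[List[Record]], Any],
                 max_records: int = 500, max_age_s: float = 1.0) -> None:
        self._send = send
        self._max_records = max_records
        self._max_age_s = max_age_s
        self._buffer: List[Record] = []
        self._oldest: Optional[float] = None  # monotonic time of oldest record

    def add(self, record: Record) -> None:
        """Buffer one record; flush if the batch is full or too old."""
        if not self._buffer:
            self._oldest = time.monotonic()
        self._buffer.append(record)
        too_many = len(self._buffer) >= self._max_records
        too_old = time.monotonic() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single batch, if anything."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._send(batch)


# Stand-in for the real sink: collect batches in a local list.
sent_batches: List[List[Record]] = []
batcher = MetricBatcher(sent_batches.append, max_records=3, max_age_s=60.0)
for i in range(7):
    batcher.add({"latency_ms": i})
batcher.flush()  # ship the final partial batch
print([len(b) for b in sent_batches])
```

With Ray, the send callback could be something like `lambda batch: aggregator.ingest.remote(batch)`, where `aggregator` and `ingest` are placeholder names for an actor handle and its method. Note that the age check only runs inside add(), so a quiet worker would still need a periodic flush() call.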
Has anyone else run into scheduling lag or unexpected node drops when using Ray to coordinate high-concurrency client automation or mobile emulation metrics? Should I be looking into tuning the object store memory limits (object_store_memory) on each worker node, or is it better practice to put a bounded queuing layer (like Ray Queue, i.e. ray.util.queue.Queue) in front of the actors to throttle the incoming data before it reaches them?
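To make the queuing option concrete, here is a rough sketch of the backpressure behavior I have in mind: producers try to enqueue with a short timeout and drop the record if the queue stays full, while the consumer drains in batches. It uses the standard library's queue.Queue as a stand-in so it runs without a cluster. My assumption is that ray.util.queue.Queue(maxsize=...) offers a similar put/get_nowait interface and that its Full and Empty exceptions subclass the stdlib ones, but I haven't verified that against every Ray version. The offer and drain helpers are hypothetical names.

```python
import queue
from typing import Any, Dict, List

Record = Dict[str, Any]


def offer(q: Any, record: Record, timeout_s: float = 0.05) -> bool:
    """Try to enqueue a record; return False if the queue stays full.

    Returning False means the caller drops (or locally downsamples) the
    record instead of piling more data onto the cluster.
    """
    try:
        q.put(record, timeout=timeout_s)
        return True
    except queue.Full:
        return False


def drain(q: Any, max_items: int) -> List[Record]:
    """Pull up to max_items records without blocking, for batch aggregation."""
    items: List[Record] = []
    while len(items) < max_items:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            break
    return items


# Stand-in queue with a hard cap of 4 pending records and no consumer running.
q = queue.Queue(maxsize=4)
accepted = sum(offer(q, {"seq": i}, timeout_s=0.01) for i in range(10))
batch = drain(q, max_items=100)
print(accepted, len(batch))
```

The trade-off here is that a full queue turns into dropped metrics rather than memory growth, which seems acceptable to me for data that gets downsampled anyway.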
Any insights on how to handle high-frequency, machine-generated mobile data streams across a Ray cluster would be an absolute life-saver!