Ray dashboard lag and worker heartbeat timeouts when running background automation environments

Hi everyone,

I’ve been running distributed training jobs across a local cluster, and I’ve hit a persistent problem: the Ray Dashboard becomes extremely sluggish, and I see sporadic “Worker heartbeat timeout” errors in my logs even when the cluster isn’t at full capacity.

I am currently using a https://deltaexeutor.com/vng/ community setup to handle some automated local logging and script-based data scrubbing for my experiments in the background while the Ray head node is active. I’ve noticed that whenever the background executor starts a heavy processing cycle, the dashboard’s metric updates start to “hang.” It’s particularly frustrating when I’m trying to monitor resource utilization or debug a failed actor, as the UI often triggers a “Page Unresponsive” error or the GCS seems to struggle with maintaining the socket connection.

It feels like the background process is competing for the same system resources—specifically CPU threads and memory bandwidth—that the Ray head node needs for its internal scheduling and the dashboard’s real-time state management. I have a few related concerns for the distributed systems experts here:

I’m not sure whether the way the executor environment manages its internal threading is causing direct resource contention with Ray’s gRPC calls or with the local drivers used for secure socket communication. Has anyone else encountered performance bottlenecks or “Socket Connection” timeouts while running high-level script executors alongside a Ray head node? I’m also wondering whether there is a recommended way to lower the executor’s CPU priority so it doesn’t starve the resources needed for stable worker heartbeats and dashboard responsiveness.

I really need to keep this automation active to stay on top of my project logs, but the constant UI lag is making it difficult to keep an eye on my scaling tasks. If anyone has experience optimizing a professional workstation or a head node for concurrent usage of heavy script executors and Ray, I’d love to hear your advice!

Thanks for the help!

Your issue is likely due to resource contention between your background executor and the Ray head node, especially around CPU scheduling and memory bandwidth. When a heavy background process runs, it can delay Ray’s internal heartbeats and dashboard updates, leading to “Worker heartbeat timeout” errors and dashboard sluggishness even if the cluster isn’t fully loaded. This is a known problem: heavy CPU or disk I/O load from other processes can delay Ray’s heartbeat or dashboard threads, resulting in nodes being marked dead or the dashboard becoming unresponsive. Prioritizing Ray system processes (raylet, GCS server) using OS-level niceness or cgroup settings can help mitigate this by ensuring they get CPU time even under load. For example, you can run sudo renice -n -19 -p [raylet_pid] to raise the scheduling priority of a Ray process (negative niceness values require root), as suggested in this Ray issue.
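As a rough illustration of the renice approach above, here is a minimal Python sketch that looks up Ray's system processes and raises their priority. It assumes a Linux head node with pgrep available, and it assumes the daemons show up under the process names raylet and gcs_server; check with ps on your own machine first. The helper names (parse_pids, find_pids, prioritize) and the default niceness of -10 are made up for this example and are not part of Ray.

```python
import os
import subprocess

# Assumed process names for Ray's system daemons; verify with `ps` on your node.
RAY_SYSTEM_NAMES = ("raylet", "gcs_server")


def parse_pids(pgrep_output):
    """Turn pgrep's whitespace-separated output into a list of integer PIDs."""
    return [int(token) for token in pgrep_output.split()]


def find_pids(name):
    """Return PIDs whose process name matches `name` exactly (uses pgrep -x)."""
    try:
        result = subprocess.run(
            ["pgrep", "-x", name], capture_output=True, text=True
        )
    except FileNotFoundError:
        # pgrep is not installed; report no matches instead of crashing.
        return []
    return parse_pids(result.stdout)


def prioritize(pids, niceness=-10):
    """Set the niceness of each PID and return the PIDs that were changed.

    Lowering niceness below its current value requires root privileges.
    """
    changed = []
    for pid in pids:
        try:
            os.setpriority(os.PRIO_PROCESS, pid, niceness)
            changed.append(pid)
        except (PermissionError, ProcessLookupError) as exc:
            print(f"could not renice {pid}: {exc}")
    return changed


if __name__ == "__main__":
    for name in RAY_SYSTEM_NAMES:
        print(name, prioritize(find_pids(name)))
```

Run it with sudo after the head node has started. Priorities set this way do not survive a Ray restart, so the script would need to be rerun each time the daemons are relaunched.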

Additionally, for large or long-running clusters, it’s recommended to reserve extra resources for the head node and avoid running user workloads or heavy background jobs on it. Set num_cpus=0 for the head node (for example, ray start --head --num-cpus=0) to prevent Ray from scheduling tasks there, and consider running your background executor at a lower CPU priority (using nice or cpulimit) or pinning it to specific cores that Ray does not use. This helps keep Ray’s control plane and dashboard responsive and prevents worker heartbeats from being missed due to resource starvation. See Ray best practices for large clusters and heartbeat timeout discussions for more details.
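To make the "lower priority and pin to specific cores" suggestion concrete, here is a small Python sketch that launches a background command with increased niceness and, optionally, a restricted CPU affinity. It assumes a Unix-like system (core pinning via os.sched_setaffinity is Linux-only), and launch_low_priority is a hypothetical helper written for this example. The command in the usage section is a stand-in; substitute whatever actually starts your executor.

```python
import os
import subprocess
import sys


def launch_low_priority(cmd, niceness=10, cores=None, **popen_kwargs):
    """Start `cmd` with its niceness raised by `niceness` (lower priority).

    If `cores` is given and the platform supports it, the child is also
    pinned to that set of CPU core indices.
    """

    def demote():
        # Runs in the child just before exec, so only the child is affected.
        os.nice(niceness)
        if cores and hasattr(os, "sched_setaffinity"):
            os.sched_setaffinity(0, cores)

    return subprocess.Popen(cmd, preexec_fn=demote, **popen_kwargs)


if __name__ == "__main__":
    # Stand-in command; replace with the real background job.
    job = [sys.executable, "-c", "print('background job running')"]
    proc = launch_low_priority(job, niceness=10)
    proc.wait()
```

Passing something like cores={6, 7} would confine the job to the last two cores of an 8-core machine, leaving the rest for Ray's raylet, GCS server, and dashboard. Which cores to reserve is a judgment call that depends on your hardware.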

Would you like a step-by-step guide on how to set process priorities or isolate resources for your background executor and Ray head node?
