Hello, I noticed that when I'm running a job on a Ray cluster on GKE (via KubeRay), the dashboard on the head node takes a very long time to load. I didn't schedule any workloads on the head node, but the 25 workers in the cluster were running at full CPU capacity.
When I checked the processes running on the head node, the dashboard was using >100% CPU. Please let me know if this is a Ray dashboard bug, or if there's anything I can do to work around this issue. Thanks!
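For reference, this is roughly how the per-process CPU usage on the head pod can be inspected (a minimal sketch assuming psutil is available in the head pod image, which it typically is since Ray depends on it; `top` or `ps` inside the pod show the same thing):

```python
import psutil

# List Ray dashboard-related processes on the head node and sample their CPU usage.
# Matching on "dashboard" in the command line is a heuristic, not an official API.
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "dashboard" in cmdline:
        # cpu_percent() with an interval samples usage over that window.
        print(proc.info["pid"], f"{proc.cpu_percent(interval=1.0):.0f}%", cmdline[:100])
```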
How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
The way the dashboard is currently implemented, there is a process on the head node that constantly aggregates data from the workers. We're undertaking a project to rebuild a lot of the observability stack, which should improve the performance of this in the coming months. @sangcho can provide more details about it.
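(If the UI isn't needed while the job runs, one possible stopgap is to not start the dashboard at all. Below is a minimal sketch assuming a head node started directly from Python with `ray.init`; when the head is started by KubeRay, the same switch would have to go through the head group's ray start parameters instead.)

```python
import ray

# Stopgap if the dashboard process itself becomes the bottleneck:
# start the head node without it. include_dashboard is a standard
# ray.init parameter.
ray.init(include_dashboard=False)

# The cluster works as usual; only the dashboard UI and its
# head-node aggregation process are absent.
print(ray.cluster_resources())
```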
@Dmitri, is there anything unique about KubeRay that may be making this issue worse when using k8s?
Actually, the dashboard issue I'm seeing must be something different. I don't see the high CPU usage; instead, the dashboard appears and then immediately goes white, and it happens in Ray 1.12 but not in Ray 1.11.
My head pod is a pretty decent size, with 20 GB of memory and 12 CPUs, so this isn't caused by a head pod resource constraint. I configured the head node not to schedule any workloads on it. It's the Python process that runs the dashboard application that is running at 100% CPU all the time. The job I'm running is quite big: it triggers 20K tasks in parallel across 25 workers (8 CPUs and 32 GB of memory each).
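For context, the shape of the workload is roughly the following (a minimal sketch; `work` is a hypothetical placeholder for the real per-task computation):

```python
import ray

# Connect to the existing cluster, e.g., from a driver running in the head pod.
ray.init(address="auto")

@ray.remote(num_cpus=1)
def work(i):
    # Hypothetical placeholder for the real per-task computation.
    return i * i

# Roughly the workload described above: ~20K tasks submitted at once and spread
# across the 25 workers. The head node advertises no CPUs for tasks, so nothing
# runs there, but the dashboard on the head node still has to track every task.
refs = [work.remote(i) for i in range(20_000)]
results = ray.get(refs)
print(len(results), "tasks finished")
```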
A temporary workaround may be to use an even bigger head node.
It sounds like maybe there’s some newly added logic in the dashboard that isn’t scaling well relative to the Ray workload.
In the latest versions, most of the slow dashboard issues should have been resolved. I believe it should be fixed from Ray 2.1 or 2.2 onward. Can you try those versions?