Ray dashboard is hanging

Hello, I noticed that when I’m running a job on a Ray cluster on GKE (via KubeRay), the dashboard on the head node takes a very long time to load. I didn’t schedule any workloads on the head node, but the cluster’s 25 workers were running at full CPU capacity.

When I checked the processes running on the head node, the dashboard was using >100% CPU. Please let me know if this is a Ray dashboard bug, or if there’s anything I can do to work around the issue. Thanks!

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi Keshi_Dai,

The dashboard is currently implemented as a process on the head node that constantly aggregates data from the workers. We’re in the middle of a project to rebuild much of the observability stack, which should improve the dashboard’s performance in the coming months. @sangcho can provide more details about it.

@Dmitri , is there anything unique about kuberay which may be causing this issue to be worse when using k8s?

Thanks @aguo! That’s good to know. What’s the timeline for this? Is there any workaround for now?

@sangcho will be able to provide info about a timeline.

For workarounds, you can try using a smaller cluster or a bigger head node.
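For the “bigger head node” workaround on KubeRay, a sketch of the relevant part of a RayCluster spec might look like this (resource numbers are illustrative; `num-cpus: "0"` is the usual way to keep tasks off the head node, which @Keshi_Dai already seems to be doing):

```yaml
# Hypothetical excerpt of a KubeRay RayCluster headGroupSpec.
headGroupSpec:
  rayStartParams:
    num-cpus: "0"        # advertise zero CPUs so Ray schedules no tasks here
  template:
    spec:
      containers:
        - name: ray-head
          resources:
            requests:
              cpu: "12"      # extra CPU headroom for the dashboard process
              memory: 24Gi
            limits:
              cpu: "12"
              memory: 24Gi
```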


I just saw this myself with Ray 1.12.1. @Keshi_Dai, how big is your head pod?

Are you able to get a simple reproduction of the issue?

Large resource utilization from the dashboard is unfortunate for experiments in resource-constrained K8s environments (e.g. minikube).

Actually, the dashboard issue I’m seeing must be something different – I don’t see the high CPU usage. Instead, the dashboard appears and then immediately goes white in Ray 1.12, but not in Ray 1.11.


@Dmitri @aguo I’m on Ray 1.12.0

My head pod is a pretty decent size, with 20G of memory and 12 CPUs, so the issue isn’t caused by head-pod resource constraints. I configured my head node not to schedule any workloads on it. It’s the Python process running the dashboard application that is at 100% CPU all the time. The job I’m running is quite big: it triggers 20K tasks in parallel across 25 workers (each with 8 CPUs and 32G of memory).

Thanks for the context! I’ll try to reproduce the issue.
@sangcho is it possible that this is related to state observability work?

A temporary workaround may be to use an even bigger head node.
It sounds like there may be some newly added logic in the dashboard that isn’t scaling well with the size of the Ray workload.
