Hello, I noticed that when I'm running a job on a Ray cluster on GKE (via KubeRay), the dashboard on the head node takes a very long time to load. I didn't schedule any workloads on the head node, but the 25 workers in the cluster were running at full CPU capacity.
When I checked the processes running on the head node, the dashboard was using >100% CPU. Please let me know if this is a Ray dashboard bug, or if there's anything I can do to work around this issue. Thanks!
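For reference, this is roughly how the per-process CPU usage on the head pod can be inspected (a minimal sketch assuming psutil is available in the head pod image, which it typically is since Ray depends on it; `top` or `ps` inside the pod show the same thing):

```python
import psutil

# List Ray dashboard-related processes on the head node and sample their CPU usage.
# Matching on "dashboard" in the command line is a heuristic, not an official API.
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "dashboard" in cmdline:
        # cpu_percent() with an interval samples usage over that window.
        print(proc.info["pid"], f"{proc.cpu_percent(interval=1.0):.0f}%", cmdline[:100])
```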
How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
The way the dashboard is currently implemented, there is a process on the head node that constantly aggregates data from the workers. We're undertaking a project to rebuild a lot of the observability stack, which should improve the performance of this in the coming months. @sangcho can provide more details about it.
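(If the UI isn't needed while the job runs, one possible stopgap is to not start the dashboard at all. Below is a minimal sketch assuming a head node started directly from Python with `ray.init`; when the head is started by KubeRay, the same switch would have to go through the head group's ray start parameters instead.)

```python
import ray

# Stopgap if the dashboard process itself becomes the bottleneck:
# start the head node without it. include_dashboard is a standard
# ray.init parameter.
ray.init(include_dashboard=False)

# The cluster works as usual; only the dashboard UI and its
# head-node aggregation process are absent.
print(ray.cluster_resources())
```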
@Dmitri, is there anything unique about KubeRay that may be making this issue worse when using k8s?
Actually, the dashboard issue I'm seeing must be something different. I don't see the high CPU usage; instead, the dashboard appears and then immediately goes white, and it happens in Ray 1.12 but not in Ray 1.11.
My head pod is a pretty decent size, with 20 GB of memory and 12 CPUs, so this isn't caused by a head pod resource constraint. I configured the head node not to schedule any workloads on it. It's the Python process that runs the dashboard application that is running at 100% CPU all the time. The job I'm running is quite big: it triggers 20K tasks in parallel across 25 workers (8 CPUs and 32 GB of memory each).
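For context, the shape of the workload is roughly the following (a minimal sketch; `work` is a hypothetical placeholder for the real per-task computation):

```python
import ray

# Connect to the existing cluster, e.g., from a driver running in the head pod.
ray.init(address="auto")

@ray.remote(num_cpus=1)
def work(i):
    # Hypothetical placeholder for the real per-task computation.
    return i * i

# Roughly the workload described above: ~20K tasks submitted at once and spread
# across the 25 workers. The head node advertises no CPUs for tasks, so nothing
# runs there, but the dashboard on the head node still has to track every task.
refs = [work.remote(i) for i in range(20_000)]
results = ray.get(refs)
print(len(results), "tasks finished")
```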
A temporary workaround may be to use an even bigger head node.
It sounds like maybe there’s some newly added logic in the dashboard that isn’t scaling well relative to the Ray workload.
In the latest versions, most of the slow dashboard issues should have been resolved. I believe it should be fixed from Ray 2.1 or 2.2 onward. Can you try those versions?