Grafana Dashboard Issues

adb_kwal · August 20, 2024, 9:29pm

I’m running KubeRay on AKS and trying to debug some unexpected node crashes under load, so I’m trying to get all the observability I can, but I’m running into some issues.

I’ve gone through this guide: Using Prometheus and Grafana — Ray 2.34.0.

Most things seem to be working OK, with just a couple of issues:

First, I’m having trouble with the Node Count panel not getting any data. Digging into the underlying query reveals it’s using the autoscaler_active_nodes metric.

However, I don’t have any such metric available. I can find ray_cluster_active_nodes and ray_cluster_pending_nodes, but I’m not sure of the consequences of using those instead of the autoscaler metrics, or whether there is even a difference.

If there is a config option to enable exporting autoscaler metrics I am unable to find it.
(I’m using autoscaler v1)
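
For anyone trying to reproduce this, here is a minimal sketch of how the exported metric names can be listed. It uses the standard Prometheus HTTP API endpoint for label values; the Prometheus URL in the usage note and the helper names are placeholders for illustration, not something taken from the Ray guide.

```python
import json
import urllib.request


def fetch_metric_names(prometheus_url):
    """Return every metric name Prometheus currently knows about.

    Uses the Prometheus HTTP API endpoint /api/v1/label/__name__/values.
    prometheus_url is a placeholder such as "http://localhost:9090".
    """
    url = prometheus_url.rstrip("/") + "/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]


def filter_by_prefix(names, prefixes):
    """Keep only the metric names starting with one of the given prefixes."""
    return sorted(n for n in names if n.startswith(tuple(prefixes)))
```

Calling filter_by_prefix(fetch_metric_names("http://localhost:9090"), ["autoscaler_", "ray_cluster_"]) should show whether any autoscaler_* series are being scraped at all, or only the ray_cluster_* ones.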

Second, the ‘Scheduled Actor State’ panel seems to be off.

During this time period no actors were active, as is reported on the Active Actors by Name panel.

Another example is the ‘Scheduled Actor State’ panel showing 5 actors with dependencies_unready status, but during this time period all we have is 5 dead actors reported by the cluster in the dashboard. (I’m only allowed 1 screenshot as I’m new here.)

It’s confusing that these 2 panels disagree, because they use the same underlying metric. I’m not sure what I’m missing here.
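
One way to compare the two panels is to take the raw result of an instant query on the shared metric and total it per label value outside Grafana. The sketch below assumes the standard Prometheus instant-query result format; the label name "State" and the sample series are assumptions for illustration, so the real label should be read from the panel’s query.

```python
from collections import defaultdict


def sum_by_label(result, label):
    """Total the sample values of an instant-query result per label value.

    `result` is the "result" list from a Prometheus /api/v1/query response
    with resultType "vector": each entry has a "metric" dict of labels and a
    "value" pair of [timestamp, "value-as-string"].
    """
    totals = defaultdict(float)
    for series in result:
        key = series["metric"].get(label, "<missing>")
        totals[key] += float(series["value"][1])
    return dict(totals)


# Hypothetical sample data, not output from a real cluster.
sample_result = [
    {"metric": {"State": "DEAD", "Name": "a"}, "value": [1724189340, "3"]},
    {"metric": {"State": "DEAD", "Name": "b"}, "value": [1724189340, "2"]},
    {"metric": {"State": "ALIVE", "Name": "a"}, "value": [1724189340, "0"]},
]
per_state = sum_by_label(sample_result, "State")
```

If the per-state totals computed this way match one panel but not the other, the difference is probably in how that panel’s query filters or aggregates the series rather than in the metric itself.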

Any insights or advice would be greatly appreciated. I’m still quite new to Ray.
