I’m running KubeRay on AKS and trying to debug some unexpected node crashes under load, so I want all the observability I can get, but I’m running into some issues.
I’ve gone through this guide: Using Prometheus and Grafana — Ray 2.34.0.
Most things seem to be working fine, but I’ve hit a couple of issues:
First, the Node Count pane isn’t getting any data. Digging into the underlying query shows that it uses the autoscaler_active_nodes metric.
However, I don’t have any such metric available. I can find ray_cluster_active_nodes and ray_cluster_pending_nodes, but I’m not sure what the consequences are of using those instead of the autoscaler metrics, or whether there is even a difference.
If there is a config option to enable exporting the autoscaler metrics, I haven’t been able to find it.
(I’m using autoscaler v1.)
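For reference, this is roughly how I checked which node-count metrics Prometheus actually has. It is a minimal sketch, not anything from the Ray docs: it lists metric names through the standard Prometheus HTTP API endpoint /api/v1/label/__name__/values and keeps the ones that look autoscaler- or cluster-related. The base URL http://localhost:9090 (a port-forwarded Prometheus) and the helper names are my own assumptions.

```python
import json
import urllib.request


def metric_names_url(base_url):
    """Build the Prometheus API URL that lists every known metric name."""
    return base_url.rstrip("/") + "/api/v1/label/__name__/values"


def filter_node_metrics(names):
    """Keep metric names that mention the autoscaler or start with ray_cluster_."""
    return sorted(
        n for n in names
        if "autoscaler" in n or n.startswith("ray_cluster_")
    )


def fetch_metric_names(base_url):
    """Fetch all metric names from Prometheus (needs network access to it)."""
    with urllib.request.urlopen(metric_names_url(base_url), timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("data", [])


# Example usage (assumes Prometheus is port-forwarded to localhost:9090):
#   names = fetch_metric_names("http://localhost:9090")
#   print(filter_node_metrics(names))
```

In my case the filtered list contains ray_cluster_active_nodes and ray_cluster_pending_nodes, but nothing with autoscaler in the name.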
Second, the ‘Scheduled Actor State’ panel seems to be off. During this time period no actors were active, as reported on the Active Actors by Name panel.
Another example: the ‘Scheduled Actor State’ panel shows 5 actors in dependencies_unready status, but during this time period the cluster dashboard reports only 5 dead actors. (I’m only allowed 1 screenshot because I’m new here.)
It’s confusing that these 2 panels disagree, because they use the same underlying metric. I’m not sure what I’m missing here.
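To compare the two panels outside Grafana, I’ve been summing the raw series per label value from a Prometheus instant query. This is only a sketch: the /api/v1/query endpoint and its response shape are standard Prometheus, but the metric name ray_actors and the label name State are my assumptions about what the panels query, and the sample payload is made up for illustration.

```python
import urllib.parse
from collections import defaultdict


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def totals_by_label(payload, label):
    """Sum the sample values of an instant-query response per value of `label`."""
    totals = defaultdict(float)
    for series in payload["data"]["result"]:
        key = series["metric"].get(label, "<missing>")
        # Each instant-vector sample is [timestamp, "value-as-string"].
        totals[key] += float(series["value"][1])
    return dict(totals)


# Made-up response in the Prometheus instant-query format (illustration only).
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"State": "DEAD", "Name": "A"}, "value": [0, "3"]},
            {"metric": {"State": "DEAD", "Name": "B"}, "value": [0, "2"]},
            {"metric": {"State": "ALIVE", "Name": "A"}, "value": [0, "0"]},
        ],
    },
}

print(totals_by_label(sample_payload, "State"))
```

Grouping the same response by Name instead of State should show whether the two panels really see the same series or are just filtering them differently.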
Any insights or advice would be greatly appreciated. I’m still quite new to Ray.