On my end I can successfully view the Dashboard, so it seems Grafana, Prometheus, and Ray all successfully work in conjunction.
The issue is that while data is reported for all the logical resources, most of the physical resources (Node CPU, Node Memory, Node Disk, etc.) show no data reported for them.
The only physical resource that has data is Node Count.
Would there be any reason the physical resource usage is missing? Viewing their usage was my main motivation to give Ray Dashboard a try.
Ray doesn’t do anything special here. It collects physical resource usage using some standard libraries and exposes it on the corresponding ports for Prometheus to scrape.
Can you check if you can see the real-time view of the physical resource usage in ray dashboard’s cluster page?
Can you check if you can see those metrics in prometheus and grafana?
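If it helps, one way to check Prometheus directly is to hit its HTTP query endpoint (/api/v1/query) and see whether a metric returns any samples. This is only a rough sketch: the address http://localhost:9090 is an assumption (adjust it to wherever your Prometheus runs), and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at its usual local address.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def has_samples(payload):
    """Return True if a Prometheus query response contains at least one sample."""
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0


def check_metric(promql, base_url=PROMETHEUS_URL):
    """Query Prometheus and report whether the expression returned data."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return has_samples(json.load(resp))
```

Calling check_metric("ray_resources") and then check_metric("ray_node_mem_used") should tell you whether the physical metrics are reaching Prometheus at all, or whether they only go missing later in Grafana.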
In the cluster page, the physical resource usage for CPU shows 0%, and the Memory is simply left blank (shows nothing).
The Grafana plots seem to match my dashboard’s plots exactly. The physical resources are all still missing (besides Node Count).
Prometheus is a bit more involved, but I believe the physical metrics are missing too.
In the temp folder each ray session creates, one can find a default_grafana_dashboard.json.
In that file, logical cpu is calculated from ray_resources, which I can successfully query on Prometheus.
Also in that file, physical memory is calculated from ray_node_mem_used, which I cannot query (the query result is empty), and physical CPU is calculated from ray_component_cpu_percentage, which returns 0.
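For anyone who wants to repeat this check, here is a rough sketch of how the metric names can be pulled out of default_grafana_dashboard.json so each one can be queried in Prometheus. It assumes the PromQL expressions are stored under "expr" keys somewhere in the JSON (the usual Grafana layout for Prometheus targets) and that Ray metric names start with ray_; the function names are my own.

```python
import json
import re

# Assumption: Ray metric names start with "ray_".
METRIC_PATTERN = re.compile(r"\bray_[A-Za-z0-9_]+")


def collect_exprs(node):
    """Recursively gather every string stored under an "expr" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_exprs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_exprs(item))
    return found


def metric_names(dashboard):
    """Return the sorted, de-duplicated metric names used by a dashboard."""
    names = set()
    for expr in collect_exprs(dashboard):
        names.update(METRIC_PATTERN.findall(expr))
    return sorted(names)


def metric_names_from_file(path):
    """Load a dashboard JSON file and list the metrics it references."""
    with open(path, encoding="utf-8") as handle:
        return metric_names(json.load(handle))
```

Running metric_names_from_file on the dashboard file in the session's temp folder gives the full list of metrics the panels depend on, which makes it easier to see which ones come back empty in Prometheus.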