Hey guys, I’m new to using Ray and found the Dashboard to be very exciting - I started to set it up with Prometheus + Grafana by following this guide: Configuring and Managing Ray Dashboard — Ray 2.9.0
On my end I can successfully view the Dashboard, so it seems Grafana, Prometheus, and Ray all successfully work in conjunction.
The issue is that while data is reported for all the logical resources, most of the physical resources (Node CPU, Node Memory, Node Disk, etc.) show no data reported for them.
The only physical resource that has data is Node Count.
Would there be any reason the physical resource usage is missing? Viewing their usage was my main motivation to give Ray Dashboard a try.
Ray doesn’t do anything special here. It collects physical resource usage with some standard libraries and exposes the metrics on the corresponding ports for Prometheus to scrape.
- Can you check whether the real-time view of physical resource usage shows up on the Ray Dashboard’s cluster page?
- Can you check whether those metrics show up in Prometheus and Grafana?
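One way to narrow down the Prometheus check is to look at the raw text that a node's metrics export port returns and see which metric names are present. The snippet below is only a rough sketch: it parses text in the Prometheus exposition format and reports which expected names are missing. The metric names are the ones mentioned in this thread; the sample text (including its labels and values) is made up for illustration, and fetching the real text from your node is left out because the port depends on your setup.

```python
import re

# Metric names mentioned in this thread; adjust to whatever you expect to see.
EXPECTED = ["ray_resources", "ray_node_mem_used", "ray_component_cpu_percentage"]


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line starts with the metric name, followed by labels or a value.
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that do not appear in the text."""
    found = metric_names(exposition_text)
    return [name for name in expected if name not in found]


# Hypothetical sample output, NOT copied from a real Ray node.
SAMPLE = """\
# HELP ray_resources Logical Ray resources
# TYPE ray_resources gauge
ray_resources{Name="CPU",State="USED"} 2.0
ray_component_cpu_percentage{Component="raylet"} 0.0
"""

if __name__ == "__main__":
    print(missing_metrics(SAMPLE))
```

If a physical metric is missing from the raw export text, the problem is on the collection side rather than in the Prometheus or Grafana configuration.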
In the cluster page, the physical resource usage for CPU shows 0%, and the Memory is simply left blank (shows nothing).
The Grafana plots match my dashboard’s plots exactly it seems. The physical resources are all still missing (besides Node Count).
Prometheus is a bit more involved, but I believe the physical metrics are missing too.
In the temp folder each ray session creates, one can find a default_grafana_dashboard.json.
In that file, logical cpu is calculated from ray_resources, which I can successfully query on Prometheus.
Also in that file, physical memory is calculated from ray_node_mem_used, which I cannot query (empty query result), and physical CPU is calculated from ray_component_cpu_percentage, which evaluates to 0.
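For anyone who wants to repeat these checks without clicking through the Prometheus UI, here is a minimal sketch that uses the Prometheus HTTP API's instant query endpoint (`/api/v1/query`). The base URL `http://localhost:9090` is an assumption (the usual Prometheus default), and `build_query_url`, `classify`, and `check` are hypothetical helper names, not part of Ray or Prometheus.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def classify(response):
    """Summarize a decoded /api/v1/query response for an instant vector."""
    results = response.get("data", {}).get("result", [])
    if not results:
        return "empty"
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    values = [float(sample["value"][1]) for sample in results]
    return "all zero" if all(v == 0.0 for v in values) else "has data"


def check(base_url, expr):
    """Query a running Prometheus server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=5) as resp:
        return classify(json.load(resp))


if __name__ == "__main__":
    # Assumed address; change it to wherever your Prometheus is listening.
    for name in ("ray_resources", "ray_node_mem_used", "ray_component_cpu_percentage"):
        print(name, check("http://localhost:9090", name))
```

With the symptoms described above, one would expect ray_node_mem_used to come back as "empty" and ray_component_cpu_percentage as "all zero".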
It seems that the physical usage was not collected properly. What machines, OS, and Ray version are you using?
@sangcho can you help with it?
Probably related to Windows.
If you look at dashboard.log and dashboard_agent.log, are there any stack traces?
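If the logs are long, a small script can pull out the lines worth reading. This is just a sketch: the marker strings are a guess at what an error or stack trace typically looks like in a Python log, the sample text is invented, and the log directory has to be supplied by you (on Linux it is usually under /tmp/ray/session_latest/logs; on Windows the session folder lives elsewhere).

```python
from pathlib import Path

# Substrings that usually indicate an error or a Python stack trace (assumed, not exhaustive).
MARKERS = ("Traceback (most recent call last)", "ERROR", "Error")


def suspicious_lines(log_text):
    """Return (line_number, line) pairs for lines that contain any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.strip()))
    return hits


def scan_logs(log_dir, names=("dashboard.log", "dashboard_agent.log")):
    """Scan the named log files in log_dir, skipping files that do not exist."""
    report = {}
    for name in names:
        path = Path(log_dir) / name
        if path.is_file():
            report[name] = suspicious_lines(path.read_text(errors="replace"))
    return report


# Invented sample, only to show what the function picks up.
SAMPLE = """\
INFO starting agent
ERROR something failed
Traceback (most recent call last):
  File "example.py", line 1, in <module>
ValueError: example
"""

if __name__ == "__main__":
    for number, line in suspicious_lines(SAMPLE):
        print(number, line)
```

Calling `scan_logs` with the path to your session's logs folder returns a dictionary keyed by file name, so an empty list means no marker was found in that file.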
In the dashboard_agent.log there is an error with “publishing node physical stats.”
After a quick search it turns out to be the same exact error described here: [Dashboard] shows nothing · Issue #41081 · ray-project/ray · GitHub
That should be fixed though. Can you follow up in that issue and ask? @matt7