Monitoring hardware utilization of workers

mmww · February 22, 2022, 8:28pm

Hi,

I am trying to figure out how I can monitor CPU/RAM/GPU/disk usage, etc., per node/worker/process in a RayCluster.

I have two use cases in mind:

1. Running distributed jobs using Dask on Ray, where I would be more than happy to get the hardware utilization of the Ray cluster while the Dask jobs are being processed.
2. If 1 is not possible, I would be satisfied if I could get the utilization of each whole worker/node.

I picture the solution as starting a new thread before starting Dask on Ray, which would gather the Ray cluster metrics and log them to a file, MLflow, or whatever.
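The background-thread idea above could be sketched roughly as follows. This is only a minimal illustration using the standard library: `MetricsLogger` and its `collect` callable are hypothetical names, not part of Ray, and `collect` stands in for whatever ends up producing the per-node numbers.

```python
import json
import threading
import time


class MetricsLogger:
    """Calls `collect()` every `interval` seconds on a background thread
    and appends each result to `path` as one JSON line.

    `collect` is a hypothetical user-supplied callable returning a dict.
    """

    def __init__(self, collect, path, interval=5.0):
        self._collect = collect
        self._path = path
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        with open(self._path, "a") as f:
            while not self._stop.is_set():
                record = {"ts": time.time(), **self._collect()}
                f.write(json.dumps(record) + "\n")
                f.flush()
                # wait() returns early as soon as the stop event is set
                self._stop.wait(self._interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

The Dask-on-Ray job would then run inside a `with MetricsLogger(collect, "metrics.jsonl"):` block, so sampling starts before the job and stops once it finishes.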

I saw that the docs have an Exporting Metrics section, but it seems intended for custom metrics defined while using Ray. The Ray Dashboard already shows CPU/RAM/disk usage, so I wonder whether there is a ready-to-use API to get the metrics of the cluster members?
Something like iterating over all Ray workers and calling a .get_metrics() method?

Finally, I would like to create a dashboard similar to this one: wandb-example, but with Ray Workers utilization plots.

@sangcho tagging you as you helped me recently and seems like you are everywhere here

sangcho · February 24, 2022, 10:48pm

The metrics export should export the hardware utilization per node by default!

Ray dashboard also contains this information.

mmww · March 1, 2022, 5:51pm

@sangcho Thank you. The endpoint for Prometheus is indeed scrapable, so I assume I should scrape the metrics endpoint and extract the data from it? There is no other API for that, right?

I see that there is a MetricsAgent class in the dashboard module, so I will give it a try, since it will probably make my life easier, unless there are other recommendations for handling metrics logging.

sangcho · March 2, 2022, 10:28pm

You can try Exporting Metrics — Ray 2.0.0.dev0! We have a way to make scraping a bit easier (Exporting Metrics — Ray 2.0.0.dev0)

mmww · March 7, 2022, 12:27pm

Thanks for the suggestions, but the 2.0 version does not seem to solve my problem. Scraping the endpoint needs to be done with Prometheus, and I don’t want to use Prometheus. I would like to use some Python class that queries the metrics endpoint, parses the response, and logs selected metrics to MLflow or another logging solution.

Sorry, maybe I am wrong and I am missing something?

sangcho · March 7, 2022, 12:44pm

Hmm, I am not sure what you mean by "log selected metrics". But right now, the only way to export built-in metrics is to use Prometheus. cc @xwjiang2010 do you know more details about the MLflow case here?

mmww · March 7, 2022, 1:14pm

I’d like to log the metrics to MLflow and not Prometheus. So, assuming the CPU/RAM/GPU etc. metrics are available to Prometheus, I’d like to extract the same metrics with a Python class and then log them from within Python code.
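As a rough sketch of what such a Python helper might look like, the following fetches a Prometheus-format endpoint and keeps only selected metrics. It is a simplified stdlib-only parser that assumes label values contain no commas, braces, or escaped quotes. The function names are hypothetical, the URL has to be supplied by the caller, and the metric names in the usage note are illustrative rather than a confirmed list. The `prometheus_client` package also ships a more complete text parser.

```python
import urllib.request


def fetch_metrics_text(url, timeout=5.0):
    """Download the raw Prometheus text exposition from `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_prometheus_text(text):
    """Parse exposition text into (name, labels, value) tuples.

    Simplified: ignores timestamps and assumes label values contain
    no commas, braces, or escaped quotes.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        labels = {}
        if "{" in line:
            name, rest = line.split("{", 1)
            label_part, value_part = rest.rsplit("}", 1)
            for pair in label_part.split(","):
                if pair:
                    key, val = pair.split("=", 1)
                    labels[key.strip()] = val.strip().strip('"')
        else:
            name, value_part = line.split(None, 1)
        # the first token after the name/labels is the sample value
        value = float(value_part.split()[0])
        samples.append((name.strip(), labels, value))
    return samples


def select_metrics(samples, wanted_names):
    """Keep only the samples whose metric name is in `wanted_names`."""
    return [s for s in samples if s[0] in wanted_names]
```

For example, `select_metrics(parse_prometheus_text(fetch_metrics_text(url)), {"ray_node_cpu_utilization"})` would return the matching samples, and each value could then be passed to `mlflow.log_metric`.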

sangcho · March 8, 2022, 12:27am

Right now, the metrics are only exposed in the Prometheus format. There are some potential workarounds:

1. Scrape the Prometheus endpoint directly and log the values to MLflow. You can get the endpoint's URL from the file described here: Exporting Metrics — Ray 2.0.0.dev0
2. Scrape the metrics into Prometheus first and forward them to MLflow from there.
3. Collect per-node stats on your own. One possible approach is to create an actor per node that collects CPU/memory/disk/GPU usage and logs it to MLflow (the code we use to collect metrics is here: ray/reporter_agent.py at master · ray-project/ray · GitHub)
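For the third workaround, the per-node collection step might look something like the sketch below. It is a stdlib-only stand-in, not the code from reporter_agent.py: `node_snapshot` is a hypothetical helper, it reports only load average and disk usage, and a library such as psutil would be needed for proper CPU, memory, and GPU figures.

```python
import os
import shutil
import socket
import time


def node_snapshot(disk_path="/"):
    """Return a small dict of utilization figures for the local machine.

    Hypothetical helper: covers only what the standard library exposes.
    """
    disk = shutil.disk_usage(disk_path)
    snapshot = {
        "host": socket.gethostname(),
        "ts": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_fraction": disk.used / disk.total if disk.total else 0.0,
    }
    # getloadavg() is unavailable on Windows and may fail on some systems
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot
```

A function like this could be called periodically from a method of a `@ray.remote` actor, with one such actor placed on each node, and the returned values logged to MLflow from the driver.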

Also, @xwjiang2010, please give us some advice if there are other good ways to achieve this!
