Monitoring hardware utilization of workers

Hi,

I am trying to figure out how I can monitor CPU/RAM/GPU/disk usage per node/worker/process in a Ray cluster.

I have two use cases in mind:

  1. I run distributed jobs using Dask on Ray, and I would be more than happy to get the hardware utilization of the Ray cluster while the Dask jobs are being processed.
  2. If 1 is not possible, I would be satisfied if I could get the utilization of whole workers/nodes.

The solution I have in mind is to start a new thread before starting Dask on Ray; the thread would gather Ray cluster metrics and log them to a file, MLflow, or whatever.
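
A minimal sketch of that background-thread idea, using only the standard library. `MetricsLogger`, `sample_fn`, and the JSON-lines output file are hypothetical names for illustration, not a Ray API; how `sample_fn` actually obtains the cluster metrics is left open here.

```python
import json
import threading
import time


class MetricsLogger(threading.Thread):
    """Hypothetical helper: polls sample_fn in the background and appends JSON lines."""

    def __init__(self, sample_fn, path, interval_s=5.0):
        super().__init__(daemon=True)
        self._sample_fn = sample_fn      # callable returning a dict of metric values
        self._path = path                # output file, one JSON object per line
        self._interval_s = interval_s
        self._stop_event = threading.Event()

    def run(self):
        with open(self._path, "a") as f:
            while not self._stop_event.is_set():
                record = {"ts": time.time(), **self._sample_fn()}
                f.write(json.dumps(record) + "\n")
                f.flush()
                # wait() returns early as soon as stop() sets the event
                self._stop_event.wait(self._interval_s)

    def stop(self):
        self._stop_event.set()
        self.join()
```

The logger would be started with `start()` right before submitting the Dask job and ended with `stop()` once the job has finished.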

I saw that the docs have an Exporting Metrics section, but it seems intended for custom metrics defined while using Ray. The Ray Dashboard already shows CPU/RAM/disk usage, so I wonder whether there is a ready-to-use API to get the metrics of cluster members?
Something like iterating over all Ray workers and calling a .get_metrics() method?

Finally, I would like to create a dashboard similar to this one: wandb-example, but with Ray Workers utilization plots.

@sangcho tagging you as you helped me recently and seems like you are everywhere here :wink:

The metrics export should export the hardware utilization per node by default!

Ray dashboard also contains this information.

@sangcho Thank you. The endpoint for Prometheus is indeed scrapable, so I assume I should scrape the metrics endpoints and extract the data from them? There is no other API for that, right?

I see that there is a MetricsAgent class in the dashboard module, so I will probably give it a try to make my life easier, unless there are other recommendations for handling metrics logging.

You can try Exporting Metrics — Ray 2.0.0.dev0! We have a way to make scraping a bit easier (Exporting Metrics — Ray 2.0.0.dev0)

Thanks for the suggestions, but the 2.0 version does not seem to solve my problem. The endpoint still needs to be scraped by Prometheus, and I don’t want to use Prometheus. I would like to use some Python class that queries the metrics endpoint, parses the response, and logs selected metrics to MLflow or another logging solution.
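
To make this concrete, here is a rough sketch of what I mean, using only the standard library. The function names are hypothetical, the endpoint URL has to be supplied by the caller, and the parser is deliberately simplified: it assumes label values contain no commas, escaped quotes, or `}` characters, and it ignores optional timestamps.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        labels = {}
        if "{" in line:
            name, rest = line.split("{", 1)
            label_str, value_str = rest.rsplit("}", 1)
            for pair in label_str.split(","):
                if pair:
                    key, val = pair.split("=", 1)
                    labels[key.strip()] = val.strip().strip('"')
        else:
            name, value_str = line.split(None, 1)
        # The first token is the value; an optional timestamp may follow it.
        samples.append((name, labels, float(value_str.split()[0])))
    return samples


def select_metrics(samples, prefix):
    """Keep only the samples whose metric name starts with the given prefix."""
    return [s for s in samples if s[0].startswith(prefix)]


def fetch_metrics(url, timeout_s=5.0):
    """Download the text served at a metrics endpoint URL and parse it."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Each selected `(name, labels, value)` tuple could then be passed to something like `mlflow.log_metric(name, value)` instead of being stored in Prometheus.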

Sorry, maybe I am wrong and I am missing something?

Hmm, I am not sure what you meant by logging selected metrics. But right now, the only way to export built-in metrics is to use Prometheus. cc @xwjiang2010 do you know more details about the MLflow case here?

I’d like to log the metrics to MLflow and not Prometheus. So, assuming the CPU/RAM/GPU etc. metrics are available to Prometheus, I’d like to extract the same metrics with a Python class and then log them from inside Python code.

Right now, the metrics are only exposed in the Prometheus format. There are potential workarounds:

  • Scrape the Prometheus endpoint directly and log the metrics to MLflow. You can access the Prometheus endpoint from the URL written in this file: Exporting Metrics — Ray 2.0.0.dev0
  • Scrape the metrics into Prometheus first and redirect them to MLflow.
  • Collect per-node stats on your own. One possible approach is to create an actor per node that collects CPU/memory/disk/GPU usage and logs it to MLflow (the code we use to collect metrics is here: ray/reporter_agent.py at master · ray-project/ray · GitHub)
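
For the last workaround, the collection step on each node could look roughly like the sketch below. It is a simplified illustration, not the code from reporter_agent.py: `collect_node_stats` is a hypothetical name, it uses only the standard library, and it does not cover memory or GPU usage, which would need an extra library. In practice this function would be called from inside one actor per node.

```python
import os
import shutil
import socket
import time


def collect_node_stats(disk_path="/"):
    """Take a coarse snapshot of the local machine's utilization (stdlib only)."""
    disk = shutil.disk_usage(disk_path)
    stats = {
        "ts": time.time(),
        "hostname": socket.gethostname(),
        "cpu_count": os.cpu_count(),
        "disk_used_fraction": disk.used / disk.total,
    }
    # os.getloadavg() only exists on Unix-like systems and may fail there too.
    if hasattr(os, "getloadavg"):
        try:
            stats["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return stats
```

The returned dict is flat, so each numeric entry can be logged as one metric, with the hostname used to tell the nodes apart.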

Also @xwjiang2010, please give us advice if there are other good ways to achieve this!