Collect metrics across clusters

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,
We spin up multiple Ray clusters to support our internal users. We’d like to expose the Ray system metrics (System Metrics — Ray 2.34.0) to these users.

We ingest all these metrics into a metrics DB service for efficient querying. To enable this, we need a way to uniquely distinguish the metrics coming from each cluster. How can I achieve this? In other words, is there a way to tag ALL the metrics for a given cluster with a UUID?

I explored using Ray namespaces (Using Namespaces — Ray 2.34.0) for this purpose and passed the namespace as an argument to ray.init, hoping it would be emitted as a label on the metrics. But that doesn’t seem to be happening.
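For reference, this is roughly what I tried (the namespace string below is just an illustrative value):

```python
import ray

# Namespaces scope named actors and jobs; as observed above, the namespace
# does not show up as a label on the exported Prometheus metrics.
ray.init(namespace="cluster-a1b2c3")
```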

Thanks.


Hi, SessionName is a label attached to every metric, and it’s a unique identifier per cluster. Hopefully that will work for you.
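For example, a rough sketch of scoping a query to one cluster by that label via the Prometheus HTTP API (the Prometheus URL, metric name, and session value here are placeholders for your own setup, not values Ray prescribes):

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder: your Prometheus endpoint
# Illustrative metric and SessionName value; substitute the ones from your cluster.
query = 'ray_node_cpu_utilization{SessionName="session_xxx"}'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()

# Each result carries the full label set plus the latest sample value.
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```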

One alternative strategy (something we do at Anyscale) is to ingest the metrics and apply labels on top that identify the cluster with a pre-determined ID. We use Vector to ingest metrics and configure it to apply a different cluster ID for each cluster.


Thanks for your response @aguo. However, I’m seeing several metrics that do not have this label. Some examples from a test Ray cluster are pasted below.

Yes, applying additional labels is a good idea. I’m exploring that with Prometheus’s metric relabeling feature.

> ray_heartbeat_report_ms_bucket{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId="",le="+Inf"} 389.0
> 
> ray_heartbeat_report_ms_count{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 389.0
> 
> ray_heartbeat_report_ms_sum{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 388084.00000000006
> 
> # HELP ray_internal_num_spilled_tasks The cumulative number of lease requeusts that this raylet has spilled to other raylets.
> 
> # TYPE ray_internal_num_spilled_tasks gauge
> 
> ray_internal_num_spilled_tasks{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 0.0
> 
> # HELP ray_internal_num_processes_started The total number of worker processes the worker pool has created.
> 
> # TYPE ray_internal_num_processes_started gauge
> 
> ray_internal_num_processes_started{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 1.0
> 
> # HELP ray_internal_num_infeasible_scheduling_classes The number of unique scheduling classes that are infeasible.
> 
> # TYPE ray_internal_num_infeasible_scheduling_classes gauge
> 
> ray_internal_num_infeasible_scheduling_classes{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 0.0
> 
> # HELP ray_pull_manager_requests Number of pull requests broken per type {Queued, Active, Pinned}.
> 
> # TYPE ray_pull_manager_requests gauge
> 
> ray_pull_manager_requests{Component="raylet",JobId="",NodeAddress="10.86.49.51",Type="Queued",Version="3.0.0.dev0",WorkerId=""} 0.0
> 
> # HELP ray_pull_manager_requests Number of pull requests broken per type {Queued, Active, Pinned}.
> 
> # TYPE ray_pull_manager_requests gauge
> 
> ray_pull_manager_requests{Component="raylet",JobId="",NodeAddress="10.86.49.51",Type="Active",Version="3.0.0.dev0",WorkerId=""} 0.0
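One quick way to check which exported samples lack the label is to parse a node’s metrics endpoint directly. This is only a sketch: it assumes the node exports metrics on port 8080 (e.g. started with `ray start --metrics-export-port=8080`); adjust the address and port for your deployment.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumption: metrics export endpoint of the node from the paste above.
text = requests.get("http://10.86.49.51:8080/metrics").text

missing = set()
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if "SessionName" not in sample.labels:
            missing.add(sample.name)

# Metric names whose samples carry no SessionName label.
print(sorted(missing))
```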

What version of Ray are you using? I can’t find these metrics in the latest Ray release.