Collect metrics across clusters

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,
We spin up multiple Ray clusters to support our internal users. We’d like to expose the Ray system metrics (System Metrics — Ray 2.34.0) to these users.

We ingest all these metrics into a metrics DB service for efficient querying. To enable this, we need a way to uniquely distinguish the metrics coming from each cluster. How can I achieve this? In other words, is there a way to tag ALL the metrics for a given cluster with a UUID?

I explored using Ray namespaces (Using Namespaces — Ray 2.34.0) for this purpose and passed the namespace as an argument to ray.init, hoping it would be emitted as a label on the metrics. But that doesn’t seem to be happening.
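For reference, this is roughly what I tried (the namespace string below is just an illustrative value):

```python
import ray

# Namespaces scope named actors and jobs; as observed above, the namespace
# does not show up as a label on the exported Prometheus metrics.
ray.init(namespace="cluster-a1b2c3")
```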

Thanks.


Hi, SessionName is a label attached to every metric, and it’s a unique identifier per cluster. Hopefully that will work for you.
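For example, a rough sketch of scoping a query to one cluster by that label via the Prometheus HTTP API (the Prometheus URL, metric name, and session value here are placeholders for your own setup, not values Ray prescribes):

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder: your Prometheus endpoint
# Illustrative metric and SessionName value; substitute the ones from your cluster.
query = 'ray_node_cpu_utilization{SessionName="session_xxx"}'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()

# Each result carries the full label set plus the latest sample value.
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```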

One alternative strategy (something we do at Anyscale) is to ingest the metrics and apply labels on top that identify the cluster with a pre-determined ID. We use Vector to ingest metrics and configure it to apply a different cluster ID for each cluster.


Thanks for your response @aguo. However, I’m seeing several metrics that do not have this label. Some examples from a test Ray cluster are pasted below.

Yes, applying additional labels is a good idea. I’m exploring that with Prometheus’s metric relabeling feature.

> ray_heartbeat_report_ms_bucket{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId="",le="+Inf"} 389.0
> 
> ray_heartbeat_report_ms_count{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 389.0
> 
> ray_heartbeat_report_ms_sum{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 388084.00000000006
> 
> # HELP ray_internal_num_spilled_tasks The cumulative number of lease requeusts that this raylet has spilled to other raylets.
> 
> # TYPE ray_internal_num_spilled_tasks gauge
> 
> ray_internal_num_spilled_tasks{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 0.0
> 
> # HELP ray_internal_num_processes_started The total number of worker processes the worker pool has created.
> 
> # TYPE ray_internal_num_processes_started gauge
> 
> ray_internal_num_processes_started{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 1.0
> 
> # HELP ray_internal_num_infeasible_scheduling_classes The number of unique scheduling classes that are infeasible.
> 
> # TYPE ray_internal_num_infeasible_scheduling_classes gauge
> 
> ray_internal_num_infeasible_scheduling_classes{Component="raylet",JobId="",NodeAddress="10.86.49.51",Version="3.0.0.dev0",WorkerId=""} 0.0
> 
> # HELP ray_pull_manager_requests Number of pull requests broken per type {Queued, Active, Pinned}.
> 
> # TYPE ray_pull_manager_requests gauge
> 
> ray_pull_manager_requests{Component="raylet",JobId="",NodeAddress="10.86.49.51",Type="Queued",Version="3.0.0.dev0",WorkerId=""} 0.0
> 
> # HELP ray_pull_manager_requests Number of pull requests broken per type {Queued, Active, Pinned}.
> 
> # TYPE ray_pull_manager_requests gauge
> 
> ray_pull_manager_requests{Component="raylet",JobId="",NodeAddress="10.86.49.51",Type="Active",Version="3.0.0.dev0",WorkerId=""} 0.0
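One quick way to check which exported samples lack the label is to parse a node’s metrics endpoint directly. This is only a sketch: it assumes the node exports metrics on port 8080 (e.g. started with `ray start --metrics-export-port=8080`); adjust the address and port for your deployment.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumption: metrics export endpoint of the node from the paste above.
text = requests.get("http://10.86.49.51:8080/metrics").text

missing = set()
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if "SessionName" not in sample.labels:
            missing.add(sample.name)

# Metric names whose samples carry no SessionName label.
print(sorted(missing))
```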

What version of Ray are you using? I can’t find these metrics in the latest Ray release.