How to trace Ray API performance with the Ray exporter?

yzs · September 26, 2022, 2:21pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Problem

Are there any existing metrics that cover job-related Ray API performance? For example, actor/task call latency, or the latency and argument size of ray.get/ray.put?

If not, are there any other approaches where this information is captured for offline analysis, such as in the output log of raylet or gcs?

My original goal is to profile all Ray API performance in an application-agnostic way.

Potential solutions

Ray metrics

Ray exports detailed metrics, as described in the Ray metrics doc. The default exported metrics include latencies of Ray's core components, for example:

...
# HELP ray_grpc_server_req_finished_total Finished request number in grpc server
# TYPE ray_grpc_server_req_finished_total counter
ray_grpc_server_req_finished_total{Component="core_worker",Method="CoreWorkerService.grpc_server.GetCoreWorkerStats",NodeAddress="127.0.0.1",Version="1.13.0"} 484.0
# HELP ray_grpc_server_req_process_time_ms Request latency in grpc server
# TYPE ray_grpc_server_req_process_time_ms gauge
ray_grpc_server_req_process_time_ms{Component="gcs_server",Method="NodeResourceInfoGcsService.grpc_server.GetResources",NodeAddress="127.0.0.1",Version="1.13.0"} 0.088355
...

There is an example of application-level metrics in the above doc. However, that approach is intrusive to the application code, and it may be difficult to measure Ray-related performance accurately with it (e.g., actor/task start-up latency).
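For offline analysis, the default exported metrics can be parsed without touching application code. Below is a minimal sketch, assuming the metrics are available as Prometheus-style text like the sample above. The helper names `parse_metric_line` and `parse_metrics` are hypothetical and not part of Ray, and the parser is deliberately simplified (it does not handle every corner of the exposition format).

```python
import re

# Sample lines copied from the exported metrics shown above.
SAMPLE = '''\
# HELP ray_grpc_server_req_finished_total Finished request number in grpc server
# TYPE ray_grpc_server_req_finished_total counter
ray_grpc_server_req_finished_total{Component="core_worker",Method="CoreWorkerService.grpc_server.GetCoreWorkerStats",NodeAddress="127.0.0.1",Version="1.13.0"} 484.0
# HELP ray_grpc_server_req_process_time_ms Request latency in grpc server
# TYPE ray_grpc_server_req_process_time_ms gauge
ray_grpc_server_req_process_time_ms{Component="gcs_server",Method="NodeResourceInfoGcsService.grpc_server.GetResources",NodeAddress="127.0.0.1",Version="1.13.0"} 0.088355
'''

# metric_name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metric_line(line):
    """Return (name, labels, value) for a sample line, or None otherwise."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # skip blank lines and HELP/TYPE comments
    match = LINE_RE.match(line)
    if match is None:
        return None  # skip anything that is not a metric sample
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def parse_metrics(text):
    """Parse every metric sample found in a block of exported text."""
    parsed = (parse_metric_line(line) for line in text.splitlines())
    return [row for row in parsed if row is not None]


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("Component"), labels.get("Method"), value)
```

This only recovers what Ray already exports (here, gRPC server request counts and processing times per component), so it does not by itself give actor/task call latency or ray.get/ray.put argument sizes.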

Ray perf

I also took a look at the Ray microbenchmark in ray/ray_perf.py at master · ray-project/ray · GitHub. ray_perf.py measures latencies by timing the calls directly from the outside with timeit.

rickyyx · September 27, 2022, 1:38am

Hey @yzs thanks for the question.

Are there any existing metrics that cover job-related Ray API performance? For example, actor/task call latency, or the latency and argument size of ray.get/ray.put?

For the actor/task call latency: not yet, but we are working on those right now! This should land in the nightly within weeks (definitely by the 2.2 release, which is scheduled in 2 months).

For ray.get, ray.put, and other Ray APIs: we currently don’t have explicit items on the roadmap, but we are open to adding them. However, would profiling with Ray work for your use case here?
