How to trace Ray API performance with the Ray exporter?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Problem

Are there any existing metrics including job-related Ray API performance? For example, actor/task call latency, latency and size of arguments of ray.get/put?

If not, are there any other approaches where this information is captured for offline analysis, such as in the output log of raylet or gcs?

My original goal is to profile all Ray API performance in an application-agnostic way.

Potential solutions

Ray metrics

Ray exports detailed and informative metrics, as described in the Ray metrics doc. The metrics exported by default include the latencies of core Ray components, for example:

...
# HELP ray_grpc_server_req_finished_total Finished request number in grpc server
# TYPE ray_grpc_server_req_finished_total counter
ray_grpc_server_req_finished_total{Component="core_worker",Method="CoreWorkerService.grpc_server.GetCoreWorkerStats",NodeAddress="127.0.0.1",Version="1.13.0"} 484.0
# HELP ray_grpc_server_req_process_time_ms Request latency in grpc server
# TYPE ray_grpc_server_req_process_time_ms gauge
ray_grpc_server_req_process_time_ms{Component="gcs_server",Method="NodeResourceInfoGcsService.grpc_server.GetResources",NodeAddress="127.0.0.1",Version="1.13.0"} 0.088355
...
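For offline analysis, output like the above can be parsed with a few lines of Python. This is only a minimal sketch: `parse_samples` and `grpc_latencies_ms` are hypothetical helper names, not part of Ray, and the parser assumes simple `name{labels} value` lines without trailing timestamps.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples from Prometheus-style text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Not a sample line (for example, an elided "...").
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def grpc_latencies_ms(samples):
    """Map (Component, Method) to the reported gRPC request latency in ms."""
    return {
        (labels.get("Component"), labels.get("Method")): value
        for name, labels, value in samples
        if name == "ray_grpc_server_req_process_time_ms"
    }
```

Feeding the scraped text to `parse_samples` and then to `grpc_latencies_ms` gives a dictionary that can be dumped to a file per scrape, though it only covers what Ray already exports (internal gRPC latencies), not per-call API latency.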

The doc above also includes an example of application-level metrics. However, that approach is intrusive to the application code, and it may be difficult to measure Ray-related performance accurately with it (e.g., actor/task start-up latency).

Ray perf

I also took a look at the Ray microbenchmark in ray/ray_perf.py at master · ray-project/ray · GitHub. ray_perf.py measures latencies by timing the API calls directly from the outside with timeit.
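To illustrate what timing from the outside looks like, here is a rough sketch. It is not the code in ray_perf.py; `time_call` is a hypothetical helper, and the measured callable is whatever Ray call is of interest.

```python
import statistics
import time


def time_call(fn, repeat=100):
    """Call fn() `repeat` times and return latency statistics in milliseconds."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1e3)
    return {
        "mean_ms": statistics.fmean(durations),
        "p50_ms": statistics.median(durations),
        "max_ms": max(durations),
    }


# Example usage (assumes Ray is installed and ray.init() has been called):
#   payload = b"x" * 1024
#   print(time_call(lambda: ray.get(ray.put(payload))))
```

This measures the end-to-end latency seen by the caller, so it cannot separate scheduling, serialization, and transfer time, and it still requires wrapping each call site by hand.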

Hey @yzs, thanks for the question.

Are there any existing metrics including job-related Ray API performance? For example, actor/task call latency, latency and size of arguments of ray.get/put?

For the actor/task call latency: not yet, but we are working on those right now! They should land in the nightly within weeks (definitely by the 2.2 release, which is scheduled in 2 months).

For ray.get, ray.put, and other Ray APIs: we currently don’t have explicit items on the roadmap, but we are open to adding them. However, would profiling with Ray work for your use case here?