How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Problem
Are there any existing metrics that cover job-related Ray API performance? For example, actor/task call latency, or the latency and argument size of ray.get/ray.put calls?
If not, is this information captured anywhere else for offline analysis, such as in the output logs of the raylet or the GCS?
My goal is to profile all Ray API performance in an application-agnostic way.
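For context, the kind of application-agnostic profiling I have in mind is a transparent wrapper around the Ray API itself, so the application code never changes. A minimal pure-Python sketch of the idea (a hypothetical `fake_get` stands in for the real `ray.get` so the snippet runs without a cluster):

```python
import time
from collections import defaultdict

# Hypothetical stand-in for ray.get; a real profiler would wrap ray.get itself.
def fake_get(ref):
    return ref

# API name -> list of observed call latencies in seconds.
latency_log = defaultdict(list)

def profiled(name, fn):
    """Return a wrapper that records each call's latency without
    requiring any change to the application code that calls `fn`."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latency_log[name].append(time.perf_counter() - start)
    return wrapper

# Patch the API once; application code keeps calling `get` as before.
get = profiled("ray.get", fake_get)

for i in range(3):
    get(i)

print(len(latency_log["ray.get"]))  # 3 recorded latencies
```

This captures per-call latency at the API boundary, which is exactly the information (ray.get/put latency, call counts) that per-component gRPC metrics do not expose per API call.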
Potential solutions
Ray metrics
Ray exports detailed and informative metrics, documented in the Ray metrics doc. The default exported metrics include the latencies of Ray's core components, for example:
...
# HELP ray_grpc_server_req_finished_total Finished request number in grpc server
# TYPE ray_grpc_server_req_finished_total counter
ray_grpc_server_req_finished_total{Component="core_worker",Method="CoreWorkerService.grpc_server.GetCoreWorkerStats",NodeAddress="127.0.0.1",Version="1.13.0"} 484.0
# HELP ray_grpc_server_req_process_time_ms Request latency in grpc server
# TYPE ray_grpc_server_req_process_time_ms gauge
ray_grpc_server_req_process_time_ms{Component="gcs_server",Method="NodeResourceInfoGcsService.grpc_server.GetResources",NodeAddress="127.0.0.1",Version="1.13.0"} 0.088355
...
The doc above also includes an example of application-level metrics. However, that approach is intrusive to the application code, and it may be difficult to measure Ray-related performance accurately with it (e.g., actor/task start-up latency).
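To illustrate why the application-level route is intrusive: the measurement code has to live inside the application's own functions. A sketch of the pattern (a plain dict stands in for a `ray.util.metrics` histogram so the snippet runs without a Ray cluster; `process_batch` is a hypothetical application function):

```python
import time

# Stand-in for an exported application-level histogram metric.
batch_latency_ms = {"count": 0, "sum": 0.0}

def process_batch(items):
    """Hypothetical application function. The timing and metric
    bookkeeping are woven into the business logic itself, which is
    what makes this approach intrusive."""
    start = time.perf_counter()
    result = [x * 2 for x in items]                  # actual application work
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    batch_latency_ms["count"] += 1                   # metric code in app code
    batch_latency_ms["sum"] += elapsed_ms
    return result

print(process_batch([1, 2, 3]))  # [2, 4, 6]
```

Note also that this pattern can only observe time spent inside the function body, so events that happen before user code runs (e.g., actor/task start-up latency) are invisible to it, which is the accuracy limitation mentioned above.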
Ray perf
I also took a look at the Ray microbenchmark in ray/python/ray/_private/ray_perf.py at master · ray-project/ray · GitHub. It shows that ray_perf.py measures latencies by timing the calls directly from the outside, timeit-style.
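That external-timing strategy can be sketched as follows (a no-op stand-in replaces the remote call so the snippet runs without Ray; the real benchmark loops over actual ray.get/remote calls and reports calls per second):

```python
import time

def noop_task():
    """Stand-in for a Ray operation such as ray.get(f.remote())."""
    return None

def benchmark(fn, n=10_000):
    """Measure throughput by timing a loop of calls from the outside,
    the same external-timing strategy ray_perf.py uses."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    elapsed = time.perf_counter() - start
    return n / elapsed  # calls per second

throughput = benchmark(noop_task)
print(f"{throughput:.0f} calls/s")
```

This yields aggregate throughput for a synthetic workload, but unlike an in-tree metric it cannot attribute latency to a real application's individual API calls, which is why it does not fully address the application-agnostic profiling goal above.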