How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Problem
Are there any existing metrics that cover job-related Ray API performance? For example, actor/task call latency, or the latency and argument size of ray.get/ray.put?
If not, is this information captured anywhere else for offline analysis, such as in the output logs of the raylet or the GCS?
My goal is to profile all Ray API performance in an application-agnostic way.
Potential solutions
Ray metrics
Ray exports detailed and informative metrics, as described in the Ray metrics doc. The metrics exported by default include the latencies of Ray's core components, e.g.:
...
# HELP ray_grpc_server_req_finished_total Finished request number in grpc server
# TYPE ray_grpc_server_req_finished_total counter
ray_grpc_server_req_finished_total{Component="core_worker",Method="CoreWorkerService.grpc_server.GetCoreWorkerStats",NodeAddress="127.0.0.1",Version="1.13.0"} 484.0
# HELP ray_grpc_server_req_process_time_ms Request latency in grpc server
# TYPE ray_grpc_server_req_process_time_ms gauge
ray_grpc_server_req_process_time_ms{Component="gcs_server",Method="NodeResourceInfoGcsService.grpc_server.GetResources",NodeAddress="127.0.0.1",Version="1.13.0"} 0.088355
...
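Metrics in this Prometheus exposition format can also be scraped and parsed for offline analysis. A minimal sketch in Python (not a full exposition-format parser; it ignores HELP/TYPE lines, escapes, and timestamps, and the sample line is taken from the output above):

```python
import re

# Match a sample line: metric_name{label="value",...} numeric_value
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+([0-9.eE+-]+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_metric(line):
    """Parse one Prometheus sample line into (name, labels, value)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    name, raw_labels, value = m.groups()
    labels = dict(LABEL_RE.findall(raw_labels))
    return name, labels, float(value)

sample = ('ray_grpc_server_req_process_time_ms{Component="gcs_server",'
          'Method="NodeResourceInfoGcsService.grpc_server.GetResources",'
          'NodeAddress="127.0.0.1",Version="1.13.0"} 0.088355')
name, labels, value = parse_metric(sample)
```

From here the samples can be grouped by Component/Method labels to track per-RPC latency over time.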
The doc above also includes an example of application-level metrics. However, that approach is intrusive to the application code, and it may be difficult to measure Ray-related performance accurately this way (e.g., actor/task start-up latency).
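A somewhat less intrusive variant of application-level measurement is to wrap calls at the application boundary with a timing decorator instead of instrumenting the function bodies. A minimal sketch in plain Python (the latencies dict is a placeholder sink, not a Ray API; a real setup would report into a metrics backend):

```python
import functools
import time
from collections import defaultdict

# Placeholder sink: latency samples (seconds) keyed by operation name.
latencies = defaultdict(list)

def timed(name):
    """Decorator that records the wall-clock latency of each call under `name`."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                latencies[name].append(time.perf_counter() - start)
        return inner
    return wrap

@timed("square")
def square(x):
    return x * x

result = square(3)
```

This still cannot see inside Ray (e.g., scheduling or start-up time is folded into one number), which is why purely application-side measurement stays coarse.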
Ray perf
I also took a look at the Ray microbenchmark in ray/ray_perf.py at master · ray-project/ray · GitHub. It shows that ray_perf.py measures the latencies by timing the calls directly with a timeit helper from the outside.
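That outside-in measurement style can be sketched without Ray using the standard-library timeit module; here a no-op function stands in for a remote call (this is an illustration of the approach, not the actual ray_perf.py code):

```python
import timeit

def noop():
    """Stand-in for the operation being benchmarked, e.g. a remote call."""
    pass

# Time many calls from the outside and report the mean latency per call,
# the same style of end-to-end measurement ray_perf.py uses.
n = 10000
total = timeit.timeit(noop, number=n)
per_call_us = total / n * 1e6  # mean microseconds per call
```

The drawback of this style is the same as above: it only yields end-to-end latency and cannot attribute time to individual Ray components.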