Hello Ray team,
I am looking for a way to get the number of queries assigned to each replica.
I work on a set of inference models that are called through HTTP routes and .remote
from Locust and ad-hoc scripts, to simulate the effects of user activity on different configurations (number of replicas, queue depth, etc.).
I am trying to supplement my load graphs with per-replica utilization graphs coming from Ray (they are nice to look at and help to eyeball issues).
I was not able to find a suitable API. So far, the solution that has worked for me is to inject code into serve.router.ReplicaSet._try_assign_replica that calls my ‘request counter’ agent
to increment a per-replica counter, and to decrement that counter in my actual handler once the task is done.
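For reference, here is a minimal sketch of what such a counter agent could look like, assuming it is a plain class later turned into a Ray actor with ray.remote(ReplicaRequestCounter). The class, its method names, and the replica_tag identifier are hypothetical and are not part of Ray Serve; the handler-side helper shows decrementing in a finally block so failed requests don't leave a replica's count inflated.

```python
from collections import defaultdict


class ReplicaRequestCounter:
    """Hypothetical per-replica request counter (not a Ray Serve API).

    Intended to run as a Ray actor, e.g. CounterActor = ray.remote(ReplicaRequestCounter).
    No lock is used: this assumes Ray's default of executing one actor
    method call at a time. Kept as a plain class so it runs without Ray.
    """

    def __init__(self):
        self._in_flight = defaultdict(int)  # queries assigned but not finished
        self._total = defaultdict(int)      # all queries ever assigned

    def increment(self, replica_tag: str) -> None:
        # Would be called from the router when a query is assigned to a replica.
        self._in_flight[replica_tag] += 1
        self._total[replica_tag] += 1

    def decrement(self, replica_tag: str) -> None:
        # Would be called from the replica's handler once the query is done.
        if self._in_flight[replica_tag] > 0:
            self._in_flight[replica_tag] -= 1

    def snapshot(self) -> dict:
        # Per-replica view suitable for plotting next to the load graphs.
        return {
            tag: {"in_flight": self._in_flight[tag], "total": self._total[tag]}
            for tag in self._total
        }


def handle_request(counter, replica_tag, work):
    """Run the actual work, always decrementing the counter afterwards.

    With a Ray actor handle this would be counter.decrement.remote(replica_tag).
    """
    try:
        return work()
    finally:
        counter.decrement(replica_tag)
```

Polling snapshot() periodically (with ray.get on the actor call) would give the per-replica utilization series; keeping in-flight and total counts separate lets you plot both current load and cumulative assignments.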
Changing the library code for this seems like an iffy solution, and monkey patching is not an option because the HTTP handler runs in a separate ASGI process that (AFAIK) I have no access to (in addition to every other reason why monkey patching is a bad idea).
The docs mention a serve_replica_queued_queries system metric (is it actually called backend_queued_queries_total?), but I am on Windows, so I currently have no dashboard or metrics export. I also tried to access the collected metric directly via the API but ran into issues.
Any recommendations?