Performance of first real remote call

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

While analyzing a flow with OpenTelemetry traces, we noticed that the first real remote call to an actor takes a long time (700 ms). Before it we make one other remote call, just to confirm that the actor is up.
The image shows two flows: the first is the first real call, and the second is from one of the following spans.
We see around 2 such slow calls per batch of new actors. The only explanation I can think of is that we have 2 threads that call remote actor functions.

Our driver is outside the cluster

I need to understand what this /ray.rpc.rayletdriver/getobject span is, so that I can be sure it won't affect the overall performance.

Span data:

Span name:

/ray.rpc.rayletdriver/getobject

Tags

otel.library.name opentelemetry.instrumentation.grpc
otel.library.version 0.38b0
otel.scope.name opentelemetry.instrumentation.grpc
otel.scope.version 0.38b0
rpc.grpc.status_code 0
rpc.method GetObject
rpc.service ray.rpc.RayletDriver
rpc.system grpc
service.name driver process name
telemetry.auto.version 0.38b0
telemetry.sdk.language python
telemetry.sdk.name opentelemetry
telemetry.sdk.version 1.17.0

Thanks

Hey @shiranbi - thanks for the great question. Appreciate the details there.

My current guess is that the high latency you observed on the first call might be due to the actor still being created, or to the local client being set up on the first invocation.

Is the first remote call still slow if:

  1. You wait for some time, so that the actor has enough time to be created asynchronously on the client side, or
  2. You make a dummy remote call to run a task first.
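
To compare the two cases, something like the following rough sketch might help. It only uses the standard library and times whatever callable you pass in; the helper names `time_calls` and `first_call_overhead` are made up for this example and are not part of Ray.

```python
import statistics
import time
from typing import Callable, List


def time_calls(call: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `call` `repeats` times in a row and return each latency in ms."""
    latencies_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()  # e.g. a blocking remote call; see the note below
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def first_call_overhead(latencies_ms: List[float]) -> float:
    """Return the first latency minus the median of the later ones."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two measurements")
    return latencies_ms[0] - statistics.median(latencies_ms[1:])
```

With Ray, `call` could be something like `lambda: ray.get(actor.do_work.remote())`, where `actor` and `do_work` stand in for your own actor handle and method. Running it once right after the dummy call and once after a short sleep should show whether the extra latency follows the first real call or goes away with time.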

Hi @rickyyx
This is the first real call, not the first actual remote call.
The first one is a dummy remote call to make sure that the actor has finished being created (which was also my first guess when I started debugging this).

I would like to debug this with a breakpoint, but I can't work out where to put it. Could you point me to the right place?

Thanks