Hello! I’m checking both htop and the Ray Dashboard to monitor my VM resources, and I found that ray::ServeReplica:my_module:MyDeployment.handle_request_with_rejection is using a lot of RAM and CPU capacity. I’m using Serve to deploy some models.
I can’t find many references to this process and what it means, but searching for handle_request_with_rejection in the Ray codebase I found that it’s related to max_ongoing_requests. See here.
I’m not using autoscaling, since this replica holds a PyTorch model on the GPU, so there’s just a single replica for this deployment. Should I change max_ongoing_requests? I’m using Ray 2.10.
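For context, the deployment is set up roughly like this. This is a simplified sketch, not my real code: the model body is a placeholder, and I’ve kept the names from above (my_module / MyDeployment):

```python
from ray import serve
import torch


@serve.deployment(
    num_replicas=1,                     # no autoscaling: one replica owns the GPU
    ray_actor_options={"num_gpus": 1},  # the replica holds the PyTorch model
    # max_ongoing_requests=...          # currently left at the default; this is
    #                                   # the knob I'm asking about
)
class MyDeployment:
    def __init__(self):
        # placeholder for the real model load
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = torch.nn.Linear(8, 1).to(device)

    async def __call__(self, request) -> str:
        # placeholder for the real inference handler
        return "ok"


app = MyDeployment.bind()
```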
How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Hi @Augusto_Maillo, this is totally normal; it’s just the name of the actor method that Serve uses when executing a request. It will call your handler method or the FastAPI app.
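For illustration, here is a minimal sketch of the two handler styles that the replica ultimately dispatches into (the class names here are made up for the example, not from your code):

```python
from fastapi import FastAPI
from ray import serve

fastapi_app = FastAPI()


@serve.deployment
class PlainHandler:
    async def __call__(self, request) -> str:
        # plain-handler style: Serve invokes __call__ for each request
        return "ok"


@serve.deployment
@serve.ingress(fastapi_app)
class FastAPIHandler:
    @fastapi_app.get("/")
    def root(self) -> dict:
        # FastAPI-ingress style: Serve routes the request into the wrapped app
        return {"status": "ok"}
```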
Isn’t it a problem? Could increasing the actor queue size help me?
Edit: This actor runs a Torch model on the GPU, so it is naturally heavy. It’s hard for me to tell which load comes from model execution and which comes from misusing Ray.
If they are torch threads, shouldn’t they have the same PID?
Edit: My bad. The PID column in htop is not always the process ID. With the TGID column I can see that they are threads of the same process. Thank you for the help.
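For anyone else confused by this, here’s a quick generic-Python check (nothing Ray-specific) that threads share one process ID while their native thread IDs differ:

```python
# Threads live inside one process: os.getpid() is identical for all of them,
# while threading.get_native_id() (Python 3.8+) differs per thread. In htop,
# these per-thread IDs show up in the "PID" column unless you compare TGIDs.
import os
import threading


def report() -> None:
    print(f"PID={os.getpid()} native TID={threading.get_native_id()}")


threads = [threading.Thread(target=report) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```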