Why does Ray Serve only use half the configured number of replicas for parallelism?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Team,
I have a problem using Ray Serve for my API deployment.
My computer has 8 physical CPU cores and 16 logical processors.
When I assigned 4 replicas in the deployment config,


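The code snippet seems to have been lost from the post; below is a minimal reconstruction of what such a deployment config might look like, assuming Ray Serve 2.x. The class name follows the snippet later in the thread, and `num_replicas=4` follows the text; both are assumptions, not the OP's actual code.

```python
# Hypothetical reconstruction of the missing snippet (Ray Serve 2.x API).
from ray import serve


@serve.deployment(num_replicas=4)  # request 4 replica processes
class CutOptimize:
    def __init__(self):
        ...  # load models / state here


# On a running Ray cluster, the deployment would be started with:
# serve.run(CutOptimize.bind())
```

No `<test>` is attached since this fragment requires a running Ray cluster to execute.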
The dashboard showed that 6 processes were running (2 of them used by Ray itself).

But when testing requests with Postman, I found that Ray could only process 2 requests in parallel. For example, when I sent 3 requests at the same time, only 2 processors were working; the other 2 were idle.

When I updated num_replicas to 6, only 3 processors were working and the other 3 were idle. It seems that I can only use half of the CPUs that I configure.
Is there a mistake in my config or code?

PS: I set OMP_NUM_THREADS to match the desired parallelism, but it didn't help.
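For reference, OMP_NUM_THREADS only takes effect if it is set in the environment before the worker processes start; a minimal sketch (the value `1` here is purely illustrative, and note that Ray itself typically sets OMP_NUM_THREADS for its workers unless you override it):

```python
import os

# Set before starting Ray / Serve, so replica worker processes inherit it.
os.environ["OMP_NUM_THREADS"] = "1"

print(os.environ["OMP_NUM_THREADS"])
```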

@serve.deployment
class CutOptimize:
    def __init__(self):
        ...

After testing, it seems the issue is caused by FastAPI.
When coding with FastAPI, twice the CPUs are needed to complete one task.
But with a raw `__call__()` handler, it works as expected.

def __call__(self, request: Request) -> Dict:
    return {"result": self._msg}
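For readers comparing the two styles, here is a hedged side-by-side sketch, assuming Ray Serve's FastAPI integration (`@serve.ingress`); class names and the response payload are illustrative, not the OP's actual code:

```python
# Sketch of the two handler styles discussed above (Ray Serve 2.x).
from fastapi import FastAPI
from ray import serve
from starlette.requests import Request

app = FastAPI()


# Style 1: FastAPI ingress -- the style the OP observed using extra CPU per request.
@serve.deployment
@serve.ingress(app)
class CutOptimizeFastAPI:
    @app.get("/")
    def handle(self) -> dict:
        return {"result": "ok"}


# Style 2: raw __call__ handler -- the style that worked as expected for the OP.
@serve.deployment
class CutOptimizeRaw:
    async def __call__(self, request: Request) -> dict:
        return {"result": "ok"}
```

No `<test>` is attached since running either deployment requires a Ray cluster.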

Hi @liu_meteorfall, glad you fixed the issue on your own! But do you mind sharing a simple script to reproduce the issue? (The team can help you diagnose it further.)