Why does Ray Serve only use half the number of replicas for parallelism?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Team,
I have a problem when using Ray Serve for my API deployment.
My computer has 8 physical CPU cores and 16 logical processors.
I assigned 4 replicas with:

@serve.deployment(num_replicas=4)

The dashboard shows that 6 processes are working (2 of them are used by Ray itself).

But when testing requests with Postman, I found that Ray can only process 2 requests in parallel. For example, if I send 3 requests at the same time, only 2 processors are working and the other 2 are idle.
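For what it's worth, overlap like this can also be measured from a script instead of Postman. Below is a small stdlib-only harness (the helper name `measure_max_concurrency` is mine, not from Ray); it simulates each request with `time.sleep`, and in a real test you would replace the worker with an HTTP call to the Serve endpoint:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def measure_max_concurrency(worker, n_requests):
    """Run `worker` n_requests times concurrently and report the peak
    number of calls that were in flight at the same moment."""
    lock = threading.Lock()
    state = {"current": 0, "peak": 0}

    def tracked():
        with lock:
            state["current"] += 1
            state["peak"] = max(state["peak"], state["current"])
        try:
            worker()
        finally:
            with lock:
                state["current"] -= 1

    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        for _ in range(n_requests):
            pool.submit(tracked)
    return state["peak"]


if __name__ == "__main__":
    # Simulated 0.3 s request; swap in a call to your Serve endpoint instead.
    print(measure_max_concurrency(lambda: time.sleep(0.3), 4))
```

If the endpoint truly handles 4 replicas in parallel, the peak should reach 4; a peak stuck at 2 reproduces the symptom described above.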

When I updated num_replicas to 6, only 3 processors are working and the other 3 are idle. It seems I can only use half of the CPUs that I configure.
Is there a mistake in my config or code?

PS: I set OMP_NUM_THREADS to match the degree of parallelism, but it doesn't help.

import os

from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(num_replicas=4, ray_actor_options={"num_cpus": 1, "num_gpus": 0})
# @serve.deployment
@serve.ingress(app)
class CutOptimize:
    def __init__(self):
        os.environ["OMP_NUM_THREADS"] = "4"
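As a side note, setting OMP_NUM_THREADS inside __init__ can be too late if the replica process has already imported a library that reads it at import time. A sketch of an alternative (untested here) is to pass it through the deployment's runtime_env so the variable is set before the replica process starts:

```python
# Sketch, assuming Ray Serve's runtime_env env_vars support:
# the variable is set in the replica's environment before user code runs.
from ray import serve

@serve.deployment(
    num_replicas=4,
    ray_actor_options={
        "num_cpus": 1,
        "num_gpus": 0,
        "runtime_env": {"env_vars": {"OMP_NUM_THREADS": "4"}},
    },
)
class CutOptimize:
    ...
```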

After testing, it seems this is caused by FastAPI.
When coding with FastAPI, twice the CPUs are needed to complete one task.
But using the raw __call__(), it works well:

def __call__(self, request: Request) -> Dict:
    return {"result": self._msg}

Hi @liu_meteorfall, glad you fixed the issue on your own! But would you mind sharing a simple script that reproduces the issue? (The team can then help diagnose it further.)