Why does Ray Serve use only half of the replicas for parallelism?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi Team,
I have a problem when using Ray Serve for my API deployment.
My computer has 8 physical CPU cores and 16 logical processors.
I assigned 4 replicas with:


The dashboard shows that 6 processes are running (2 of them are used by Ray itself).

But when testing requests with Postman, I found that Ray can only process 2 requests in parallel. For example, when I send 3 requests at the same time, only 2 processes work at the same time and the other 2 are idle.

When I updated num_replicas to 6, only 3 processes were working and the other 3 were idle. It seems that I can use only half of the CPUs that I configure.
Is there a mistake in my config or code?
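The observation above ("3 requests sent, only 2 handled at once") can be checked numerically rather than by watching the dashboard. The following is a minimal sketch, assuming each request yields a (start, end) pair of timestamps in seconds; the function name `peak_concurrency` and the sample numbers are hypothetical and not part of the original code.

```python
from typing import Iterable, Tuple


def peak_concurrency(intervals: Iterable[Tuple[float, float]]) -> int:
    """Return the largest number of intervals that overlap at any instant.

    Each interval is a hypothetical (start, end) pair in seconds, for
    example taken from the timestamps that each request handler reports.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a request starts being processed
        events.append((end, -1))    # a request finishes
    # Sort by time; at equal times, handle finishes (-1) before starts (+1)
    # so that back-to-back requests are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Three requests sent together: two served in parallel, the third queued.
print(peak_concurrency([(0.0, 5.0), (0.0, 5.0), (5.0, 10.0)]))  # prints 2
```

If the value returned for a batch of simultaneous requests is lower than num_replicas, the replicas are not all being used at once.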

PS: I set OMP_NUM_THREADS to match the desired parallelism, but it didn’t help.

# @serve.deployment
class CutOptimize:
    def __init__(self):

After testing, it seems that this is caused by FastAPI.
When the endpoint is written with FastAPI, twice as many CPUs are needed to complete one task.
But with a raw __call__(), it works well.

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}

Hi @liu_meteorfall, glad you fixed the issue on your own! Would you mind sharing a simple script to reproduce the issue? (The team can help you diagnose further.)

  1. Simple case using FastAPI with Ray Serve (class TestFastApi):

import ray
from ray import serve
from fastapi import FastAPI,Form
import os
import time

app = FastAPI()
origins = [

ray.init(address="auto",_node_ip_address='', namespace="serve")

# @serve.deployment
class TestFastApi:
    def __init__(self):

    async def test(self):
        startTime = time.time()
        receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
        pid = os.getpid()
            # for i in range(150000):
            #     print(i)

            return {

        except Exception as e:
            return {

  2. Simple case using the raw __call__ with Ray Serve (class TestApi):
import ray
from ray import serve
import os
import time
from starlette.requests import Request

ray.init(address="auto",_node_ip_address='', namespace="serve")

class TestApi:
    def __init__(self):
    async def __call__(self, request: Request) -> str:

            #params: str = await request.json()
            startTime = time.time()
            receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
            pid = os.getpid()
            # API parameters

            # for i in range(150000):
            #     print(i)

            return {

        except Exception as e:
            return {

  1. We use “time.sleep(5)” to simulate a long-running task and assign 2 CPUs for parallelism.
  2. We deploy the two cases above to the server to compare the timing results.
  3. We use Postman to send two requests to the raw-call API within a very short time. The result is:

    The 2 CPUs work in parallel as expected.
  4. We use Postman to send two requests to the FastAPI case within a very short time. The result is:

  5. Is there a mistake in my FastAPI case code?
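Sending the two requests "in a very short time" by hand with Postman is hard to repeat exactly. The following is a minimal sketch of a scripted alternative using only the Python standard library. It assumes the endpoint answers plain GET requests, which the original snippets do not confirm; the helper names `timed_get` and `fire_concurrently` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def timed_get(url: str) -> dict:
    """Send one GET request and record when it was sent and how long it took."""
    sent = time.time()
    with urlopen(url, timeout=60) as resp:
        body = resp.read().decode("utf-8")
    return {"sent": sent, "elapsed": time.time() - sent, "body": body}


def fire_concurrently(url: str, n: int = 2) -> list:
    """Send n GET requests at (nearly) the same time, one thread per request."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves input order and waits for every request to finish.
        return list(pool.map(timed_get, [url] * n))
```

Calling `fire_concurrently(url, n=2)` with the route your deployment actually exposes returns one record per request. With a 5-second simulated task, elapsed times of about 5 s for both requests suggest parallel handling, while about 5 s and 10 s suggest the second request waited for the first.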

By the way, when I trace the Ray Serve logs, it seems that with FastAPI the server uses one process to handle the request but responds from another process, so a single request may occupy 2 CPUs.
I am not sure about this guess.
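Separately from the process-level guess above, both test handlers are declared `async def` while the workload is simulated with `time.sleep(5)`. The sketch below does not involve Ray Serve or FastAPI at all. It only illustrates, inside a single Python process, that a blocking `time.sleep` in a coroutine stalls the event loop while `asyncio.sleep` does not. Whether this contributes to the behaviour reported here is an assumption, not a confirmed diagnosis, and the handler names are hypothetical.

```python
import asyncio
import time


async def blocking_handler(delay: float) -> None:
    # time.sleep blocks the event loop thread: no other coroutine in this
    # process can make progress until it returns.
    time.sleep(delay)


async def cooperative_handler(delay: float) -> None:
    # asyncio.sleep yields to the event loop, so other coroutines can run.
    await asyncio.sleep(delay)


async def run_two(handler, delay: float) -> float:
    """Run two handler calls concurrently and return the wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(handler(delay), handler(delay))
    return time.perf_counter() - start


if __name__ == "__main__":
    delay = 0.5  # shortened stand-in for the 5-second sleep used in the test
    print("blocking:   ", round(asyncio.run(run_two(blocking_handler, delay)), 2))
    print("cooperative:", round(asyncio.run(run_two(cooperative_handler, delay)), 2))
```

With the blocking version, the two calls take roughly twice the delay in total. With the cooperative version they take roughly one delay. If a load test is meant to measure parallelism across replicas, it helps to know which of the two behaviours the simulated workload has inside each replica.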