Why does Ray Serve only use half the number of replicas for parallelism?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Team,
I have a problem when using Ray Serve for my API deployment.
My computer has 8 physical CPU cores and 16 logical processors.
I assigned 4 replicas with:

@serve.deployment(num_replicas=4)

The dashboard shows that 6 processes are working (2 of them are used by Ray itself).

But when I test requests with Postman, I find that Ray can only process 2 requests in parallel. For example, if I send 3 requests at the same time, only 2 processors are working while the other 2 are idle.

When I updated num_replicas to 6, only 3 processors were working and the other 3 were idle. It seems that I can only use half of the CPUs that I configure.
Is there any mistake in my config or code?

PS: I set OMP_NUM_THREADS to match the desired parallelism, but it doesn't help:

@serve.deployment(num_replicas=4, ray_actor_options={"num_cpus": 1, "num_gpus": 0})
@serve.ingress(app)
class CutOptimize:
    def __init__(self):
        os.environ["OMP_NUM_THREADS"] = "4"
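
One caveat worth noting: a value set inside __init__ may arrive after an OpenMP-backed library has already initialized, in which case it is ignored. A sketch of an alternative, assuming the Ray version in use supports runtime_env inside ray_actor_options, is to let Ray export the variable before the replica process starts:

# Sketch: assumes ray_actor_options accepts a runtime_env with env_vars,
# so OMP_NUM_THREADS is set before the replica process imports anything.
@serve.deployment(
    num_replicas=4,
    ray_actor_options={
        "num_cpus": 1,
        "num_gpus": 0,
        "runtime_env": {"env_vars": {"OMP_NUM_THREADS": "4"}},
    },
)
@serve.ingress(app)
class CutOptimize:
    def __init__(self):
        pass  # OMP_NUM_THREADS is already set for this process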

After testing, it seems this is caused by FastAPI.
When the handler is written with FastAPI, double the CPUs are needed to complete one task.
But using the raw __call__() handler, it works fine:

def __call__(self, request: Request) -> Dict:
    return {"result": self._msg}

Hi @liu_meteorfall, glad you fixed the issue on your own! But do you mind sharing a simple script to reproduce it? (The team can help you diagnose it further.)

1. Simple case using FastAPI with Ray Serve:

import ray
from ray import serve
from fastapi import FastAPI
import os
import time

app = FastAPI()

ray.init(address="auto", _node_ip_address='192.168.11.5', namespace="serve")
serve.start(http_options={"host": "0.0.0.0"}, detached=True)


@serve.deployment(num_replicas=2)
@serve.ingress(app)
class TestFastApi:
    def __init__(self):
        pass

    @app.get("/")
    async def test(self):
        startTime = time.time()
        receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
        pid = os.getpid()
        try:
            time.sleep(5)  # simulate long-running, blocking work
            # alternative CPU-bound workload:
            # for i in range(150000):
            #     print(i)
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'infoTrace': "success",
                'pid': pid,
            }
        except Exception as e:
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'errorTrace': str(e),  # exceptions are not JSON-serializable as-is
                'pid': pid,
            }


TestFastApi.deploy()
2. Simple case using the raw __call__() handler with Ray Serve:
import ray
from ray import serve
import os
import time
from starlette.requests import Request

ray.init(address="auto", _node_ip_address='192.168.11.5', namespace="serve")
serve.start(http_options={"host": "0.0.0.0"}, detached=True)


@serve.deployment(num_replicas=2)
class TestApi:
    def __init__(self):
        pass

    async def __call__(self, request: Request) -> dict:
        startTime = time.time()
        receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
        pid = os.getpid()
        try:
            # params: str = await request.json()  # API parameters would be parsed here
            time.sleep(5)  # simulate long-running, blocking work
            # alternative CPU-bound workload:
            # for i in range(150000):
            #     print(i)
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'infoTrace': "success",
                'pid': pid,
            }
        except Exception as e:
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'errorTrace': str(e),
                'pid': pid,
            }


TestApi.deploy()
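
For reference, a quick smoke test of both routes. This assumes Serve's default HTTP port 8000 and the default route prefix of "/<deployment name>" that the .deploy() API applies; adjust the URLs if you set route_prefix or the port explicitly:

# Assumed URLs based on default route prefixes; adjust to your setup.
import requests

print(requests.get("http://127.0.0.1:8000/TestFastApi/").json())  # FastAPI case
print(requests.get("http://127.0.0.1:8000/TestApi/").json())      # raw __call__ case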
1. We use time.sleep(5) to simulate long-running work and assign 2 CPUs for parallelism.
2. Deploy the two cases above to the server to compare the timing results.
3. We use Postman to send two requests to the raw __call__ case within a very short time (a scripted equivalent of this test is sketched after this list). The result: the 2 CPUs work in parallel, as expected.
4. We use Postman to send two requests to the FastAPI case within a very short time; here the two requests are not handled in parallel.
5. Is there any mistake in my FastAPI case code?
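
A minimal scripted stand-in for the Postman test, assuming the same default port and route prefixes as above. It fires a few requests at once and tallies which replica PIDs answered, using the pid field the handlers return:

# A sketch only: URL, port, and route prefix are assumptions taken from
# the scripts above; adjust them to your deployment.
import concurrent.futures
import time

import requests

URL = "http://127.0.0.1:8000/TestApi/"  # raw case; use /TestFastApi/ for the other
N = 3  # number of simultaneous requests

def hit(_):
    t0 = time.perf_counter()
    body = requests.get(URL, timeout=60).json()
    return body["pid"], time.perf_counter() - t0

with concurrent.futures.ThreadPoolExecutor(max_workers=N) as pool:
    results = list(pool.map(hit, range(N)))

# With true parallelism, every latency stays near 5 s and more than one
# replica PID answers; serialized handling shows latencies stacking
# toward 10 s and 15 s instead.
print("distinct replica PIDs:", {pid for pid, _ in results})
for pid, latency in results:
    print(f"pid={pid} latency={latency:.2f}s")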

By the way, when I trace the Ray Serve logs, it seems that with FastAPI the server uses one process to handle the request but responds from another process, so one request may occupy 2 CPUs.
I am not sure about this guess.