Why Ray Serve only just use half numbers of replicas for parallelism

liu_meteorfall · January 12, 2023, 5:34am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hi Team,
I have a problem when using Ray.Serve for my api deployment.
My computer has 8 physical CPU cores and 16 logical processors,
When I assigned 4 replicas by

@serve.deployment(num_replicas=4)

And the dashboard shows that 6 process are working (2 of all are used by ray).

But in request test by postman, I found that Ray can only process 2 requests parallelly. for exam, I send 3 requests at the same time, only 2 processors are working at the same time, another 2 are idle.

When I updated num_replicas to 6, only 3 processors are working, anthors 3 are idle. It seems that I can only use half of the CPUs that I config.
Is there any mistake in my config or code?

PS: I set the OMP_NUM_THREADS to match the number of parallelism,but it doesn’t work.

@serve.deployment(num_replicas=4,ray_actor_options={"num_cpus":1,"num_gpus":0})
# @serve.deployment
@serve.ingress(app)
class CutOptimize:
    def __init__(self):
        os.environ["OMP_NUM_THREADS"]="4"

liu_meteorfall · January 12, 2023, 7:28am

After testing, it seems that caused by FastAPI.
By using fastAPI to code, the CPUs must be double to complete one task.
But using raw __ call __(), it works good.

def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}

Sihan_Wang · January 23, 2023, 5:56pm

Hi @liu_meteorfall, glad you fix the issue by your own! but do you mind sharing a simple script to reproduce the issue? (the team can help you diagnose more)

liu_meteorfall · February 10, 2023, 9:01am

1.simple case of using raw call by ray serve:

import ray
from ray import serve
from fastapi import FastAPI,Form
import os
import time

app = FastAPI()
origins = [
    "*"
]

ray.init(address="auto",_node_ip_address='192.168.11.5', namespace="serve")
serve.start(http_options={"host":"0.0.0.0"},detached=True)


@serve.deployment(num_replicas=2)
# @serve.deployment
@serve.ingress(app)
class TestFastApi:
    def __init__(self):
        pass

    @app.get("/")
    async def test(self):
        startTime = time.time()
        receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
        pid = os.getpid()
        try:
            time.sleep(5)
            # for i in range(150000):
            #     print(i)

            return {
                        'receiveTime':receiveTime,
                        'runTime':time.time()-startTime,
                        'infoTrace':"successs",
                        'pid':pid
                }

        except Exception as e:
            return {
                        'receiveTime':receiveTime,
                        'runTime':time.time()-startTime,
                        'errorTrace':e,
                        'pid':pid
                }

TestFastApi.deploy()

simple case of using FastApi by ray serve:

import ray
from ray import serve
import os
import time
from starlette.requests import Request

ray.init(address="auto",_node_ip_address='192.168.11.5', namespace="serve")
serve.start(http_options={"host":"0.0.0.0"},detached=True)


@serve.deployment(num_replicas=2)
class TestApi:
    def __init__(self):
        pass
    
    async def __call__(self, request: Request) -> str:

        try:
            #params: str = await request.json()
            startTime = time.time()
            receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
            pid = os.getpid()
        
            #api 参数

            
            time.sleep(5)
            # for i in range(150000):
            #     print(i)

            return {
                        'receiveTime':receiveTime,
                        'runTime':time.time()-startTime,
                        'infoTrace':"successs",
                        'pid':pid,
                }

        except Exception as e:
            return {
                        'receiveTime':receiveTime,
                        'runTime':time.time()-startTime,
                        'infoTrace':"successs",
                        'pid':pid,
                }

TestApi.deploy()

We use “time.sleep(5)” to simulate the long time working and assign 2 CPUs for parallelism.
deploy the above 2 to server for comparing the time results
We use postman to send two request for raw case api in a very short time, the result is:

image1380×620 75.5 KB

image1500×603 85.9 KB

The 2 CPUs is working parallelly as expect
We use postman to send two request for fastapi case in a very short time, the result is:

image1391×640 75.7 KB

image1499×627 88.9 KB
Is there any mistake in my fastapi case code?

liu_meteorfall · February 10, 2023, 9:08am

By the way, when I trace the log of ray serve, it seems that with fastapi, server will use one process to handle the request, but response with another process, so it may occupy 2 CPUs in one request.
I am not sure about my guess.

Topic		Replies	Views
Not sure how num_replicas works Ray Serve	5	1704	March 4, 2021
Ray Serve - Setting num_replicas > 1 errors out and not using GPU Ray Serve	5	975	January 13, 2022
Ray Serve not distributing load to all replicas equally Ray Serve	3	49	June 20, 2025
Why there is no possibility to call more than 100 requests in parallel to Ray Serve? Ray Serve	4	256	January 10, 2024
Ray Serve Parallelism Python GIL vs Java Ray Serve	2	1068	August 5, 2021

Why Ray Serve only just use half numbers of replicas for parallelism

Related topics