Low throughput and unable to scale with Ray Serve

1. Severity of the issue: (select one)

High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.1
  • Python version: 3.10.13
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: To be able to scale and improve request throughput by adding Ray Serve replicas
  • Actual: No throughput gains from scaling with Ray Serve over a plain FastAPI application

I am trying to scale my application using FastAPI and Ray Serve. I am following the documentation, but I cannot get any gains from scaling with Ray Serve over a plain FastAPI application.
My throughput with Ray Serve is the same as with FastAPI alone (even slightly worse). Scaling with NUM_REPLICAS=2 does not improve throughput when testing on my laptop. What am I missing here? I was expecting Ray Serve to improve throughput.
Code:
ray_server.py


import os

from fastapi import FastAPI
from ray import serve

# MLService and ExtractionResult are defined elsewhere in the application (imports omitted here).

app = FastAPI()


@serve.deployment(
    # Environment variables are strings, so cast them to numbers before passing them to Ray.
    num_replicas=int(os.environ.get("NUM_REPLICAS", 2)),
    ray_actor_options={
        "num_cpus": float(os.environ.get("NUM_CPU", 1)),
        "num_gpus": float(os.environ.get("NUM_GPU", 0)),
    },
)
@serve.ingress(app)
class Service:
    def __init__(self):
        self.ml = MLService()

    async def predict(self, text: str):
        return await self.ml.predict(text)

    @app.post("/extract")
    async def extract(self, content: str):
        response = await self.predict(text=content)
        return ExtractionResult(mentions=response)


def deployment(_args):
    return Service.bind()

Deploying with:

serve run server.ray_server:deployment
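
For reference, a minimal concurrent load test against the /extract route could look like the sketch below (hypothetical script, not my exact test setup; it assumes the default serve run address http://127.0.0.1:8000 and that content is sent as a query parameter, which matches the bare str parameter in the handler above). Running the same script against the FastAPI-only app and against serve run with NUM_REPLICAS=1 and 2 makes the comparison concrete.

# bench.py -- fire concurrent POSTs at /extract and report requests per second
import asyncio
import time

import httpx

URL = "http://127.0.0.1:8000/extract"  # adjust host/port if serve run is configured differently
N_REQUESTS = 200
CONCURRENCY = 16


async def worker(client: httpx.AsyncClient, n: int):
    # Each worker sends its share of requests sequentially; workers run concurrently.
    for i in range(n):
        resp = await client.post(URL, params={"content": f"sample text {i}"})
        resp.raise_for_status()


async def main():
    per_worker = N_REQUESTS // CONCURRENCY
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=30.0) as client:
        await asyncio.gather(*(worker(client, per_worker) for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{per_worker * CONCURRENCY / elapsed:.1f} requests/sec over {elapsed:.2f}s")


asyncio.run(main())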

If I understand correctly: when you increase the number of replicas from 1 to 2, your requests per second do not go from, say, 50 rps to 100 rps? Or is it that, compared to an equivalent FastAPI-only setup, the delta is effectively zero?

If MLService.predict() is CPU-bound and not truly async (i.e., it runs blocking code), then requests may be queuing one behind the other inside each replica.
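
As an illustration (not your actual code), one way to keep a blocking predict call from stalling the replica's event loop is to push it onto a worker thread; the _predict_blocking name below is hypothetical:

import asyncio


class MLService:
    def _predict_blocking(self, text: str):
        # Placeholder for the real model call, which does not await anything
        # and therefore blocks the event loop if called directly from async code.
        ...

    async def predict(self, text: str):
        # Run the blocking call in a thread so the replica's event loop can
        # keep accepting and dispatching other requests in the meantime.
        return await asyncio.to_thread(self._predict_blocking, text)

Note that a thread only helps if the blocking part releases the GIL (e.g. native model code or I/O); for pure-Python CPU work the real parallelism has to come from more replicas, i.e. more processes.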

Could you share some more logs that show what you are seeing?