Low throughput and unable to scale with Ray Serve

1. Severity of the issue: (select one)

High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.1
  • Python version: 3.10.13
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: To be able to scale and improve request throughput by adding Ray Serve replicas
  • Actual: No throughput gains from scaling with Ray Serve over a plain FastAPI application

I am trying to scale my application using FastAPI and Ray Serve. I am following the documentation, but I cannot get any gains from scaling with Ray Serve over a plain FastAPI application.
My throughput with Ray Serve is the same as with FastAPI alone (even slightly worse). Scaling with NUM_REPLICAS=2 does not improve throughput when testing on my laptop. What am I missing here? I was expecting Ray Serve to improve throughput.
Code:
ray_server.py


import os

from fastapi import FastAPI
from ray import serve

# MLService and ExtractionResult are defined elsewhere in the application (imports omitted here).

app = FastAPI()


@serve.deployment(
    # Environment variables are strings, so cast them to numbers before passing them to Ray.
    num_replicas=int(os.environ.get("NUM_REPLICAS", 2)),
    ray_actor_options={
        "num_cpus": float(os.environ.get("NUM_CPU", 1)),
        "num_gpus": float(os.environ.get("NUM_GPU", 0)),
    },
)
@serve.ingress(app)
class Service:
    def __init__(self):
        self.ml = MLService()

    async def predict(self, text: str):
        return await self.ml.predict(text)

    @app.post("/extract")
    async def extract(self, content: str):
        response = await self.predict(text=content)
        return ExtractionResult(mentions=response)


def deployment(_args):
    return Service.bind()

Deploying with:

serve run server.ray_server:deployment
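
For reference, a minimal concurrent load test against the /extract route could look like the sketch below (hypothetical script, not my exact test setup; it assumes the default serve run address http://127.0.0.1:8000 and that content is sent as a query parameter, which matches the bare str parameter in the handler above). Running the same script against the FastAPI-only app and against serve run with NUM_REPLICAS=1 and 2 makes the comparison concrete.

# bench.py -- fire concurrent POSTs at /extract and report requests per second
import asyncio
import time

import httpx

URL = "http://127.0.0.1:8000/extract"  # adjust host/port if serve run is configured differently
N_REQUESTS = 200
CONCURRENCY = 16


async def worker(client: httpx.AsyncClient, n: int):
    # Each worker sends its share of requests sequentially; workers run concurrently.
    for i in range(n):
        resp = await client.post(URL, params={"content": f"sample text {i}"})
        resp.raise_for_status()


async def main():
    per_worker = N_REQUESTS // CONCURRENCY
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=30.0) as client:
        await asyncio.gather(*(worker(client, per_worker) for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{per_worker * CONCURRENCY / elapsed:.1f} requests/sec over {elapsed:.2f}s")


asyncio.run(main())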

If I understand correctly: when you increase the number of replicas from 1 to 2, your requests per second do not go from, say, 50 rps to 100 rps? Or is it that, compared to an equivalent FastAPI-only setup, the delta is effectively zero?

If MLService.predict() is CPU-bound and not truly async (i.e., it runs blocking code), then requests may be queuing one behind the other inside each replica.
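
As an illustration (not your actual code), one way to keep a blocking predict call from stalling the replica's event loop is to push it onto a worker thread; the _predict_blocking name below is hypothetical:

import asyncio


class MLService:
    def _predict_blocking(self, text: str):
        # Placeholder for the real model call, which does not await anything
        # and therefore blocks the event loop if called directly from async code.
        ...

    async def predict(self, text: str):
        # Run the blocking call in a thread so the replica's event loop can
        # keep accepting and dispatching other requests in the meantime.
        return await asyncio.to_thread(self._predict_blocking, text)

Note that a thread only helps if the blocking part releases the GIL (e.g. native model code or I/O); for pure-Python CPU work the real parallelism has to come from more replicas, i.e. more processes.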

Could you share some more logs that show what you are seeing?