Understanding differences in performance for Ray.remote vs Ray Serve

I’m working on a small research project to investigate differences in performance for similar workloads on Ray Serve vs simply scheduling ray.remote tasks.

I have a very simple function that evaluates mathematical expressions, and I generate a few random expressions that force computation (recursive Fibonacci and factorial).

The code in Ray Serve looks roughly like this (please disregard the obvious security vulnerability :):

from ray import serve

@serve.deployment(num_replicas=8)
def evaluar(request):
    def fib(n):
        if n <= 1:
            return n
        return fib(n - 1) + fib(n - 2)

    def fac(n):
        res = 1
        for i in range(1, n + 1):
            res *= i
        return res

    # The expression to evaluate is embedded in the request path.
    op = request.url.path.split("evaluar/")[-1]
    return eval(op)

evaluar.deploy()

import asyncio
import aiohttp

async def loadtest_async(num_operaciones=200):
    async with aiohttp.ClientSession() as session:
        for _ in range(num_operaciones):
            op = generate_operation()
            deploy_url = f'http://127.0.0.1:8000/evaluar/{op}'
            async with session.get(deploy_url) as resp:
                result = await resp.json()

asyncio.run(loadtest_async())

I run a similar workload, except that the computation is scheduled directly as ray.remote tasks instead of through Ray Serve. Something like this:

import ray

@ray.remote
def evaluador(op):
    def fib(n):
        if n <= 1:
            return n
        return fib(n - 1) + fib(n - 2)

    def fac(n):
        res = 1
        for i in range(1, n + 1):
            res *= i
        return res

    return eval(op)

ray.get([evaluador.remote(generate_operation()) for _ in range(200)])

When running the two workloads, I see very different CPU usage:

  • The first workload (with Ray Serve) averages less than 10% CPU usage
  • The second workload (direct ray.remote calls) pushes CPU usage above 90% on all cores

What could explain this difference?

Note: I’m running this on my laptop with 8 threads.

Hmm, there are a few factors that might explain the difference:

  • Each Serve deployment replica is actually an entire actor. In other words, evaluar is turned into 8 different actors by @serve.deployment, but into a single task by @ray.remote. Additionally, requests are round-robin routed across the replicas, so when 8 requests are sent, each replica handles a different one. This may be lowering the throughput.
  • The Ray example takes advantage of batching by submitting all of the remote tasks up front and only then calling ray.get() on the whole list, so the tasks can run in parallel. The Serve code, however, isn't using Serve batching. Serve batching may help close the performance gap.
  • The Serve code incurs overhead from HTTP. Serve deploys an HTTP proxy actor when it starts; every HTTP request has to go through that actor, which then makes a remote call to the evaluar deployment. Using a ServeHandle to make the call directly (instead of an HTTP request) may help close the performance gap (see the sketch right after this list).
  • The 8 threads may also be interfering with the Ray Serve performance, but I’m not super sure about that.
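
For the ServeHandle approach, something like this should work (a rough sketch against the Ray 1.x-style deployment API used above; note that a handle call passes the argument straight to evaluar, so the deployment would need to accept the expression string directly instead of parsing it out of an HTTP request):

import ray
from ray import serve

# Get a handle to the already-deployed "evaluar" deployment and call it
# directly, bypassing the HTTP proxy actor.
handle = serve.get_deployment("evaluar").get_handle()

refs = [handle.remote(generate_operation()) for _ in range(200)]
results = ray.get(refs)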

Some suggestions for modifying this benchmark:

  • Try reducing the number of threads and/or replicas in Ray Serve to see if that increases utilization.
  • Try out Ray Serve batching (there's a rough sketch after this list). Keep in mind that batching only kicks in when multiple requests are pending at the same time, so you may need to issue requests from multiple separate threads or concurrently.
  • Try sending the initial Ray Serve requests using Ray Serve handles instead of HTTP requests.
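
As a rough sketch of what the batching change might look like with the @serve.batch decorator (the class-based deployment and the max_batch_size value here are just illustrative choices):

from ray import serve

@serve.deployment(num_replicas=8)
class Evaluar:
    # @serve.batch collects concurrent calls and hands them to this method
    # as a list, so one replica can evaluate several expressions per call.
    @serve.batch(max_batch_size=8)
    async def evaluate_batch(self, ops):
        return [eval(op) for op in ops]

    async def __call__(self, request):
        # Each request still carries a single expression; batching happens
        # transparently across concurrent requests.
        op = request.url.path.split("evaluar/")[-1]
        return await self.evaluate_batch(op)

Evaluar.deploy()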

I found this and couldn't initially get the code to work to try to reproduce it, but found the answer very worthwhile by itself. I simplified the problem to testing a simple Fibonacci method in 3 different scenarios: a remote task, a remote actor, and the Serve version (which is an actor). While implementing the Serve version I realized the problem: the loadtest_async() function sends its requests serially, awaiting each response before issuing the next request, which explains why CPU usage only reaches about 10%.
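
For reference, this is roughly what I mean by sending the requests concurrently instead of serially (a sketch assuming the same generate_operation() helper and local Serve endpoint as above):

import asyncio
import aiohttp

async def loadtest_async(num_operaciones=200):
    async with aiohttp.ClientSession() as session:

        async def send_one(op):
            async with session.get(f'http://127.0.0.1:8000/evaluar/{op}') as resp:
                return await resp.json()

        ops = [generate_operation() for _ in range(num_operaciones)]
        # Fire all requests at once and wait for them together, instead of
        # awaiting each response before sending the next request.
        return await asyncio.gather(*(send_one(op) for op in ops))

asyncio.run(loadtest_async())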