Ray Serve Parallelism Python GIL vs Java

Hello, I have a few questions regarding how Ray Serve handles threading and parallelism.

The documentation clearly states:

Neither the Threaded Actors nor AsyncIO for Actors model will allow you to bypass the GIL.

Source: Ray Concurrency For Actors API Documentation

  • Question 1: When Ray Serve runs multiple replicas of an Actor (e.g. one per CPU core), does it mean that they are not running in parallel but only asynchronously?

  • Question 2: Given that Ray Serve cannot bypass the GIL for Python, is it correct to assume that we can achieve better performance using the Ray Serve Java API? (This is an important question for us, as we come from a Java/Kotlin background.)

  • Question 3: When work is performed on the GPU, is Ray Serve still limited by the Python GIL?

Thanks

@Alexandre_Brown_md really good questions! The GIL is definitely a common pain point / confusion in Python in general :slight_smile:

The short answer is that Ray Serve (and Ray in general) relies on process-level parallelism to get around the GIL. Each of your replicas in Ray Serve maps to a Ray actor, which runs in its own process somewhere on the cluster. That means they can indeed run in parallel and saturate multiple CPUs/GPUs.

You can see this by running a simple experiment:

import os
import time

import ray
from ray import serve

serve.start()

# Two replicas -> two separate worker processes (Ray actors).
@serve.deployment(num_replicas=2)
class Test:
    def __call__(self, *args):
        return os.getpid()

Test.deploy()
handle = Test.get_handle()
while True:
    # Requests are spread across the two replica processes.
    print("Got PID:", ray.get(handle.remote()))
    time.sleep(0.5)

Output:

Got PID: 78255
Got PID: 78253
Got PID: 78255
Got PID: 78253
Got PID: 78255
Got PID: 78253

If you run the above, you’ll see that two different PIDs are returned, meaning that different requests are handled by different processes. If you added a sleep in the handler, you could also verify that they’re running in parallel.
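For example, here is a minimal sketch of that experiment (the same two-replica deployment as above, with a blocking time.sleep added to the handler). With two replicas running in separate processes, four concurrent requests should take roughly 2 seconds rather than 4:

import os
import time

import ray
from ray import serve

serve.start()

@serve.deployment(num_replicas=2)
class Test:
    def __call__(self, *args):
        time.sleep(1)  # One second of blocking work per handler call.
        return os.getpid()

Test.deploy()
handle = Test.get_handle()

start = time.time()
# Send 4 requests at once; the two replica processes work on them in parallel.
print("Got PIDs:", ray.get([handle.remote() for _ in range(4)]))
print("4 requests took:", time.time() - start)  # Should be roughly 2 seconds, not 4.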

Note that this means there is no parallelism within a single replica by default. However, you can use Python's asyncio to achieve concurrency within a replica.

For example, if I send 5 concurrent requests to a deployment whose handler sleeps for 1 second, without asyncio they will be handled serially and take around 5 seconds:

import time

import ray
from ray import serve

serve.start()

@serve.deployment
class Test:
    def __call__(self, *args):
        time.sleep(1)
        return "hello world!"

Test.deploy()
handle = Test.get_handle()

start = time.time()
ray.get([handle.remote() for _ in range(5)])  # Send 5 requests concurrently.
print("5 requests took:", time.time() - start)  # This should take ~5 seconds.

Output:
5 requests took: 5.031394958496094

However, with asyncio we can handle these concurrently by changing time.sleep to asyncio.sleep. In reality, the sleep would stand in for some kind of IO, such as reading a large file from S3 (not compute-intensive work like model inference). Our modified example should return much more quickly:

import asyncio
import time

import ray
from ray import serve

serve.start()

@serve.deployment
class Test:
    async def __call__(self, *args):
        await asyncio.sleep(1)
        return "hello world!"

Test.deploy()
handle = Test.get_handle()

start = time.time()
ray.get([handle.remote() for _ in range(5)])  # Send 5 requests concurrently.
print("5 requests took:", time.time() - start)  # This should take a little over 1 second.

Output:
5 requests took: 1.0138587951660156

Of course, you could also have a deployment using both multiple replicas and asyncio, in which case you could achieve both parallelism across replicas and concurrency within the replicas.
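For instance, a sketch combining the two (using the same API as the examples above) could look like this, where ten concurrent requests should still finish in a little over a second because each of the two replica processes handles its share concurrently with asyncio:

import asyncio
import time

import ray
from ray import serve

serve.start()

# Two replicas -> parallelism across processes;
# an async handler -> concurrency within each process.
@serve.deployment(num_replicas=2)
class Test:
    async def __call__(self, *args):
        await asyncio.sleep(1)  # Stand-in for IO-bound work such as a network call.
        return "hello world!"

Test.deploy()
handle = Test.get_handle()

start = time.time()
ray.get([handle.remote() for _ in range(10)])  # Send 10 requests concurrently.
print("10 requests took:", time.time() - start)  # Should be a little over 1 second.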

Hope this is helpful! :slight_smile:


Thanks a lot @eoakes, the answer and the code examples were extremely clear.

Cheers