@Alexandre_Brown_md really good questions! The GIL is definitely a common pain point and source of confusion in Python in general.
The short answer is that Ray Serve (and Ray in general) relies on process-level parallelism to get around the GIL. Each of your replicas in Ray Serve maps to a Ray actor, which runs in its own process across the cluster. That means they can indeed run in parallel and saturate multiple CPUs/GPUs.
You can see this by running a simple experiment:
import os
import time

import ray
from ray import serve

serve.start()

@serve.deployment(num_replicas=2)
class Test:
    def __call__(self, *args):
        return os.getpid()

Test.deploy()
handle = Test.get_handle()

while True:
    print("Got PID:", ray.get(handle.remote()))
    time.sleep(0.5)
Output:
Got PID: 78255
Got PID: 78253
Got PID: 78255
Got PID: 78253
Got PID: 78255
Got PID: 78253
If you run the above, you’ll see that two different PIDs are returned, meaning that different requests are handled by different processes. If you added a sleep in the handler, you could also verify that they’re running in parallel.
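Here's a rough sketch of that check (the SleepTest name and the 2-second sleep are just illustrative, not from the example above): with 2 replicas, 2 concurrent sleeping requests should finish in roughly 2 seconds total rather than 4.
import time

import ray
from ray import serve

serve.start()

# Illustrative deployment: each call blocks for 2 seconds.
@serve.deployment(num_replicas=2)
class SleepTest:
    def __call__(self, *args):
        time.sleep(2)
        return "done"

SleepTest.deploy()
handle = SleepTest.get_handle()

start = time.time()
# Two concurrent requests should land on the two replicas (separate processes),
# so this should take roughly 2 seconds instead of 4.
ray.get([handle.remote() for _ in range(2)])
print("2 requests took:", time.time() - start)
With num_replicas=1, the same two requests would take roughly 4 seconds, since a single replica would handle them one after the other.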
Note that this means that within each replica there is no parallelism by default. However, you can use Python's asyncio to achieve concurrency within a replica.
For example, if I send 5 parallel requests to a deployment that each sleep for 1 second, without asyncio they will be handled serially and this will take around 5 seconds:
import time

import ray
from ray import serve

serve.start()

@serve.deployment
class Test:
    def __call__(self, *args):
        time.sleep(1)
        return "hello world!"

Test.deploy()
handle = Test.get_handle()

start = time.time()
ray.get([handle.remote() for _ in range(5)])  # Send 5 requests concurrently.
print("5 requests took:", time.time() - start)  # This should take ~5 seconds.
Output:
5 requests took: 5.031394958496094
However, with asyncio we can handle these concurrently if we change time.sleep to asyncio.sleep. In reality, this sleep operation would probably be some kind of IO, like reading a large file from S3 (it wouldn't be compute-intensive work like model inference); a rough sketch of such an IO-bound handler follows the example below. Our modified example should return much more quickly:
import asyncio
import time

import ray
from ray import serve

serve.start()

@serve.deployment
class Test:
    async def __call__(self, *args):
        await asyncio.sleep(1)
        return "hello world!"

Test.deploy()
handle = Test.get_handle()

start = time.time()
ray.get([handle.remote() for _ in range(5)])  # Send 5 requests concurrently.
print("5 requests took:", time.time() - start)  # This should take a little over 1 second.
Output:
5 requests took: 1.0138587951660156
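To make the IO point concrete, here's a hedged sketch of what a real IO-bound handler might look like, using the aiobotocore async S3 client; the library choice and the bucket/key names are assumptions for illustration, not part of the original example:
import ray
from ray import serve
from aiobotocore.session import get_session  # assumed async S3 client library

serve.start()

@serve.deployment
class S3Reader:
    async def __call__(self, *args):
        session = get_session()
        # While this replica awaits the network IO, asyncio lets it start
        # handling other queued requests concurrently.
        async with session.create_client("s3") as client:
            # Hypothetical bucket/key names, just for illustration.
            resp = await client.get_object(Bucket="my-bucket", Key="large-file.bin")
            body = await resp["Body"].read()
        return len(body)

S3Reader.deploy()
handle = S3Reader.get_handle()
print("Object size:", ray.get(handle.remote()))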
Of course, you could also have a deployment using both multiple replicas and asyncio, in which case you could achieve both parallelism across replicas and concurrency within each replica.
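A minimal sketch of that combination, assuming the same toy sleep workload as above (the replica count and request count are arbitrary):
import asyncio
import time

import ray
from ray import serve

serve.start()

# Two replicas (two processes), each handling requests concurrently via asyncio.
@serve.deployment(num_replicas=2)
class Combined:
    async def __call__(self, *args):
        await asyncio.sleep(1)  # Stand-in for IO-bound work.
        return "hello world!"

Combined.deploy()
handle = Combined.get_handle()

start = time.time()
# 10 concurrent requests: spread across the 2 processes and interleaved
# on each replica's event loop, so this should still take roughly 1 second.
ray.get([handle.remote() for _ in range(10)])
print("10 requests took:", time.time() - start)
Each replica process interleaves its share of the requests on its own event loop, so the wall-clock time stays close to a single sleep.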
Hope this is helpful!