As the official docs suggest, Ray Data is the better fit for offline batch inference, while Ray Serve is mostly for online inference. However, for a data processing pipeline that involves both CPU and GPU operations, why can't I serve the GPU operations with Ray Serve separately and have them take requests from the CPU operations? What would be the downside of such a design?
It seems that by dedicating the model inference to a Ray Serve deployment, GPU utilization can stay very high as long as requests from the CPU stages keep coming in and saturate the GPUs.
The context is that we always want to keep GPU utilization as high as possible; CPUs are relatively cheap, so we can keep adding CPUs to process the upstream tasks that feed the GPUs.
Virtual Clusters will be able to support this type of scenario: mixing offline and online jobs to better saturate utilization across a single Ray Cluster.
See this REP for more details on the work: [REP] Virtual Cluster by jjyao · Pull Request #49 · ray-project/enhancements · GitHub; we're planning to start work on this new capability sometime this year.
If you want to do this today, the recommended path is to run two Ray Clusters on your compute substrate (i.e. your EC2/GCP fleet and/or your K8s clusters) and let the Ray Scheduler/Autoscaler automatically scale each Ray Cluster in/out to meet your separate offline and online inference requirements.
However, this still doesn't let you share a single physical node across the two different job types; it is quite involved to make the necessary changes to the Autoscaler and Scheduler (not to mention ensuring the Ray GCS and Object Store can reliably keep track of both types of jobs on a single physical node). That will come with the Virtual Cluster feature.
Thanks Sam, I understand that a Virtual Cluster could separate the workloads. My question here, though, is more about the comparison between Ray Data and Ray Serve.
Many use cases for Ray Data (e.g. offline batch inference) involve AI inference. Say a typical Ray Data job involves a CPU task and a GPU task. What if we replaced all of those GPU tasks (i.e. the AI inference) with Ray Serve, so that the CPU tasks just invoke the Ray Serve APIs directly once they finish? In that case I could have many CPU tasks running in parallel, and as long as Ray Serve can still handle the requests, the GPUs would always be saturated?
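To make the idea concrete, here is a rough sketch of the pattern I have in mind (all names and resource numbers are made up for illustration): CPU-bound Ray tasks do the preprocessing and then call a GPU-backed Serve deployment.
import ray
from ray import serve

# GPU-backed inference deployment; each replica holds the model on one GPU.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class GpuInference:
    def __call__(self, batch):
        # Placeholder for the real model forward pass.
        return batch

serve.run(GpuInference.bind(), name="gpu_inference")

# CPU-only preprocessing task that hands its output to the Serve deployment.
@ray.remote(num_cpus=1)
def cpu_task(raw_batch):
    handle = serve.get_app_handle("gpu_inference")
    preprocessed = raw_batch  # real CPU preprocessing would happen here
    return handle.remote(preprocessed).result()

# Many CPU tasks run in parallel; as long as they keep submitting requests,
# the Serve replicas (and the GPUs behind them) should stay busy.
results = ray.get([cpu_task.remote(batch) for batch in range(8)])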
On the same Ray Cluster or on two Clusters?
On the same Ray Cluster. Basically, many Ray Serve deployments run on the same Ray Cluster and replace the AI inference stages of a typical Ray Data pipeline.
Here is an example of using Ray Data + Ray Serve for offline inference, which may help us discuss the requirements. The potential benefits include:
- Models are resident on the devices, eliminating the time required for model initialization/GC at the job level.
- Different jobs can share the model inference service, maximizing device (GPU/NPU) utilization.
Ray Serve
from ray import serve

@serve.deployment
class ObjectDetectionModel:
    def __init__(self):
        """Define the model loading and initialization code."""

    def __call__(self, input_batch):
        predictions = self.model(input_batch)
        return predictions

serve.run(ObjectDetectionModel.bind())
Ray Data
import ray
from ray import serve

def preprocess_image(data):
    """Preprocess images."""

class ObjectDetectionInference:
    def __init__(self):
        """Initialize the Serve deployment handle."""
        self.handle = serve.get_deployment_handle("ObjectDetectionModel", "default")
        # Enable local-prefer routing so requests go to a co-located replica when possible.
        self.handle = self.handle.options(_prefer_local_routing=True)

    def __call__(self, input_batch):
        """Use the Serve handle to process a batch of data."""
        result = self.handle.remote(input_batch).result()
        return result

ds = ray.data.read_images("/input_path")
ds = ds.map(preprocess_image)
ds = ds.map_batches(ObjectDetectionInference)
for item in ds.iter_rows():
    pass
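To illustrate the second benefit above (different jobs sharing the resident model service), a second Ray Data job on the same cluster can reuse the running deployment instead of loading its own copy of the model. This is a minimal sketch that assumes the ObjectDetectionModel deployment and the helpers from the snippets above are already in place; the input path is hypothetical.
# A second, independent job that reuses the already-running deployment.
ds2 = ray.data.read_images("/another_input_path")  # hypothetical path
ds2 = ds2.map(preprocess_image)
ds2 = ds2.map_batches(ObjectDetectionInference)

# Consuming the dataset drives requests through the shared Serve handle,
# so the resident GPU model is kept busy across jobs.
for item in ds2.iter_rows():
    pass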