Why cannot we use Ray Serve for offline batch services?

nemocbb · June 1, 2024, 2:59am

As the official docs suggest, Ray Data is the better fit for offline batch inference, while Ray Serve is mostly for online inference. However, for a data processing pipeline that involves both CPU and GPU operations, why can't I serve the GPU operations separately with Ray Serve and have the CPU operations send requests to it? What would be the downside of such a design?

It seems that by dedicating model inference to a Ray Serve cluster, GPU utilization can be very high as long as requests from the CPU operations keep coming in and saturate the GPUs.

For context, we want to keep GPU utilization as high as possible at all times. CPUs are relatively cheap, so we can always add more CPUs for the upstream tasks to keep the GPUs saturated.
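A rough way to size the CPU side of this idea is sketched below. It assumes you have measured steady-state throughput for both stages; the function name and the numbers are illustrative and do not come from Ray or from this thread.

```python
import math


def cpu_workers_needed(gpu_items_per_s: float,
                       cpu_items_per_s_per_worker: float,
                       headroom: float = 1.0) -> int:
    """Return how many CPU workers are needed to keep one GPU stage fed.

    The CPU stage must produce items at least as fast as the GPU stage
    consumes them. `headroom` (>= 1.0) adds slack for jitter.
    """
    if gpu_items_per_s <= 0 or cpu_items_per_s_per_worker <= 0:
        raise ValueError("throughputs must be positive")
    if headroom < 1.0:
        raise ValueError("headroom must be >= 1.0")
    ratio = gpu_items_per_s * headroom / cpu_items_per_s_per_worker
    # Round first so float noise (e.g. 3.0000000000000004) does not add a worker.
    return math.ceil(round(ratio, 9))


# Illustrative numbers only: the GPU stage consumes 800 images/s and one
# CPU worker preprocesses 50 images/s.
print(cpu_workers_needed(800, 50))
print(cpu_workers_needed(800, 50, headroom=1.25))
```

If the CPU side is provisioned below this number, the GPUs idle no matter which library fronts the model.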

Sam_Chan · June 3, 2024, 4:28am

Virtual Clusters will be able to handle this type of scenario: running mixed offline/online jobs to better saturate utilization across a single Ray Cluster.

See this REP for more details of the work: [REP] Virtual Cluster by jjyao · Pull Request #49 · ray-project/enhancements · GitHub. We're planning to start work on this new capability sometime this year.

If you want to do this today, the recommended path is to run two Ray Clusters on your compute substrate (i.e., your EC2/GCP fleet and/or your K8s clusters) and let the Ray Scheduler/Autoscaler automatically scale each Ray Cluster in and out to meet your separate offline and online inference requirements.

However, this still doesn't let you share a single physical node across two different job types. Supporting that requires fairly involved changes to the Autoscaler and Scheduler (not to mention making sure the Ray GCS and Object Store can reliably keep tabs on both types of jobs on a single physical node). That will come with the Virtual Cluster feature.

nemocbb · June 4, 2024, 2:35pm

Thanks Sam, I understand that virtual clusters could separate the workloads. My question, though, is more about the comparison between Ray Data and Ray Serve.

Many use cases for Ray Data (e.g., offline batch inference) involve AI inference. Let's say a typical Ray Data job involves a CPU task and a GPU task. What if I replace all of those GPU tasks (i.e., the AI inference) with Ray Serve, so that the CPU tasks directly invoke the Ray Serve APIs once they finish? In that case, I could have many CPU tasks running in parallel, and as long as Ray Serve can keep up with the requests, wouldn't the GPUs always be saturated?

Sam_Chan · June 4, 2024, 5:11pm

On the same Ray Cluster or on two Clusters?

nemocbb · June 11, 2024, 3:02pm

On the same Ray Cluster. Basically, many Ray Serve instances are deployed on the same Ray Cluster, and they replace the AI inference parts of a typical Ray Data pipeline.

liuxsh9 · June 12, 2024, 7:48am

Here is an example of using Ray Data + Ray Serve for offline inference, which may help us discuss the requirements. The potential benefits include:

Models stay resident on the devices, eliminating per-job model initialization and GC time.
Different jobs can share the model inference service, maximizing device (GPU/NPU) utilization.

Ray Serve

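from ray import serve


@serve.deployment
class ObjectDetectionModel:
    def __init__(self):
        """Load and initialize the model once per replica."""
        # Stand-in model so the example runs; replace with real loading code.
        self.model = lambda batch: batch

    def __call__(self, input_batch):
        """Run inference on one batch and return the predictions."""
        predictions = self.model(input_batch)
        return predictions


# Deploys under the default application name, "default".
serve.run(ObjectDetectionModel.bind())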

Ray Data

import ray
from ray import serve


def preprocess_image(row):
    """Preprocess one image row (placeholder: returns the row unchanged)."""
    return row


class ObjectDetectionInference:
    def __init__(self):
        """Get a handle to the running Serve deployment."""
        self.handle = serve.get_deployment_handle("ObjectDetectionModel", "default")
        # Enable prefer-local routing.
        self.handle = self.handle.options(_prefer_local_routing=True)

    def __call__(self, input_batch):
        """Send the batch to the Serve deployment and wait for the result."""
        result = self.handle.remote(input_batch).result()
        return result


ds = ray.data.read_images("/input_path")
ds = ds.map(preprocess_image)
# Callable classes run in an actor pool; recent Ray versions require
# `concurrency` (4 is an example value).
ds = ds.map_batches(ObjectDetectionInference, concurrency=4)

# Iterate to trigger execution of the lazy pipeline.
for item in ds.iter_rows():
    pass
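
One thing to watch in this pattern: the `.result()` call in `ObjectDetectionInference.__call__` blocks, so each Ray Data actor keeps only one Serve request in flight at a time. A back-of-the-envelope estimate of how many such blocking callers are needed to keep the deployment busy follows from Little's law (requests in flight = throughput x latency). The helper name and the numbers below are illustrative assumptions, not measurements.

```python
import math


def callers_needed(target_batches_per_s: float, latency_s: float) -> int:
    """Estimate the blocking callers needed to sustain a target throughput.

    Little's law: average requests in flight = throughput * latency.
    Each blocking caller contributes at most one in-flight request.
    """
    if target_batches_per_s <= 0 or latency_s <= 0:
        raise ValueError("throughput and latency must be positive")
    in_flight = target_batches_per_s * latency_s
    # Round first so float noise does not push the result up by one.
    return math.ceil(round(in_flight, 9))


# Illustrative numbers only: the Serve deployment can sustain 40 batches/s
# in total and a single request takes 0.25 s end to end.
print(callers_needed(40, 0.25))
```

With fewer `ObjectDetectionInference` actors than this estimate, the GPUs behind the deployment would sit partly idle even though the CPU stage has spare capacity.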
