Concurrently Processing Requests w/ Ray Serve

Hey Team! :wave:

I’m experimenting with Ray Serve to see if I can parallelize my compute time for model predictions. I’m hoping someone can confirm/deny my assumptions and/or point me in the right direction.

My Tests
Currently I have simple code that loads the zero-shot-classification model and a compute function that uses it. When I run the program, I see the round-robin aspect of Ray Serve: replica 1 runs compute, then replica 2, then replica 1 again, essentially processing my array one-by-one.

However, when I looked at the Batching Tutorial in the Ray 2.3.1 docs, I saw that they use a @ray.remote function to make the calls asynchronously. Is this the way to do it?

Relevant code

```python
import time

from ray import serve
from transformers import pipeline


@serve.deployment(route_prefix="/", num_replicas=2)
class ZeroshotDeployment:
    def __init__(self):
        start_time = time.time()
        # Load the zero-shot classification pipeline once per replica
        self.model = pipeline('zero-shot-classification', model="../models")
        print("End Init: ", time.time() - start_time)

    def __call__(self, request):
        labels = request.query_params["labels"].split(",")
        sentence = request.query_params["sentence"]
        print("RECEIVED: ", sentence)
        return self.model(sentence, labels, hypothesis_template='This text is about {}', multi_label=True)


serve.run(ZeroshotDeployment.bind())
```

```python
import time

import ray
import requests
from faker import Faker

# Create a faker instance and generate a list of test sentences
fake = Faker()
sentences = [fake.sentence() for _ in range(10)]

candidate_labels = ["car", "house", "phone", "teamwork", "passion", "group", "things"]
modified_labels = ','.join(candidate_labels)


# @ray.remote turns send_query into a Ray task, so .remote() calls
# return futures immediately and the requests are sent in parallel
@ray.remote
def send_query(sentence, labels):
    resp = requests.get(
        "http://localhost:8000/", params={"sentence": sentence, "labels": labels}
    )
    # Return the response body so the task result serializes cleanly
    return resp.text


start_time = time.time()
result_batch = ray.get([send_query.remote(sentence, modified_labels) for sentence in sentences])
print("Time: ", time.time() - start_time)
```

I have a Hugging Face model defined in Ray Serve with 2 replicas spun up. When I send requests to the Serve cluster, I want both replicas running compute to theoretically cut the time in half for computing a list of length n. By extension, with 3 replicas I would expect roughly a 3x speedup on computing the list.
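The scaling I'm hoping for can be sanity-checked with a quick back-of-the-envelope calculation (just a sketch; `ideal_batch_time` and the 1 s per-request figure are made up for illustration):

```python
import math

def ideal_batch_time(n_requests, per_request_s, n_replicas):
    # With perfect round-robin load balancing, the batch finishes after
    # ceil(n_requests / n_replicas) sequential "waves" of work per replica.
    return math.ceil(n_requests / n_replicas) * per_request_s

# 10 requests at 1 s each:
print(ideal_batch_time(10, 1.0, 1))  # 10.0 s serially
print(ideal_batch_time(10, 1.0, 2))  # 5.0 s with 2 replicas
print(ideal_batch_time(10, 1.0, 3))  # 4.0 s with 3 (ceil(10/3) = 4 waves)
```

So the speedup is not exactly 1/n_replicas unless the batch divides evenly, but it should be close.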



Async allows the replicas to process requests concurrently. Replicas work in a round-robin fashion, balancing the requests between them.
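The key is that the client must have all requests in flight at once; any fan-out mechanism works, not just Ray tasks. Here is a minimal sketch of the same pattern using only the standard library (the `send_query` body is a stand-in for the real `requests.get` call, since no Serve cluster is running here, and the 0.2 s sleep models a replica busy with model compute):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_query(sentence):
    # Stand-in for requests.get("http://localhost:8000/", ...)
    time.sleep(0.2)
    return f"classified: {sentence}"

sentences = [f"sentence {i}" for i in range(10)]

start = time.time()
# Submitting all requests at once lets Serve's round-robin router
# hand them to every replica concurrently instead of one-by-one.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(send_query, sentences))
elapsed = time.time() - start

print(results[0])   # classified: sentence 0
# elapsed is ~0.2 s because the ten calls overlap, not ~2 s
```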

Here’s one example similar to what you’re trying to do, serving a GPT-J model.

cc: @cindy_zhang