Hey Team!
I’m experimenting with Ray Serve to see if I can parallelize the compute time of my model predictions. I’m hoping someone can confirm/deny my assumptions and/or point me in the right direction.
My Tests
Currently I have simple code that loads the zero-shot-classification model and a compute function that uses it. When I run the program I see the round-robin aspect of Ray Serve: replica 1 runs compute, then replica 2, then replica 1 again, so my array is essentially processed one item at a time.
However, when I looked at the [Batching Tutorial — Ray 2.3.1](https://docs.ray.io/en/releases-2.3.1/serve/tutorials/batch.html), I saw that they make use of a @ray.remote function to send the queries asynchronously. Is this the way to do it?
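If it helps clarify what I mean, the server-side batching pattern from that tutorial looks roughly like this (my sketch from memory; the labels and batch parameters are placeholders, not values from the tutorial):

```python
from ray import serve
from transformers import pipeline


@serve.deployment
class BatchedZeroshot:
    def __init__(self):
        self.model = pipeline("zero-shot-classification", model="../models")

    # @serve.batch buffers concurrent requests and hands this method a list,
    # so several sentences can be scored in one call.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, sentences):
        # Labels hard-coded here just for the sketch; must return one
        # result per input sentence, in the same order.
        return [self.model(s, ["car", "house"]) for s in sentences]

    async def __call__(self, request):
        return await self.handle_batch(request.query_params["sentence"])
```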
Relevant code
```python
import time

import ray
import requests
from faker import Faker
from ray import serve
from transformers import pipeline


@serve.deployment(route_prefix="/", num_replicas=2)
class ZeroshotDeployment:
    def __init__(self):
        start_time = time.time()
        # Define the zero-shot classification pipeline
        self.model = pipeline("zero-shot-classification", model="../models")
        print("End Init: ", time.time() - start_time)

    def __call__(self, request):
        labels = request.query_params["labels"].split(",")
        sentence = request.query_params["sentence"]
        print("RECEIVED: ", sentence)
        return self.model(
            sentence,
            labels,
            hypothesis_template="This text is about {}",
            multi_label=True,
        )


serve.run(ZeroshotDeployment.bind())

# Create a faker instance and generate a list of test sentences
fake = Faker()
arr = [fake.sentence() for _ in range(10)]
print(len(arr))

candidate_labels = ["car", "house", "phone", "teamwork", "passion", "group", "things"]
modified_labels = ",".join(candidate_labels)


# Each task sends one HTTP request to the Serve endpoint
@ray.remote
def send_query(sentence, labels):
    resp = requests.get(
        "http://localhost:8000/", params={"sentence": sentence, "labels": labels}
    ).json()
    return resp


start_time = time.time()
# Launch all queries concurrently, then block until every result is back
result_batch = ray.get([send_query.remote(sentence, modified_labels) for sentence in arr])
print("Time: ", time.time() - start_time)
```
Goal
I have a Hugging Face model defined in Ray Serve along with 2 replicas spun up. When I send requests to the Serve cluster I want both replicas running compute, to theoretically cut the time to process a list of length n in half. By extension, if I have 3 replicas then I would expect some factor-of-x speedup on computing the list.
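To verify the replicas really overlap, my plan is to compare the concurrent timing above against a plain serial baseline (sketch reusing arr and modified_labels from my code):

```python
import time

import requests

# Baseline: send the same requests one at a time.
serial_start = time.time()
serial_results = [
    requests.get(
        "http://localhost:8000/", params={"sentence": s, "labels": modified_labels}
    ).json()
    for s in arr
]
print("Serial time: ", time.time() - serial_start)
# With 2 replicas fully overlapping, the concurrent run above should take
# roughly half this long (plus per-request overhead).
```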
Thanks!