Utilising Ray for Simple Parallelism (Batch Inference)

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hey all :wave::smile:,

I was wondering if Ray would be a good fit for a problem I’m trying to solve.

Context
Currently I have a container running n processes that all run the exact same Python code. In that code I load a big Hugging Face model from disk and run some computations on it. After the computations are run, the model is kept in memory for performance's sake on subsequent calls (it takes ~11 seconds to load the model).

I want to improve performance further by running those computations in parallel, since computation 1 doesn't affect computation 2. I've been looking at some parallelism libraries and chose Ray because the environment this container runs in is susceptible to OOMs, and Ray has some nice out-of-the-box features (retry mechanisms, graceful OOM handling, etc.), so it seemed like a good choice.

Goals

  1. I want the Ray cluster running locally inside the container
  2. I want something inside the cluster that persists with the model loaded into memory, so that I can call it n times from an input list (it's okay if the model isn't loaded on the first call, I can suffer the time it takes to load, but subsequent calls should not load it again)
  3. I want each process isolated, so that if process 1 gets OOM'd it doesn't affect process n

Stretch Goal
Have my n processes all point at whatever is in the Ray cluster and share the memory where the model lives, instead of loading n copies of the model for n processes.
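Roughly what I picture the other processes doing is attaching to one already-running named actor rather than loading their own copy of the model. Untested sketch (the actor name "test" and namespace "testZero" just match the example further down):

import ray

# Another process in the same container attaches to the existing cluster
ray.init(address="auto", namespace="testZero")

# Fetch the already-running detached actor by name; this raises ValueError
# if no actor called "test" has been created yet
model_actor = ray.get_actor("test")

# The model stays in the actor's memory; this process never loads it
print(ray.get(model_actor.compute.remote(21)))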

The Solution?
Now do forgive me because I'm writing this after working on this problem for 10 hours and my brain is mush :sweat_smile:, but I have been playing around with named actors that, on init, would load up the model and also expose some form of compute function.

For example…

import ray
import time

# Assumes we've run `ray start --head` in the container
ray.init(address="auto", namespace="testZero")


@ray.remote
class ZeroTest:
    def __init__(self):
        # Stand-in for the ~11 second Hugging Face model load
        print("Init Model")
        time.sleep(10)
        self.model = 2

    def compute(self, x):
        # Stand-in for the actual computation against the loaded model
        time.sleep(20)
        return self.model * x


# Named, detached actor: any process in the same namespace can grab it with
# get_if_exists, it outlives its creator, and max_concurrency lets multiple
# compute() calls overlap inside the one actor
Model = ZeroTest.options(name="test", get_if_exists=True, lifetime="detached", max_concurrency=100).remote()

array = [1, 2, 3, 4, 5]
result = []
for data in array:
    chunk = Model.compute.remote(data)
    result.append(chunk)
print(ray.get(result))

I believe this would address most of my issues, as having a named actor would allow my n processes running the same code to hook up to what's inside the Ray cluster. Furthermore, having the lifetime set to detached would allow this actor to persist even if all my processes die. However, because this is my first time working with Ray, I'm not sure if I'm missing something: if I simulated a SIGKILL in the compute function while two processes were calling it, I believe both would die. For cases like that, I'm not sure if there's something out-of-the-box, or if I have to move some things around, so that if process 1 gets SIGKILL'd on compute it only affects process 1 and process 2 can continue.
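The closest thing I've found so far is the actor fault-tolerance options plus catching the actor-death exception on the caller side. A rough, untested sketch of what I mean (the restart/retry values are just placeholders), in case it helps clarify the question:

import ray
import time

ray.init(address="auto", namespace="testZero")


# max_restarts lets Ray restart the actor if its process dies (e.g. OOM-killed),
# and max_task_retries resubmits the in-flight calls on the restarted actor
@ray.remote(max_restarts=-1, max_task_retries=-1)
class ZeroTest:
    def __init__(self):
        time.sleep(10)   # stand-in for the ~11s model load
        self.model = 2

    def compute(self, x):
        return self.model * x


Model = ZeroTest.options(
    name="test", get_if_exists=True, lifetime="detached"
).remote()

refs = [Model.compute.remote(x) for x in [1, 2, 3]]
for ref in refs:
    try:
        print(ray.get(ref))
    except ray.exceptions.RayActorError:
        # Raised when the actor process died and the call could not be retried;
        # each caller handles its own failures, so one process failing here
        # doesn't take the others down
        print("call failed for this item")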

Thanks again!

I believe you’re looking for something like Batch Inference –

Here is an example of using Batch Inference with Hugging Face – GPT-J-6B Batch Prediction with Ray AIR — Ray 3.0.0.dev0

Let me know if it makes sense/captures what you’re looking for!
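A minimal sketch of the pattern from that example, with a toy model in place of the GPT-J pipeline (the exact map_batches arguments vary a bit between Ray versions, so treat this as illustrative rather than exact):

import ray

ray.init(address="auto")

# Dataset of inputs to run inference over
ds = ray.data.from_items([1, 2, 3, 4, 5])


class Predictor:
    def __init__(self):
        # Expensive model load happens once per actor in the pool,
        # not once per batch
        self.model = 2

    def __call__(self, batch):
        # batch is a pandas DataFrame with an "item" column
        batch["output"] = [self.model * x for x in batch["item"]]
        return batch


# Ray Data spins up a small pool of long-lived actors and streams
# batches through them in parallel
predictions = ds.map_batches(
    Predictor,
    batch_format="pandas",
    compute=ray.data.ActorPoolStrategy(min_size=1, max_size=2),
)
print(predictions.take(5))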