Here's the code, @Jules_Damji. Trying a smaller batch right now too.
from typing import Dict

import numpy as np
from transformers import AutoTokenizer

class CpuPreprocess:
    def __init__(self):
        TEXT_MODEL = "distilbert-base-multilingual-cased"
        self.tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL, use_fast=True)

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        <SOME list operations to get text>
        all_feats = {}
        tokens = self.tokenizer(
            text, return_tensors="np", padding="longest", truncation=True, max_length=200
        )
        # Wrap each tokenized batch in a single-element list so the batch keeps
        # its own padded length instead of being concatenated row by row.
        all_feats["input_ids"] = [tokens["input_ids"]]
        all_feats["attention_mask"] = [tokens["attention_mask"]]
        return all_feats
import torch

class Inference:
    def __init__(self):
        TEXT_MODEL = "distilbert-base-multilingual-cased"
        model = <LOAD HUGGING FACE TEXT MODEL + SOME CUSTOM NN layers>
        model = model.to("cuda")
        model = torch.nn.DataParallel(model)
        self.model = model

    # Logic for inference on 1 batch of data.
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # start_time = time.time()
        feats = {}
        # The [0] unwraps the single-element list produced by CpuPreprocess.
        feats["input_ids"] = torch.from_numpy(batch["input_ids"][0]).to("cuda")
        feats["attention_mask"] = torch.from_numpy(batch["attention_mask"][0]).to("cuda")
        with torch.inference_mode():
            predictions = self.model(feats)
        return {"predictions": predictions}
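One thing I'm not sure about: whether returning CUDA tensors straight from __call__ is OK, or whether I should move the predictions back to NumPy on the way out, e.g. something like this (just a sketch that assumes predictions is a single tensor, which my real model output may not be):

        with torch.inference_mode():
            predictions = self.model(feats)
        # Move the output to host memory so the returned batch stays a dict of NumPy arrays.
        return {"predictions": predictions.detach().cpu().numpy()}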
If I use
import time

class PreprocessAndInference:
    def __init__(self):
        self.preprocess = CpuPreprocess()
        self.inference = Inference()

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        start_time = time.perf_counter()
        batch = self.preprocess(batch)
        infer = self.inference(batch)
        return {"infer": infer}
as my actor, then it works fine, meaning each batch goes straight from preprocessing to inference.
But that defeats the whole purpose of me using Ray.
Since the CPU preprocessing is expensive, I want to spread it across multiple CPUs and run the inference on GPUs.
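Roughly, this is the two-stage shape I'm after (only a sketch against the Ray 2.6 map_batches API; the read_parquet source, batch sizes, and actor-pool sizes below are placeholders, not my real values):

import ray

ds = ray.data.read_parquet("s3://<bucket>/<path>")  # placeholder data source

# Stage 1: fan the expensive tokenization out over a pool of CPU actors.
ds = ds.map_batches(
    CpuPreprocess,
    batch_size=4096,                               # placeholder CPU-side batch size
    compute=ray.data.ActorPoolStrategy(size=8),    # placeholder pool size
)

# Stage 2: GPU actors; each output row of stage 1 already holds one padded token batch.
ds = ds.map_batches(
    Inference,
    batch_size=1,
    num_gpus=1,                                    # pin each Inference actor to a GPU
    compute=ray.data.ActorPoolStrategy(size=2),    # placeholder pool size
)

results = ds.take_all()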
I want to avoid a smaller batch size because in the CPU preprocessing I want to tokenize many rows at once so that I can make full use of the GPU. For a given batch, the token lengths need to be the same for GPU inference, so I cannot tokenize each row one by one and then take a batch of the results for GPU inference.
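To spell out that constraint, here is a toy example with made-up strings and the same settings as above (tokenizer is the AutoTokenizer from CpuPreprocess): with padding="longest", each call pads only to its own longest sequence, so rows tokenized separately come out with different widths and cannot be stacked into one GPU batch afterwards.

a = tokenizer(["a short sentence"], return_tensors="np",
              padding="longest", truncation=True, max_length=200)
b = tokenizer(["a much longer sentence that pads out to a noticeably wider shape"],
              return_tensors="np", padding="longest", truncation=True, max_length=200)

# Different second dimensions, so the two results cannot be concatenated along axis 0.
print(a["input_ids"].shape, b["input_ids"].shape)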
NOTE: I’m using Ray 2.6.
EDIT:
Even with a smaller batch size I’m seeing the same thing: the GPU is not being used, Inference never shows up in the active tasks, and object store memory keeps filling up.