I’ve known about Ray for a long time but have never used it before. I’m coming from a different orchestration system, Kubeflow Pipelines (KFP), and I want to know whether a workflow I’m used to is possible with Ray.
KFP orchestrates arbitrary containerized command-line programs and passes arbitrary data from outputs to inputs (often by mounting the relevant data files/directories).
Ray does a great job of serializing data and supports many dataset formats. But what if my data is different and I need something lower-level?
Imagine that I have one function that converts a bunch of video files, and another function that trains an ML model on the converted videos.
So I need a way for one task to produce a big (say, 100 GB) multi-file binary dataset that cannot fit in memory and is in a format Ray does not understand, and then pass that dataset to another task.
What would be the best way to do this?
Should I use Ray’s built-in data storage (the object store), or should I avoid it for cases like this?
P.S. With KFP I just write a function and mark certain parameters as input/output paths, and the system does everything related to passing raw data for me (mounting the input data, storing the output data) during distributed execution:
    from kfp.components import InputPath, OutputPath

    def filter_text(
        text_path: InputPath(),
        filtered_text_path: OutputPath(),
        pattern: str,
    ):
        # Imports live inside the function body because KFP serializes it
        # into a standalone component.
        import re

        with open(text_path, 'r') as reader:
            with open(filtered_text_path, 'w') as writer:
                for line in reader:
                    if re.search(pattern, line):
                        writer.write(line)