Hi, I work with an existing home-grown, Python-based distributed processing pipeline that performs various security-focused analyses of files, both plain text and binary. Typically, each file must be processed in its entirety, so the work doesn’t fit well into a DataFrame or map-reduce-style processing paradigm. I’ve recently been investigating Ray as a possible foundation for redesigning the pipeline, even though the use case may be a bit of a stretch.
Is there a way to get the zero-copy behavior that Ray provides for NumPy arrays, but with Python bytes/bytearray objects instead? The lower-level Arrow Plasma store API looks well suited to letting multiple processes access a single copy of a file’s contents via memoryview, but at the level of Ray’s get/put, some of that capability seems to be lost in exchange for the more advanced serialization features.
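To make the question concrete, here is a minimal, stdlib-only sketch of the behavior I’m after. It does not use Ray at all. As I understand it, Ray’s serializer builds on pickle protocol 5 with out-of-band buffers, so this only illustrates that underlying mechanism as I assume it works, not Ray’s actual API. `payload` is a made-up stand-in for a file’s contents.

```python
import pickle

# Hypothetical stand-in for the full contents of one file.
payload = bytearray(b"\x7fELF" + bytes(1_000_000))

# Collect out-of-band buffers instead of copying them into the pickle stream.
# list.append returns None, which tells pickle to keep the buffer out-of-band.
buffers = []
header = pickle.dumps(
    pickle.PickleBuffer(payload),
    protocol=5,
    buffer_callback=buffers.append,
)

# The pickle stream itself is tiny; the megabyte of data travels separately.
print(len(header), len(buffers))

# Deserialize by handing the same buffers back; no copy of the data is made.
restored = pickle.loads(header, buffers=buffers)
view = memoryview(restored)

# Writing through the original is visible through the restored view,
# which shows both refer to the same memory.
payload[0] = 0x00
shares_memory = view[0] == 0x00
print(shares_memory)
```

What I’d like is the equivalent of `view` on the worker side after a get, backed by the shared object store rather than a per-process copy.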
Does anyone have any advice for how Ray might be used efficiently (zero-copy) for unsplittable distributed file processing?