Ray suitability for unsplittable, non-numeric data processing

anderson900 · September 20, 2021, 6:19pm

Hi, I work with an existing home-grown, Python-based distributed processing pipeline that performs various security-focused analyses of files, both plain text and binary. Typically, the files must be processed in their entirety and so don’t fit well into a DataFrame or map-reduce-type processing paradigm. I’ve recently been investigating Ray as a possible foundation for re-designing the pipeline even though the use case may be a bit of a stretch.

I’m curious whether there is a way to get the zero-copy behavior that Ray provides for numpy arrays, but with Python bytes/bytearray instead. It seems like the lower-level Arrow Plasma store API may work well for having multiple processes access a single copy of a file’s contents via memoryview, but at the Ray get/put level some of that capability seems to be lost as the more advanced serialization features are gained.
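To make the question concrete, here is a minimal sketch of one possible workaround: wrap the file's bytes in a one-dimensional uint8 numpy array before calling `ray.put`, then take a `memoryview` of the array inside each task. It assumes Ray's documented behavior that numpy arrays are read from the shared-memory object store without copying on the same node. The helper names `as_uint8_array`, `analyze`, and `run_with_ray` are hypothetical, and the SHA-256 hash merely stands in for a real whole-file analysis.

```python
import hashlib

import numpy as np


def as_uint8_array(data: bytes) -> np.ndarray:
    """Wrap bytes in a 1-D uint8 array without copying (read-only for bytes input)."""
    return np.frombuffer(data, dtype=np.uint8)


def analyze(arr: np.ndarray) -> str:
    """Hypothetical whole-file analysis: hash the contents through a memoryview."""
    view = memoryview(arr)  # buffer-protocol view over the array, no copy
    return hashlib.sha256(view).hexdigest()


def run_with_ray(data: bytes, n_tasks: int = 4) -> list:
    """Put the file contents in the object store once and share them across tasks."""
    import ray  # imported here so the helpers above work without Ray installed

    ray.init(ignore_reinit_error=True)
    # One copy is made here, from this process into the object store.
    ref = ray.put(as_uint8_array(data))
    remote_analyze = ray.remote(analyze)
    # Passing the ObjectRef as a top-level argument lets Ray resolve it to the array.
    return ray.get([remote_analyze.remote(ref) for _ in range(n_tasks)])
```

In this sketch the array each task receives is read-only, so any analysis that needs to modify the contents would have to copy them first. Tasks scheduled on a different node would still need the object transferred to that node's object store.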

Does anyone have any advice for how Ray might be used efficiently (zero-copy) for unsplittable distributed file processing?

Thanks!
