Reading Data in parallel from file and pushing to the plasma object store

Javier_Bosch · April 1, 2021, 6:38pm

Assume I have a directory called data/ with many files numbering over 1K. Total size of the directory is about 20-30 GB. Object store memory is 50GB.

I am attempting to read the data (pytorch tensors) in parallel from disk and put it in the object store.

e.g.

@ray.remote
def read_file(filepath):
    object_ref = ray.put(torch.load(filepath))
    return object_ref


def read_data(filepaths):
    read_file_refs = []

object_refs = ray.get(
    [read_file_refs.append(read_file.remote(filepath)) for filepath in filepaths]
) # This runs very fast to put objects in plasma object store. 

data = ray.get(object_refs) # takes a lot of time ( I am looking to build 1 big dataset)

My questions are:

What is a better way or the most optimal way to read a lot data from the object store?
Am I doing this properly/optimally?

sangcho · April 1, 2021, 6:42pm

Are you purely reading files or doing some map-reduce like operations on top of it?

Javier_Bosch · April 1, 2021, 6:45pm

For now, I am purely reading in the files. The intention is to do some processing on it later, but I want to also aggregate the data back to be fed into a DataLoader, batched and resample.

I am trying to find a faster way to read in the data.

sangcho · April 1, 2021, 7:08pm

If you are reading the data, the current pattern will invoke several copies. For example;

when you first read using torch.load, it copies data to memory/
When you run ray.put, it copies to the shared memory.
Lastly, when you run ray.get(object_refs), it copies data back to your driver (zero copy read is supported only for numpy array with float & integer, so the contents of your file will be copied to your driver process memory)

To optimize, there are multiple things you can do.

If you will process this data, and you’d like to use zero-copy read, you need to do some pre-processing to convert your file contetns to numpy or pandas datafram (which uses float & integer data type).
Otherwise, I just recommend you to use multiple tasks that read partitions from your files directly there (which will minimize the copy cost).

sangcho · April 1, 2021, 7:10pm

Also, cc @yncxcw He has been doing bunch of data loading stuff using Ray, so it could probably give you some feedback!

Topic		Replies	Views
Ray/Plasma backed array	15	1238	March 8, 2021
Putting objects to plasma Ray Core	4	364	February 2, 2021
Ray object store Ray Core	2	783	April 1, 2022
Loading pyarrow.Table directly into ObjectStore from Parquet Ray Core	1	255	November 20, 2023
Plasma store APIs Ray Core	5	2664	December 16, 2022

Reading Data in parallel from file and pushing to the plasma object store

Related topics