Reading Data in parallel from file and pushing to the plasma object store

Assume I have a directory called data/ with many files numbering over 1K. Total size of the directory is about 20-30 GB. Object store memory is 50GB.

I am attempting to read the data (pytorch tensors) in parallel from disk and put it in the object store.

e.g.

@ray.remote
def read_file(filepath):
    object_ref = ray.put(torch.load(filepath))
    return object_ref


def read_data(filepaths):
    read_file_refs = []

object_refs = ray.get(
    [read_file_refs.append(read_file.remote(filepath)) for filepath in filepaths]
) # This runs very fast to put objects in plasma object store. 

data = ray.get(object_refs) # takes a lot of time ( I am looking to build 1 big dataset) 

My questions are:

  1. What is a better way or the most optimal way to read a lot data from the object store?
  2. Am I doing this properly/optimally?

Are you purely reading files or doing some map-reduce like operations on top of it?

For now, I am purely reading in the files. The intention is to do some processing on it later, but I want to also aggregate the data back to be fed into a DataLoader, batched and resample.

I am trying to find a faster way to read in the data.

If you are reading the data, the current pattern will invoke several copies. For example;

  1. when you first read using torch.load, it copies data to memory/
  2. When you run ray.put, it copies to the shared memory.
  3. Lastly, when you run ray.get(object_refs), it copies data back to your driver (zero copy read is supported only for numpy array with float & integer, so the contents of your file will be copied to your driver process memory)

To optimize, there are multiple things you can do.

  1. If you will process this data, and you’d like to use zero-copy read, you need to do some pre-processing to convert your file contetns to numpy or pandas datafram (which uses float & integer data type).
  2. Otherwise, I just recommend you to use multiple tasks that read partitions from your files directly there (which will minimize the copy cost).

Also, cc @yncxcw He has been doing bunch of data loading stuff using Ray, so it could probably give you some feedback!