Assume I have a directory called data/ containing over 1,000 files, with a total size of about 20-30 GB. The object store memory is 50 GB.
I am attempting to read the data (PyTorch tensors) from disk in parallel and put it in the object store.
e.g.
import ray
import torch

@ray.remote
def read_file(filepath):
    # torch.load reads the tensor; ray.put places it in the plasma object store.
    object_ref = ray.put(torch.load(filepath))
    return object_ref

def read_data(filepaths):
    # Launching the tasks and collecting the inner refs runs very fast;
    # the tasks put the objects in the plasma object store.
    read_file_refs = [read_file.remote(filepath) for filepath in filepaths]
    object_refs = ray.get(read_file_refs)
    # This takes a lot of time (I am looking to build 1 big dataset).
    data = ray.get(object_refs)
    return data
My questions are:
- What is a better or optimal way to read a lot of data from the object store?
- Am I doing this properly/optimally?
Are you purely reading files, or are you doing some map-reduce-like operations on top of them?
For now, I am purely reading in the files. The intention is to do some processing later, but I also want to aggregate the data back so it can be fed into a DataLoader, batched, and resampled.
I am trying to find a faster way to read in the data.
If you are just reading the data, the current pattern will incur several copies (see the sketch after this list). For example:
- When you first read the file using torch.load, it copies the data into the worker's memory.
- When you run ray.put, it copies the data into shared memory (the plasma object store).
- Lastly, when you run ray.get(object_refs), it copies the data back to your driver. Zero-copy reads are supported only for numpy arrays with float & integer dtypes, so the contents of your files will be copied into your driver process memory.
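To make that last point concrete, here is a minimal sketch of the zero-copy behavior (the array shape and dtype are arbitrary, chosen only for illustration):

```python
import numpy as np
import ray

ray.init()

# ray.get on a numeric numpy array is a zero-copy read: the returned
# array is a read-only view backed by shared memory.
arr_ref = ray.put(np.zeros((1000, 1000), dtype=np.float64))
arr = ray.get(arr_ref)
print(arr.flags["WRITEABLE"])  # False -> backed by the object store, not copied

# A torch tensor, by contrast, is deserialized on ray.get, so its
# contents end up copied into the driver's process memory.
```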
To optimize, there are multiple things you can do (see the sketch after this list):
- If you will process this data and you'd like to use zero-copy reads, you need to do some pre-processing to convert your file contents to numpy arrays or a pandas DataFrame (which use float & integer data types).
- Otherwise, I recommend using multiple tasks that read and process partitions of your files directly where they run (which will minimize the copy cost).
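A rough sketch combining both suggestions. The partitioning scheme, the `load_and_process_partition` helper, and the processing placeholder are illustrative only, and it assumes the saved tensors are CPU tensors with numeric dtypes whose per-file arrays can be concatenated:

```python
import numpy as np
import ray
import torch

ray.init()

@ray.remote
def load_and_process_partition(filepaths):
    """Read one partition of files, do any processing inside the task,
    and return a numeric numpy array so the driver gets a zero-copy read."""
    arrays = []
    for filepath in filepaths:
        tensor = torch.load(filepath)   # copy: disk -> worker heap
        array = tensor.numpy()          # shares memory with the tensor (no extra copy)
        # ... per-file processing would go here ...
        arrays.append(array)
    # The return value is stored in the object store automatically,
    # so no explicit ray.put is needed.
    return np.concatenate(arrays)

def read_data(filepaths, num_partitions=16):
    # One task per partition instead of one task per file.
    partitions = [filepaths[i::num_partitions] for i in range(num_partitions)]
    refs = [load_and_process_partition.remote(p) for p in partitions if p]
    # Zero-copy reads: each result is a read-only numpy array backed by
    # shared memory, so the data is not duplicated into the driver heap.
    arrays = ray.get(refs)
    # Wrap back into tensors if needed; torch.from_numpy shares memory with
    # the read-only arrays, so copy first if you need to mutate them.
    return [torch.from_numpy(a) for a in arrays]
```

The idea is that the heavy work happens where the data is read, and ray.get only hands the driver shared-memory-backed numpy arrays instead of copying tensors into the driver heap.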
Also, cc @yncxcw. He has been doing a bunch of data loading work with Ray, so he could probably give you some feedback!