Assume I have a directory called data/ containing over 1,000 files, with a total size of about 20-30 GB. Object store memory is 50 GB.
I am attempting to read the data (PyTorch tensors) from disk in parallel and put it in the object store.
```python
import ray
import torch

@ray.remote
def read_file(filepath):
    # Explicitly put the loaded tensor into the object store and
    # return the resulting ObjectRef (so the task result is a
    # nested ObjectRef).
    return ray.put(torch.load(filepath))

def read_data(filepaths):
    read_file_refs = [read_file.remote(fp) for fp in filepaths]
    # This runs very fast to put objects in the plasma object store.
    object_refs = ray.get(read_file_refs)
    # This takes a lot of time (I am looking to build one big dataset).
    data = ray.get(object_refs)
    return data
```
My questions are:
- What is a better or more optimal way to read a lot of data from the object store?
- Am I doing this properly/optimally?