Working with a large dataset


I have gotten significant speedups using Ray for my task, but I have been facing some difficulties with a new approach. To summarise: I have a task A which consists of many subtasks B. Initially I parallelized the subtasks in B while keeping A sequential. I then found a way to perform B as a matrix operation, so now I am trying to parallelise A instead. A single task in A takes less than 10 seconds sequentially (sometimes milliseconds), but after parallelizing it takes 30 seconds on average.

The issue, I think, is that A requires reading from a large dataset (>10 GB), which is what causes the slowdown. I already store the data using ray.put() and pass the references to my function, where I recreate the arrays using ray.get(). I tried instead storing the references in a JSON file and loading it in each function, but ObjectRef is not serializable, and ray.get() requires the ObjectRef object itself; it cannot work with the string representation of the reference. Is there any way to do this, or are there other ideas I could try? Please let me know if I have been too vague in describing the problem.

I think this approach should work. It may be slow if the data format isn’t zero-copy deserializable: for example, a numpy array can be deserialized (ray.get) without copy overhead, but a Python dict cannot.
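The zero-copy distinction above can be illustrated with plain pickle protocol 5 (PEP 574), which is the same mechanism in spirit: a numpy array exposes its payload as an out-of-band buffer, while a dict must be rebuilt object by object on load. A small sketch:

```python
import pickle
import numpy as np

arr = np.arange(10, dtype=np.int64)

# Out-of-band serialization: the array's payload is handed over as a
# buffer via buffer_callback instead of being copied into the pickle
# stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# Deserialization reuses those buffers, so restoring the array can
# avoid copying the numeric payload.
restored = pickle.loads(payload, buffers=buffers)
assert np.array_equal(restored, arr)

# A plain dict has no buffer protocol: unpickling always rebuilds the
# Python objects, which is why fetching a dict always pays a copy.
d = {"a": list(range(10))}
restored_d = pickle.loads(pickle.dumps(d, protocol=5))
assert restored_d == d
```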

@ericl So I am actually working with a dictionary of arrays. This time I tried to .put() the individual arrays instead of the whole dictionary, hoping for an improvement in deserialization, but the issue now is that .put() itself is very time-consuming.
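One possible workaround for the many-small-puts cost (my assumption, not something confirmed in the thread): pack the dict of arrays into a single contiguous array plus a small offsets map, so only one large object needs storing, and workers slice views out without copying. A pure-numpy sketch of the packing, with illustrative names:

```python
import numpy as np

# Hypothetical dict of arrays, standing in for the real >10 GB data.
data = {"x": np.arange(5.0), "y": np.arange(3.0), "z": np.arange(7.0)}

# Record (start, end) offsets for each named array, then concatenate
# everything into one contiguous buffer.
offsets, cursor = {}, 0
for name, arr in data.items():
    offsets[name] = (cursor, cursor + arr.size)
    cursor += arr.size
packed = np.concatenate(list(data.values()))

# A worker that receives `packed` (e.g. via a single ObjectRef) and the
# small `offsets` dict can slice out any named array as a view, with no
# copy of the underlying data.
start, end = offsets["y"]
view = packed[start:end]
assert np.array_equal(view, data["y"])
```

This trades one large store operation for many small ones; whether it helps depends on how fragmented the original dict is.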