Hello, thanks for building a great library!
I am not sure if this is possible, but essentially I would like to run concurrent ray.get() calls inside a DataLoader, so that multiple workers pull data from the object store onto the driver process at the same time while other work continues in a pipelined manner. The code looks roughly like this:
import multiprocessing as mp
import ray

# Somewhere in driver code
refs = [ray.put(x) for x in (my_big_data_A, my_big_data_B, my_big_data_C)]

def ray_get_worker(ref):
    return ray.get(ref)

# Later on in the DataLoader
def __next__(self):
    # I want to fetch all refs concurrently while other work is ongoing
    ray.get(refs)  # too slow AND blocks

    p = mp.Pool(4)
    parallel_get = p.imap(ray_get_worker, refs)  # doesn't block and, in theory, gives 4 workers (doesn't work though ;)

    # work that takes ~500ms (I would like ray.get() to be "running" during this time)

    # grab the ray.get() results
    for result in parallel_get:
        # do something with the local data
        ...
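
To make the intent more concrete, here is a rough sketch of the pattern I have in mind, using a ThreadPoolExecutor on the driver instead of multiprocessing (the class name PrefetchingLoader and the placeholder payloads are made up, and I'm assuming ray.get() can safely be called from background threads):

import concurrent.futures
import ray

ray.init()

# Placeholder payloads -- in my real code these are much larger
big_blobs = [b"A" * 10**6, b"B" * 10**6, b"C" * 10**6]
refs = [ray.put(blob) for blob in big_blobs]

class PrefetchingLoader:
    def __init__(self, refs, workers=4):
        self.refs = refs
        # Threads live in the driver process, so ray.get() in a thread
        # reads from the local object store rather than re-serializing
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=workers)

    def __next__(self):
        # Kick off all fetches without blocking the main thread
        futures = [self.pool.submit(ray.get, ref) for ref in self.refs]
        # ... the ~500ms of other work would happen here ...
        # Collect the results once they are actually needed
        return [f.result() for f in futures]

loader = PrefetchingLoader(refs)
batch = next(loader)  # list of the three payloads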
I read that Ray is faster than multiprocessing (which would have to serialize the large data structures when returning them to the driver), so I understand that using multiprocessing here won't work, but I am looking for an equivalent way to ray.get() multiple refs in a non-blocking, synchronous manner, if such a thing exists!
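
Is something along the lines of ray.wait() with timeout=0 the intended approach here? A rough sketch of what I mean (the payloads are again placeholders):

import ray

ray.init()

refs = [ray.put(b"X" * 10**6) for _ in range(3)]

# timeout=0 makes this a non-blocking readiness check
ready, not_ready = ray.wait(refs, num_returns=len(refs), timeout=0)

# ... the ~500ms of other work would happen here ...

results = ray.get(ready)       # already local, should be fast
results += ray.get(not_ready)  # blocks until the rest arrive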