Data exchange between workers

phi32 · July 28, 2021, 2:16pm

Hi,

I need some advice on how I can structure data exchange between workers.
My project has one task that creates lots of coordinates and a second task that evaluates a function (and does some other stuff) with the coordinate and returns the results to some owner that processes the results.
The time both tasks take will vary over the execution of the program.
I want to execute both tasks at the same time so that coordinates are produced, put in some place and whenever task two is free it can access a new coordinate to evaluate.
What would be the best data structure to share data between the two tasks so they can work independently and in parallel? How do I make it accessible to both tasks?
Is this possible with Ray, or do I have to use some other framework/library?

Thank you very much.

sangcho · July 28, 2021, 5:54pm

Do you want this to be done in a streaming manner, or will the evaluation always run only after a full batch of coordinates has been produced?

phi32 · July 29, 2021, 6:20am

Ideally, this would happen in a streaming manner; otherwise the waiting time for a batch might become very long.

ericl · August 3, 2021, 8:10pm

Is this something Datasets can implement? See "Datasets: Distributed Arrow on Ray" in the Ray v2.0.0.dev0 documentation.

For example (not sure how coordinates are produced, but say we have a function to compute the ith coordinate get_coord(i)):

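import ray
# num_coords, get_coord and evaluate_coord come from your own application
ds = ray.data.range(num_coords)
ds = ds.map(get_coord)       # produce the ith coordinate
ds = ds.map(evaluate_coord)  # evaluate each coordinate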

To perform this in a streaming manner, you can use the dataset pipelining feature:

pipe = ray.data.range(num_coords).pipeline(parallelism=10)
pipe = pipe.map(get_coord)
pipe = pipe.map(evaluate_coord)
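
For intuition about what "streaming" means here, the following is a minimal single-process sketch of the producer/consumer pattern using only the Python standard library. It is not Ray code and not how Datasets works internally: the bodies of get_coord and evaluate_coord, the run_stream helper, and the SENTINEL marker are all hypothetical placeholders. A bounded queue lets the producer run ahead of the consumer by at most maxsize items, so neither side waits for a whole batch. Ray also ships ray.util.queue.Queue, which as far as I know offers a similar put/get interface across workers, but it is not used in this sketch.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def get_coord(i):
    # Hypothetical stand-in for the real coordinate generator.
    return (i, i * 2)


def evaluate_coord(coord):
    # Hypothetical stand-in for the real evaluation function.
    x, y = coord
    return x + y


def producer(coords, num_coords):
    for i in range(num_coords):
        coords.put(get_coord(i))  # blocks while the queue is full
    coords.put(SENTINEL)  # tell the consumer there is nothing more to read


def consumer(coords, results):
    while True:
        coord = coords.get()  # blocks until a coordinate is available
        if coord is SENTINEL:
            break
        results.append(evaluate_coord(coord))


def run_stream(num_coords, maxsize=4):
    # Bounded queue: the producer can be at most `maxsize` items ahead.
    coords = queue.Queue(maxsize=maxsize)
    results = []
    threads = [
        threading.Thread(target=producer, args=(coords, num_coords)),
        threading.Thread(target=consumer, args=(coords, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling run_stream(5) produces and evaluates five coordinates with both threads running concurrently. With a single consumer the results come back in production order; with several consumers that ordering would no longer be guaranteed.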
