[Data] Async functions in map_batches

codedecde · November 14, 2024, 11:32pm

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

A project I’m working on essentially needs to read URLs from a Parquet table, download them in bulk, process them, and then upload the results in bulk. I wanted to use asyncio for the bulk download and upload steps within each batch.
However, within map_batches, calling an async function doesn’t seem to be an option, and wrapping the call in asyncio.run seems to cause a hang. I’d be grateful for any advice on how to go about this, or confirmation that it just isn’t possible.

Thank you!

mowen · November 18, 2024, 7:11pm

There should be support for asynchronous execution (more on that in this PR), but it sounds like there might be an easier solution for what you are trying to do. One of the core features of Ray Data is that it uses streaming execution rather than bulk execution. The native way to do an operation like this would be to use some combination of read_parquet, map_batches and write_parquet.
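
As a rough sketch of that combination, something like the following could work. The column name "url", the batch size, the paths, and the process_batch body are all assumptions for illustration, not details from the original question; the real download and processing logic would replace the placeholder step.

```python
def process_batch(batch: dict) -> dict:
    # Hypothetical per-batch step: the real download/processing would go here.
    # Assumes the Parquet table has a column named "url".
    urls = [str(u) for u in batch["url"]]
    return {"url": urls, "url_length": [len(u) for u in urls]}


def run_pipeline(input_path: str, output_path: str) -> None:
    # Imported inside the function so process_batch can be exercised without Ray.
    import ray

    ds = ray.data.read_parquet(input_path)              # read the URL table
    ds = ds.map_batches(process_batch, batch_size=256)  # transform each batch
    ds.write_parquet(output_path)                       # write the results
```

Calling run_pipeline with an input and an output location runs the three stages as one pipeline, so there is no separate "download everything first" step to manage by hand.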

This means that Ray Data will begin processing data as files are read and memory becomes available, and will begin writing once data has finished processing. If you are using Ray Data to manage this, then there should not be a need to write async functions within map_batches.
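
If asyncio inside a single batch is still wanted, one possible workaround is to keep the function passed to map_batches synchronous and run the coroutine on its own event loop in a separate thread. This is only a sketch: fetch_one is a hypothetical stand-in for a real async download, the "url" column name is assumed, and I have not verified that this avoids the hang described above.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_one(url: str) -> bytes:
    # Hypothetical stand-in for a real async download via an HTTP client.
    await asyncio.sleep(0.01)
    return f"payload for {url}".encode()


async def fetch_all(urls, max_in_flight: int = 16):
    # Bound concurrency so one batch cannot start unlimited downloads at once.
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with sem:
            return await fetch_one(url)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def download_batch(batch: dict) -> dict:
    # Synchronous entry point: dict of columns in, dict of columns out.
    urls = [str(u) for u in batch["url"]]
    # Run the coroutine on a fresh event loop in a worker thread, so it does
    # not depend on whether the calling thread already has a running loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        payloads = pool.submit(asyncio.run, fetch_all(urls)).result()
    return {"url": urls, "payload": payloads}
```

download_batch could then be passed to map_batches like any other synchronous function. The calling thread blocks until the whole batch has finished downloading.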
