[Data] Async functions in map_batches

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

A project I’m working on essentially needs to read URLs from a Parquet table, download them in bulk, process them, and then upload the results in bulk. For that, I wanted to use asyncio for the bulk download and upload steps within a batch.
However, within map_batches, calling an async function doesn’t seem to be an option, and running it with asyncio.run seems to cause a hang. I would be grateful for any advice on how to go about this, or to know if it is just not possible.

Thank you!

There should be support for asynchronous execution (more on that in this PR), but it sounds like there might be an easier solution for what you are trying to do. One of the core features of Ray Data is that it uses streaming execution (rather than bulk execution). The native way to do an operation like this would be to use some combination of read_parquet, map_batches and write_parquet.
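As a rough sketch of that combination (not a definitive implementation): `process_batch`, the `url` column name, and the input and output paths are all hypothetical placeholders, and the example assumes the default batch format, where each batch is a dict of column arrays.

```python
def process_batch(batch: dict) -> dict:
    # Hypothetical per-batch processing: here we only compute the length of
    # each URL. A real function would download and process the content.
    # Assumes the table has a "url" column (an assumption, not from the post).
    batch["url_length"] = [len(str(u)) for u in batch["url"]]
    return batch


def run_pipeline(input_path: str, output_path: str) -> None:
    # Imported lazily so that process_batch can be tried without Ray installed.
    import ray

    ds = ray.data.read_parquet(input_path)   # read the table of URLs
    ds = ds.map_batches(process_batch)       # process batch by batch
    ds.write_parquet(output_path)            # write the processed batches


# Example call (placeholder paths):
# run_pipeline("path/to/input", "path/to/output")
```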

This means that as files are downloaded and memory becomes available, Ray Data begins processing the data, and once data has been processed, writing begins. If you are using Ray Data to manage this, there should be no need to write async functions within map_batches.
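If you still want asyncio inside a batch, one possible workaround is to keep the function passed to map_batches synchronous and run the coroutines on a fresh event loop in a separate thread. This is only a sketch, under the unconfirmed assumption that the hang comes from calling asyncio.run where an event loop is already running. `fetch_one`, `fetch_all`, `run_in_fresh_loop`, `download_batch`, and the `url` and `content` column names are hypothetical, and `fetch_one` is a stand-in for a real async download.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_one(url: str) -> bytes:
    # Hypothetical stand-in for a real async download with an HTTP client.
    await asyncio.sleep(0.01)
    return url.encode()


async def fetch_all(urls):
    # Run all downloads for the batch concurrently; results keep input order.
    return await asyncio.gather(*(fetch_one(u) for u in urls))


def run_in_fresh_loop(coro_fn, *args):
    # asyncio.run creates a new event loop in the worker thread, so it does
    # not conflict with a loop that may already be running in the caller.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(coro_fn(*args))).result()


def download_batch(batch: dict) -> dict:
    # Synchronous from the caller's point of view; async only internally.
    urls = [str(u) for u in batch["url"]]
    batch["content"] = run_in_fresh_loop(fetch_all, urls)
    return batch


# Hypothetical usage with a Ray Dataset `ds` that has a "url" column:
# ds = ds.map_batches(download_batch)
```

The calling thread blocks until the whole batch has been fetched, so concurrency happens within a batch while Ray Data still streams batches through the pipeline.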