Code best practice (or lib?) for fastest parallel downloads

mbehrendt · September 29, 2021, 6:58am

There is a multitude of libs in Python for downloading data, and many ‘best practices’ exist wrt multi-threading, multi-processing, chunking, range reads, streaming, I/O vs. number of vCPUs, etc. Downloading is basic functionality, but given how pervasive it is, I think it would be very helpful in the context of Ray to provide some sort of guidance (ideally via a code snippet, or maybe even some sort of Ray-provided lib?) on the fastest way to download files and process them within Ray. In this case, I would define “fastest” as the approach that achieves the highest utilization of the vNIC.

WDYT? Does something like this exist and I’m maybe looking in the wrong place?
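
To make the question concrete, here is a minimal stdlib-only sketch of the "chunking + range reads + multi-threading" combination mentioned above. It is not a Ray API and not a benchmarked best practice. The names `split_ranges`, `http_range_fetcher`, and `parallel_download` are hypothetical, and it assumes the server honors HTTP `Range` requests and that the total size is already known (e.g. from a `Content-Length` header).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple
import urllib.request


def split_ranges(total_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges that cover total_size bytes."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def http_range_fetcher(url: str) -> Callable[[int, int], bytes]:
    """Build a fetch(start, end) function that issues HTTP Range requests.

    Assumption: the server supports Range requests; a server that ignores
    the header would return the whole file for every chunk.
    """
    def fetch(start: int, end: int) -> bytes:
        req = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end}"}
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch


def parallel_download(
    fetch: Callable[[int, int], bytes],
    total_size: int,
    chunk_size: int,
    max_workers: int = 16,
) -> bytes:
    """Fetch all chunks concurrently and join them in order."""
    ranges = split_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the chunks
        # can be concatenated directly.
        chunks = pool.map(lambda r: fetch(*r), ranges)
        return b"".join(chunks)
```

Threads are used here because downloading is I/O-bound. Good values for `chunk_size` and `max_workers` depend on the network and the server, so they would need to be measured rather than guessed.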

yic · October 1, 2021, 8:41pm

Hi,
We don’t explicitly support a download feature right now. From my understanding, “downloading” actually covers several different use cases.

To download data from a remote node
This is done automatically by Ray core in the object store. Basically, whenever you pass an object ref around and deref it, the object is downloaded automatically.
To sync the working dir to a remote node
We have runtime env for this which you can have a try (Advanced Usage — Ray v1.6.0)
To load data
We have dataset to load data (Dataset API Reference — Ray v1.6.0)
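
As a rough illustration of the first case, here is a minimal sketch of downloading inside Ray tasks and passing the resulting object refs to other tasks. The helper names `fetch`, `size_of`, and `download_with_ray` are hypothetical, and `ray` is assumed to be installed. Only core calls are used (`ray.init`, `ray.remote`, `ray.get`).

```python
import urllib.request
from typing import List


def fetch(url: str) -> bytes:
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def download_with_ray(urls: List[str]) -> List[int]:
    """Fetch each URL in its own Ray task and return the payload sizes."""
    # Imported here so that fetch() stays usable without Ray installed.
    import ray

    if not ray.is_initialized():
        ray.init()

    # Wrap the plain function as a remote task; Ray decides which node
    # runs each task.
    fetch_task = ray.remote(fetch)

    @ray.remote
    def size_of(blob: bytes) -> int:
        # blob arrives already dereferenced: Ray pulls the object from
        # whichever node's object store holds it before the task starts.
        return len(blob)

    blob_refs = [fetch_task.remote(url) for url in urls]
    # Passing object refs as arguments, no explicit transfer code needed.
    size_refs = [size_of.remote(ref) for ref in blob_refs]
    return ray.get(size_refs)
```

Whether this actually saturates the NIC of every node depends on how the tasks get scheduled across the cluster and on the size of the files, so treat it as a starting point rather than a tuned solution.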

mbehrendt · December 1, 2021, 6:53pm

Thanks a lot for the feedback.

One quick question re the use of the Dataset API for loading data: would that data be loaded from more than one machine (i.e. ideally from all machines), so that the vNICs of all VMs can pull in data in a highly parallel fashion?
