Code best practice (or lib?) for fastest parallel downloads

There is a multitude of libs in Python for downloading data, and many ‘best practices’ exist wrt multi-threading, multi-processing, chunking, range reads, streaming, I/O vs. number of vCPUs, etc. Although this is basic functionality, it is so widely applicable that I think it would be very helpful for Ray to provide some guidance (ideally via a code snippet, or maybe even some sort of Ray-provided lib?) on the fastest way to download files and process them within Ray. In this case, I would define “fastest” as the approach that achieves the highest utilization of the vNIC.

WDYT? Does something like this exist, and am I maybe just looking in the wrong place?
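To make the question concrete, here is a minimal sketch of the "range reads + multi-threading" pattern mentioned above, using only the Python standard library. It is not a Ray API and not a tuned recommendation: the function names (`byte_ranges`, `http_range_fetch`, `parallel_download`), the chunk size, and the worker count are all illustrative assumptions. It also assumes the server honors HTTP `Range` requests and that the file size is known up front.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple
import urllib.request


def byte_ranges(total_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into inclusive (start, end) pairs for Range headers."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def http_range_fetch(url: str, start: int, end: int) -> bytes:
    """Fetch bytes start..end (inclusive) of url.

    Assumption: the server supports Range requests. A server that ignores
    the header returns the whole body, which this sketch does not detect.
    """
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def parallel_download(
    url: str,
    total_size: int,
    chunk_size: int = 8 * 1024 * 1024,  # illustrative value, not a tuned one
    max_workers: int = 16,              # illustrative value, not a tuned one
    fetch: Callable[[str, int, int], bytes] = http_range_fetch,
) -> bytes:
    """Download url in chunks using a thread pool and return the joined bytes."""
    ranges = byte_ranges(total_size, chunk_size)
    # Threads suit this workload because blocking socket reads release the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so chunks join correctly.
        chunks = pool.map(lambda r: fetch(url, r[0], r[1]), ranges)
        return b"".join(chunks)
```

The `fetch` parameter is injectable so the chunking logic can be exercised without a network, e.g. `parallel_download("unused", len(data), chunk_size=100, fetch=lambda u, s, e: data[s:e + 1])`. Whether this saturates a vNIC depends on chunk size, worker count, and the remote endpoint, which is exactly the tuning guidance being asked for.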

We don’t explicitly support a download feature right now. From my understanding, “downloading” covers several different use cases:

  • To download data from a remote node
    This is done automatically by Ray Core via the object store. Whenever you pass an object ref around and dereference it, the object is downloaded automatically.
  • To sync the working dir to a remote node
    We have runtime environments for this, which you can try (Advanced Usage — Ray v1.6.0)
  • To load data
    We have the Dataset API for loading data (Dataset API Reference — Ray v1.6.0)

Thx a lot for the feedback.

One quick question re the use of the Dataset API for loading data: would that data be loaded from more than one machine (ideally from all machines), so that the vNICs of all VMs can pull in data in a highly parallel fashion?