Working with large data that do not fit on the disks of a cluster

How severe does this issue affect your experience of using Ray?

  • Low: Just curious


My usecase is the following:
I work on the Azure Cloud.
I have a ML model that is executed on a GPU. I am loading the (video and image) data that I need to train the model via http using internal code.

I am using an actor-pool to prefetch & prepare the data for training (augmentations…) so the GPU can be kept busy most of the time.
Since the GPU nodes I have access to do not have enough CPUs I have an extra node with many CPUs to run that workload.

Since the data-fetching via http is quite slow, I also cache the data to Disk. By default, in the cluster disks only have a size of 150 GB it seems. This is not configurable.
Also, I would like to share the cache between all nodes, e.g. via a shared disk.

I have tried mounting blob-storage, however that is very slow.
In Azure, data disks can only be attached to a small number of VMs (max 3-10) → I cannot mount a data disk to all my VMs.

My questions are:

  • Does my approach for fetching and caching data make sense with ray?
  • How can I configure disk sizes?
  • How can I share a large cache between all nodes?

I feel like my understanding of how to use ray with large amounts of data is off, so a clarification how this is supposed to be done in ray would be appreciated!

Thank you!

1 Like

@gramhagen hey, do you have any insight on how this user can better leverage Azure?

@M_S have you tried using Ray Datasets? I think it should fit your use case well.

As Cade mentioned above, Ray Dataset is built for this type of problem and we’re working on Ray AI Runtime to facilitate distributed data IO + processing + ingest into trainers. You can try it out on our nightly wheels with docs: Setting up Data Ingest — Ray 3.0.0.dev0

Hi @cade, @Jiao_Dong

I did look into ray datasets a bit but not extensively.
I think in terms of fetching speed there would be no difference to what I am doing now, since I would just use ray-datasets as a wrapper around my fetching via http and then applying transforms.
Or is there another big advantage of datasets I am missing here (except of course that ray internally probably has nicer APIs when working with datasets)? Are they better at distributing the data loading to different nodes than a worker pool that is designed to do just that is?

Also, can you comment on my point about video/image data and caching those somehow on disks?
Is there a way to use bigger disks in the VMs? Currently I have to fetch them via the network every time but I would prefer to have them cached on disk during training.


I think there are 3 options for expanding capacity.

  1. Mount a file share as a network drive to existing nodes: Mount SMB Azure file share on Linux | Microsoft Docs
  2. Add data disks to exiting nodes: Attach a data disk to a Linux VM - Azure Virtual Machines | Microsoft Docs
  3. Modify azure-vm template and rebuild ray wheel, then use new wheel during deployment.
    Adding “diskSizeGB: XX” to this property (ray/azure-vm-template.json at e4a4f7de70e0105d94ff732215e252da3765e1fc · ray-project/ray · GitHub) will let you set the desired os disk size. Or add a section for a data disk, see reference here: Microsoft.Compute/virtualMachines 2018-10-01 - Bicep & ARM template reference | Microsoft Docs

1 and 2 are easy fixes, but require manual steps after cluster deployment and don’t really work well for autoscaling, so you would need to fix the min and max nodes of the cluster.
3 is a bit more complicated and ideally there would be a way to pass parameters from the ray yaml down to vm creation, but there’s no way to do that today.

1 Like

I’ve moved this over to the Ray Data category, hopefully @Jiao_Dong et al can give you what you need!