How severely does this issue affect your experience of using Ray?
Low: Just curious
Hello!
My use case is the following:
I work on the Azure Cloud.
I have an ML model that runs on a GPU. I load the (video and image) data needed to train the model over HTTP using internal code.
I am using an actor pool to prefetch and prepare the data for training (augmentations, …) so the GPU can be kept busy most of the time.
Since the GPU nodes I have access to do not have enough CPUs, I have an extra node with many CPUs to run that workload.
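Roughly, the prefetching part looks like this (a minimal sketch; `fetch_via_http` and `augment` are placeholders for my internal code):

```python
import ray
from ray.util import ActorPool

ray.init()  # or ray.init(address="auto") when running on the cluster

def fetch_via_http(url):
    # Placeholder for the internal http loading code; returns raw bytes.
    return b"raw-bytes-for-" + url.encode()

def augment(raw):
    # Placeholder for decoding + augmentations; returns a training-ready sample.
    return raw

@ray.remote(num_cpus=1)
class Preprocessor:
    # CPU-only actors, so they get scheduled onto the big CPU node.
    def prepare(self, url):
        return augment(fetch_via_http(url))

urls = [f"https://internal-storage/data/{i}" for i in range(100)]
pool = ActorPool([Preprocessor.remote() for _ in range(8)])

# map_unordered yields prepared samples as soon as any actor finishes,
# so the GPU training loop can consume them while fetching continues.
for sample in pool.map_unordered(lambda actor, u: actor.prepare.remote(u), urls):
    pass  # feed `sample` into the training step here
```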
Since fetching the data over HTTP is quite slow, I also cache the data to disk. By default, the disks in the cluster only seem to have a size of 150 GB, and this does not appear to be configurable.
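The caching itself is simple, something like this sketch (the cache directory is just an assumption, and `fetch_fn` stands for the internal http code):

```python
import hashlib
import pathlib

CACHE_DIR = pathlib.Path("/tmp/data-cache")  # assumed local cache location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch_with_cache(url, fetch_fn):
    # Return the bytes for `url`, reading from the local disk cache when
    # possible; `fetch_fn` is the existing internal http fetching code.
    path = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = fetch_fn(url)
    path.write_bytes(data)
    return data
```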
Also, I would like to share the cache between all nodes, e.g. via a shared disk.
I have tried mounting blob storage; however, that is very slow.
In Azure, data disks can only be attached to a small number of VMs (max 3-10) → I cannot mount a data disk to all my VMs.
My questions are:
Does my approach for fetching and caching data make sense with ray?
How can I configure disk sizes?
How can I share a large cache between all nodes?
I feel like my understanding of how to use Ray with large amounts of data is off, so a clarification of how this is supposed to be done in Ray would be appreciated!
As Cade mentioned above, Ray Dataset is built for this type of problem and we’re working on Ray AI Runtime to facilitate distributed data IO + processing + ingest into trainers. You can try it out on our nightly wheels with docs: Setting up Data Ingest — Ray 3.0.0.dev0
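As a rough sketch of what that could look like (here `fetch_and_prepare` is just a stand-in for your existing http fetching + augmentation code):

```python
import ray

def fetch_and_prepare(row):
    # Stand-in for your http fetch + decode + augmentation for one sample.
    row["sample"] = b"decoded-and-augmented-bytes"
    return row

urls = [f"https://internal-storage/data/{i}" for i in range(100)]
ds = ray.data.from_items([{"url": u} for u in urls])

# The map runs as Ray tasks spread across the cluster's CPU nodes,
# so you don't need to manage an actor pool yourself.
ds = ds.map(fetch_and_prepare)

# On the GPU side, the trainer consumes prepared batches.
for batch in ds.iter_batches(batch_size=32):
    pass  # feed `batch` into the training step here
```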
I did look into Ray Datasets a bit, but not extensively.
I think in terms of fetching speed there would be no difference to what I am doing now, since I would just use Ray Datasets as a wrapper around my HTTP fetching and then apply transforms.
Or is there another big advantage of Datasets that I am missing here (apart from Ray internally probably having nicer APIs when working with Datasets)? Are they better at distributing the data loading across nodes than a worker pool that is designed to do just that?
Also, can you comment on my point about video/image data and caching it somehow on disk?
Is there a way to use bigger disks in the VMs? Currently I have to fetch the data over the network every time, but I would prefer to have it cached on disk during training.
1 and 2 are easy fixes, but they require manual steps after cluster deployment and don't really work well with autoscaling, so you would need to fix the min and max node counts of the cluster.
3 is a bit more complicated; ideally there would be a way to pass parameters from the Ray cluster YAML down to VM creation, but there's no way to do that today.