@Jules_Damji, this is my first time using Ray. I’m attempting to convert a PyTorch training script to function in a distributed manner on a Ray Cluster. I aim to keep code alterations to a minimum during the transition from a single laptop to a Ray cluster.
I tried to follow the examples for the Transformers and PyTorch libraries:
https://docs.ray.io/en/latest/train/examples/transformers/transformers_example.html
https://docs.ray.io/en/latest/train/examples/pytorch/torch_fashion_mnist_example.html
At first glance, only the training part needs to change: it has to use a ray.train.torch.torch_trainer.TorchTrainer and wrap the DataLoaders with ray.train.torch.prepare_data_loader.
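For anyone following along, here is a minimal sketch of what that change might look like. It assumes Ray 2.7 or newer (where ray.train.report and ray.train.ScalingConfig are available); the random tensors, the linear model, and the DEFAULT_CONFIG values are placeholders I made up, not part of my actual script.

```python
# Hypothetical hyperparameters, for illustration only.
DEFAULT_CONFIG = {"batch_size": 32, "lr": 0.01, "epochs": 2}


def train_func(config):
    # Imports live inside the function so they are resolved on the Ray workers.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import ray.train
    from ray.train.torch import prepare_data_loader, prepare_model

    # Placeholder data and model; a real script would build its own here.
    features = torch.randn(256, 16)
    targets = torch.randn(256, 1)
    loader = DataLoader(
        TensorDataset(features, targets),
        batch_size=config["batch_size"],
        shuffle=True,
    )
    model = torch.nn.Linear(16, 1)

    # The two Ray-specific changes: wrap the model and the DataLoader.
    model = prepare_model(model)
    loader = prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.MSELoss()
    for epoch in range(config["epochs"]):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        # Report the last batch loss of the epoch back to the driver.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def build_trainer(num_workers=2, use_gpu=False):
    # Imported lazily so this module can be loaded without Ray installed.
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config=DEFAULT_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
```

Calling build_trainer().fit() on the cluster would start the run. The rest of the training loop stays the same as in the single-machine version.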
The dataset, however, turned out to be more complicated. It is stored on a local disk and consists of multiple audio files and a metadata file, taking up a few gigabytes of disk space, which is not excessive.
Naively running the script as is fails with a missing-files error. So, I tried to rsync the files directly to the nodes using file_mounts in ray_cluster.yaml:
file_mounts:
    ~/src/: src/
    # The following line rsyncs all files to nodes.
    ~/data/: data/
    ~/poetry.lock: poetry.lock
    ~/poetry.toml: poetry.toml
    ~/pyproject.toml: pyproject.toml
    ~/requirements.txt: requirements.txt
I haven’t succeeded with this, and in any case it seemed like the wrong approach to me.
Another option that you suggested is to use Object Storage. However, it might have performance issues since it requires processing numerous object references. Keeping the dataset in the Object Storage might be convenient only when the dataset is a single file. I also believe such an approach might have other drawbacks once the dataset grows and we want to productionize our code. Although I generally prefer the Ray Client approach for its interactive feel, I have decided to use Ray Jobs since it is the recommended approach.
So, the next step is to keep all files in S3-like storage. It looks like the only feasible way to make the dataset available to every node in the distributed case.
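As a rough idea of what this could look like on the worker side, here is a small sketch that pulls the dataset from the bucket to local disk once and reuses it afterwards. It relies on pyarrow.fs.copy_files, which resolves the filesystem from the URI scheme; the helper name, the marker file, and any bucket path passed to it are my own assumptions. It does not guard against several workers on the same node downloading at the same time, so a file lock would probably be needed in practice.

```python
import os


def ensure_local_dataset(remote_uri, local_dir):
    """Copy the dataset from remote_uri into local_dir unless already done.

    Hypothetical helper: a marker file records that a previous download
    finished, so later calls return immediately.
    """
    marker = os.path.join(local_dir, ".download_complete")
    if os.path.exists(marker):
        return local_dir

    # Imported lazily so the cached path works even without pyarrow.
    from pyarrow import fs

    os.makedirs(local_dir, exist_ok=True)
    # copy_files copies a directory recursively; the source filesystem
    # (for example S3) is inferred from the URI scheme.
    fs.copy_files(remote_uri, local_dir)

    # Write the marker only after the copy has finished.
    with open(marker, "w") as f:
        f.write(remote_uri)
    return local_dir
```

The idea would be to call this at the top of the per-worker training function, so the existing file-based dataset code keeps reading from a local directory.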
Here are my conclusions. Reading Ray’s documentation, I could not find an explicit recommendation on how to deal with such cases, even though most prototypes start out with locally stored datasets. It would therefore be helpful if the documentation advised newcomers to transfer the dataset to S3 first. The “Ray Client” and “Ray Job” concepts could also be described in more detail, with examples. Most examples rely on Ray Client, yet Ray Jobs is the recommended approach and Ray Client is suggested for experts only.
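To illustrate the kind of side-by-side example I have in mind, here is a sketch of both approaches. It uses ray.job_submission.JobSubmissionClient for Ray Jobs and ray.init with a ray:// address for Ray Client. The dashboard URL, the port numbers (the usual defaults, as far as I understand), and the src/train.py entrypoint are assumptions for illustration.

```python
def build_runtime_env(working_dir=".", requirements="requirements.txt"):
    # The runtime environment tells Ray which local code to upload and
    # which pip requirements file to install on the cluster.
    return {"working_dir": working_dir, "pip": requirements}


def submit_training_job(
    dashboard_url="http://127.0.0.1:8265",
    entrypoint="python src/train.py",  # hypothetical script path
):
    # Ray Jobs: the script runs entirely on the cluster, and the local
    # machine only submits it and can disconnect afterwards.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    return client.submit_job(
        entrypoint=entrypoint,
        runtime_env=build_runtime_env(),
    )


def connect_with_ray_client(head_host, port=10001):
    # Ray Client: the driver stays on the local machine and talks to the
    # cluster interactively over the ray:// protocol.
    import ray

    return ray.init(
        f"ray://{head_host}:{port}",
        runtime_env=build_runtime_env(),
    )
```

With Ray Jobs the submission returns an ID that can be used to check status and logs later, whereas with Ray Client the session ends when the local process disconnects.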