- High: It blocks me from completing my task.
Hello,
I am launching a multi-replica Ray cluster on Kubernetes to support a Ray Tune job. Specifically, I am using TorchTrainer to wrap a Lightning module.
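Roughly, my setup looks like this (a simplified sketch; the Lightning training loop is elided, and the worker count and hyperparameter are just examples):

```python
from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Builds my LightningModule / pl.Trainer and calls fit() (elided here).
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # example: 2 workers
)
tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},  # example hyperparameter
)
tuner.fit()
```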
Currently, my Lightning module downloads my training data from an S3 bucket into a directory named after the worker_id of the Ray worker. However, this means I am using up extra disk space by downloading a separate copy of the dataset for each Ray worker on the node.
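For reference, the relevant part of my module looks roughly like this (simplified; `download_s3_prefix` is a stand-in for my actual S3 download helper, and I'm using `ray.train.get_context().get_world_rank()` as the worker id):

```python
import os

import lightning.pytorch as pl
from ray import train


def download_s3_prefix(s3_uri: str, dest_dir: str) -> None:
    """Hypothetical stand-in for my actual S3 download helper (e.g. boto3)."""
    ...


class MyLightningModule(pl.LightningModule):
    def prepare_data(self):
        # Today: every Ray worker downloads its own copy of the dataset
        # into a directory named after its worker rank.
        worker_id = train.get_context().get_world_rank()
        data_dir = f"/tmp/data-worker-{worker_id}"
        os.makedirs(data_dir, exist_ok=True)
        download_s3_prefix("s3://my-bucket/train/", data_dir)
        self.data_dir = data_dir
```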
I am wondering what the best practices are for this scenario. Is there some sort of “pre-Tune” hook I can run on each Kubernetes replica that downloads the dataset into one directory all my trials can access? Something like the sketch below is what I'm imagining.
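In other words, I'd love a per-node, download-once setup, where the first process on a node downloads and everyone else waits on a lock and reuses the same directory. A minimal sketch of what I mean (assuming the `filelock` package; `download_s3_prefix` is the same hypothetical helper as above):

```python
import os

from filelock import FileLock

SHARED_DIR = "/tmp/shared-dataset"  # one directory per Kubernetes replica/node


def download_s3_prefix(s3_uri: str, dest_dir: str) -> None:
    """Hypothetical S3 download helper, same as the sketch above."""
    ...


def ensure_dataset_downloaded() -> str:
    # Only one process on the node downloads; the rest block on the lock,
    # then see the marker file and skip the download.
    os.makedirs(SHARED_DIR, exist_ok=True)
    done_marker = os.path.join(SHARED_DIR, ".download_complete")
    with FileLock(os.path.join(SHARED_DIR, ".lock")):
        if not os.path.exists(done_marker):
            download_s3_prefix("s3://my-bucket/train/", SHARED_DIR)
            open(done_marker, "w").close()
    return SHARED_DIR
```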
Appreciate the help!