Hello Ray Community
I am working on a machine learning project that involves training large-scale models, and I have decided to use Ray to manage the distributed computing side of things. I have read through the Ray documentation, but I have run into some problems and have a few questions where the community's experience would be very helpful.
Here is an overview of my project:
ML Frameworks: TensorFlow and PyTorch for different parts of the project.
Dataset Size: approximately 1 TB of training data.
Cluster Setup: AWS EC2 instances with a mix of CPU and GPU nodes.
Ray Version: 2.0.0
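To make the questions below concrete, here is a trimmed-down sketch of the kind of training entry point I have in mind for the PyTorch part. The training loop body, worker count, and config values are placeholders rather than my actual code:

```python
import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Placeholder loop: in the real project this builds the model, wraps it
    # with ray.train.torch.prepare_model, and iterates over shards of the
    # ~1 TB dataset on each worker.
    for epoch in range(config["num_epochs"]):
        session.report({"epoch": epoch, "loss": 0.0})  # dummy metric

ray.init(address="auto")  # connect to the existing EC2 cluster

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 10},  # placeholder value
    scaling_config=ScalingConfig(
        num_workers=4,   # placeholder; would map to the GPU nodes
        use_gpu=True,
    ),
)
result = trainer.fit()
```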
My questions:

1. Cluster configuration: What are the best practices for configuring a Ray cluster for large-scale ML model training? I would appreciate advice on optimal resource allocation and node types.
2. Fault tolerance: How can I ensure fault tolerance during extended training sessions?
3. Hyperparameter optimization: I am interested in using Ray Tune for this. What are the most important things to keep in mind when organizing and running large-scale tuning experiments? (A rough sketch of the Tune setup I have in mind is included below.)
4. Performance: What tactics can I use to improve Ray's distributed training performance, and which Ray settings or features are best suited for machine learning workloads?
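For question 3, this is roughly the direction I was planning to take with Tune. The objective function, search space, trial count, and per-trial resources are placeholders, not my real experiment:

```python
from ray import tune

def objective(config):
    # Placeholder objective: the real trainable would train a model with the
    # sampled hyperparameters and report validation metrics.
    score = config["lr"] * config["batch_size"]  # dummy computation
    tune.report(score=score)

analysis = tune.run(
    objective,
    config={
        "lr": tune.loguniform(1e-5, 1e-1),        # placeholder search space
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=50,                               # placeholder trial count
    resources_per_trial={"cpu": 2, "gpu": 1},     # per-trial allocation on GPU nodes
    metric="score",
    mode="max",
    max_failures=3,  # retry failed trials; related to my fault-tolerance question
)
print(analysis.best_config)
```

In particular, I am not sure whether setting max_failures like this is enough for long-running experiments, or whether I should be relying on checkpointing and resuming instead.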
Thank you in advance for your help. I am looking forward to learning from the community's experience.
Best regards,
Esme