Hello Ray Community!
I am working on a machine learning project that involves training large-scale models, and I have decided to use Ray to manage the distributed computing aspects. I have read through the Ray documentation, but I am running into some problems and have a few questions where the community's knowledge would be very helpful.
Here is an overview of my project:
- ML frameworks: TensorFlow and PyTorch for different parts of the project
- Dataset size: approximately 1 TB of training data
- Cluster setup: AWS EC2 instances with a mix of CPU and GPU nodes (a short snippet showing how I connect to the cluster is just below this list)
- Ray version: 2.0.0
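For context, this is roughly how I am connecting to the cluster from the head node at the moment. It is only a minimal sketch; the resource numbers in the comment come from my small test runs, not the full cluster.

```python
import ray

# Connect to the running Ray cluster started with `ray up` / `ray start`;
# address="auto" makes the driver attach to the existing head node.
ray.init(address="auto")

# Sanity-check what the cluster currently exposes before launching training.
# On my small test setup this prints something like {'CPU': 16.0, 'GPU': 1.0, ...}.
print(ray.cluster_resources())
print(ray.available_resources())
```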
My questions:
1. What are the best practices for configuring a Ray cluster for large-scale ML model training? I would appreciate advice on optimal resource allocation and node types.
2. How can I ensure fault tolerance throughout extended training sessions?
3. I am interested in using Ray Tune for hyperparameter optimization. What are the most important things to keep in mind when organizing and running large-scale tuning experiments? (A rough sketch of what I have tried so far is below this list.)
4. What tactics can be used to improve Ray's distributed training performance? Are there specific Ray settings or features that are best suited for machine learning workloads?
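To make the Ray Tune question more concrete, this is the kind of minimal setup I have been experimenting with. It is just a sketch: the trainable, the search space, and the resource numbers are placeholders rather than my real training code.

```python
from ray import tune

def trainable(config):
    # Placeholder training loop; the real code trains a TensorFlow/PyTorch model.
    for step in range(10):
        dummy_loss = (config["lr"] * step - 1) ** 2
        tune.report(loss=dummy_loss)  # report a metric back to Tune each step

analysis = tune.run(
    trainable,
    config={
        "lr": tune.loguniform(1e-5, 1e-1),        # placeholder search space
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=20,                               # placeholder trial count
    metric="loss",
    mode="min",
    resources_per_trial={"cpu": 2, "gpu": 1},     # placeholder per-trial allocation
)
print(analysis.best_config)
```

Is this a reasonable starting point for scaling up, or should I structure the experiments differently?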
Thank you in advance for your help. I look forward to learning from the community's experience.
Best regards,
Esme