Advice on Using Ray for Long-Running Machine Learning Model Training

Hello Ray Community :slightly_smiling_face:

I am working on a machine learning project that involves training large-scale models, and I have decided to use Ray to manage the distributed computing aspects. I have read through the Ray documentation, but I am running into some problems and have a few questions where the community's experience would be helpful.

Here is an overview of my project:
  • ML frameworks: TensorFlow and PyTorch for different parts of the project
  • Dataset size: approximately 1 TB of training data
  • Cluster setup: AWS EC2 instances with a mix of CPU and GPU nodes
  • Ray version: 2.0.0

My main questions:
  • What are the best practices for configuring a Ray cluster for large-scale ML model training? I would appreciate advice on optimal resource allocation and node types.
  • How can I ensure fault tolerance throughout extended training sessions?
  • I am interested in using Ray Tune for hyperparameter optimization. What are the most important things to keep in mind when organizing and running large-scale tuning experiments?
  • What tactics can be used to improve Ray's distributed training performance? Which Ray settings or features are best suited for machine learning workloads? :thinking:

Thank you in advance for your help. I want to learn from the community's experiences.
Best regards,
Esme

TPM @ Anyscale here

Couple of things:

  • What type is your data: structured or unstructured, images, text? And in what format (Parquet, pandas DataFrame, etc.)?
  • EC2 setup is good; can you share what GPU types you're using?
  • Ray 2.0.0 is close to two years old at this point; any reason why you're not using the latest Ray?
  • For right-sizing your resources, see here: Configuring Scale and GPUs — Ray 2.32.0
  • For tuning, start here: User Guides — Ray 2.32.0 (tons of good user guides; we recently did a refresh of everything). There's a minimal sketch after this list showing how the scaling and tuning pieces fit together.
  • Performance is a very loaded question. I would start with one of the guides and, as you ramp up, let us know what specific questions come up (e.g., you're finding that startup time on additional trials is slow for a given worker type); we can troubleshoot from there.
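
To make the right-sizing and tuning pointers above more concrete, here is a minimal sketch of what a Ray Train + Ray Tune setup can look like on a recent Ray release (it uses the 2.x `ray.train` / `ray.tune` APIs referenced in the docs above, not the older 2.0.0 ones). The hyperparameter name, metric, worker counts, and retry/checkpoint numbers are placeholders for illustration, not recommendations for your specific workload.

```python
import ray.train
from ray import tune
from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Placeholder training loop; swap in your real PyTorch model and data here.
    lr = config["lr"]  # hyperparameter sampled by Tune (illustrative name)
    for epoch in range(2):
        # ... run one epoch of training ...
        ray.train.report({"loss": 1.0 / (lr * (epoch + 1))})  # dummy metric


# ScalingConfig is where you right-size workers and GPUs per trial.
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    # Hyperparameters passed through to each worker's `config`.
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
    tune_config=tune.TuneConfig(metric="loss", mode="min", num_samples=8),
    run_config=RunConfig(
        # Retry failed trials and keep recent checkpoints for long runs.
        failure_config=FailureConfig(max_failures=3),
        checkpoint_config=CheckpointConfig(num_to_keep=2),
    ),
)
results = tuner.fit()
print(results.get_best_result().config)
```

The same ScalingConfig / RunConfig knobs apply when you run the trainer on its own without Tune, so you can settle on your resource sizing first and then layer the sweep on top.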

If you're also having trouble getting the infrastructure underlying Ray configured correctly (although EC2 is pretty vanilla, and lots of people in the community have gotten this working), you can consider trialing the hosted Ray on Anyscale for dev/test first.

Just go to anyscale.com > click “Try for Free”.