Hello Ray Community
I am working on a machine learning project that involves training large-scale models, and I have decided to use Ray to manage the distributed computing side of things. I have read through the Ray documentation, but I have run into some problems and have a few questions where the community's experience would be very helpful.
Here is an overview of my project:
ML Frameworks: TensorFlow and PyTorch for different parts of the project.
Dataset Size: approximately 1 TB of training data.
Cluster Setup: AWS EC2 instances with a mix of CPU and GPU nodes.
Ray Version: 2.0.0
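To make the questions below concrete, here is a trimmed-down sketch of the kind of training entry point I have in mind for the PyTorch part. The training loop body, worker count, and config values are placeholders rather than my actual code:

```python
import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Placeholder loop: in the real project this builds the model, wraps it
    # with ray.train.torch.prepare_model, and iterates over shards of the
    # ~1 TB dataset on each worker.
    for epoch in range(config["num_epochs"]):
        session.report({"epoch": epoch, "loss": 0.0})  # dummy metric

ray.init(address="auto")  # connect to the existing EC2 cluster

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 10},  # placeholder value
    scaling_config=ScalingConfig(
        num_workers=4,   # placeholder; would map to the GPU nodes
        use_gpu=True,
    ),
)
result = trainer.fit()
```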
My questions:

1. Cluster configuration: What are the best practices for configuring a Ray cluster for large-scale ML model training? I would appreciate advice on optimal resource allocation and node types.
2. Fault tolerance: How can I ensure fault tolerance during extended training sessions?
3. Hyperparameter optimization: I am interested in using Ray Tune for this. What are the most important things to keep in mind when organizing and running large-scale tuning experiments? (A rough sketch of the Tune setup I have in mind is included below.)
4. Performance: What tactics can I use to improve Ray's distributed training performance, and which Ray settings or features are best suited for machine learning workloads?
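For question 3, this is roughly the direction I was planning to take with Tune. The objective function, search space, trial count, and per-trial resources are placeholders, not my real experiment:

```python
from ray import tune

def objective(config):
    # Placeholder objective: the real trainable would train a model with the
    # sampled hyperparameters and report validation metrics.
    score = config["lr"] * config["batch_size"]  # dummy computation
    tune.report(score=score)

analysis = tune.run(
    objective,
    config={
        "lr": tune.loguniform(1e-5, 1e-1),        # placeholder search space
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=50,                               # placeholder trial count
    resources_per_trial={"cpu": 2, "gpu": 1},     # per-trial allocation on GPU nodes
    metric="score",
    mode="max",
    max_failures=3,  # retry failed trials; related to my fault-tolerance question
)
print(analysis.best_config)
```

In particular, I am not sure whether setting max_failures like this is enough for long-running experiments, or whether I should be relying on checkpointing and resuming instead.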
Thank you in advance for your help. I am looking forward to learning from the community's experience.
Best regards,
Esme