For questions on distributed model training: use Ray Train to scale a model's training process across CPUs, GPUs, or multiple nodes.
Ray Train is a scalable machine learning library for distributed training and fine-tuning.
Ray Train allows you to scale model training code from a single machine to a cluster of machines in the cloud, and abstracts away the complexities of distributed computing. Whether you have large models or large datasets, Ray Train is the simplest solution for distributed training.
Ray Train provides support for many frameworks:
| PyTorch Ecosystem | More Frameworks |
|---|---|
| PyTorch | TensorFlow |
| PyTorch Lightning | Keras |
| Hugging Face Transformers | Horovod |
| Hugging Face Accelerate | XGBoost |
| DeepSpeed | LightGBM |
Documentation: