Anybody managed to use Ray Train in a SageMaker Training cluster?

Lacruche · December 30, 2021, 2:02pm

Amazon SageMaker Training lets you run custom Python or Docker-based code in a managed, transient cluster of N EC2 machines. I’m trying to use Ray Train in such a SageMaker Training cluster. I’m looking at instructions here

To avoid reinventing the wheel: has anybody managed to run Ray Train in a SageMaker-managed EC2 cluster? I’m particularly interested in PyTorch data-parallel training.
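To make the question concrete, here is a minimal sketch of the bootstrap step I imagine each container would need. It assumes SageMaker's `SM_HOSTS` (a JSON list of host names) and `SM_CURRENT_HOST` environment variables are set in the training container, that the instances can reach each other on Ray's default port 6379, and that the first host in sorted order acts as the Ray head. The helper name `ray_start_command` is hypothetical, and I have not verified this in an actual SageMaker job.

```python
import json
import os

RAY_PORT = 6379  # Ray's default head port; assumed reachable between instances


def ray_start_command(env=os.environ):
    """Build the `ray start` command for this node from SageMaker env vars.

    Hypothetical helper: the head-election rule (first host in sorted
    order) is an assumption, not something SageMaker or Ray prescribes.
    """
    hosts = sorted(json.loads(env["SM_HOSTS"]))  # e.g. ["algo-1", "algo-2"]
    current = env["SM_CURRENT_HOST"]
    head = hosts[0]
    if current == head:
        # Head node: start the Ray head process on the agreed port.
        return ["ray", "start", "--head", f"--port={RAY_PORT}"]
    # Worker node: join the cluster through the head's address.
    return ["ray", "start", f"--address={head}:{RAY_PORT}"]
```

Each container's entry point would presumably run the returned command (for example with `subprocess.run`), after which the head node could launch the Ray Train script against the running cluster. Worker containers would also need to stay alive for the duration of training, which this sketch does not handle.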
