Amazon SageMaker Training lets you run custom Python or Docker-based code in a managed, transient cluster of N EC2 machines. I’m trying to use Ray Train in such a SageMaker Training cluster. I’m looking at instructions here
To avoid possibly re-inventing the wheel: has anybody managed to run Ray Train in a SageMaker-managed EC2 cluster? I’m particularly interested in PyTorch data-parallel training.
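
For context, the piece I expect to need is bootstrapping the Ray cluster from inside the SageMaker job. Below is a minimal sketch of that step, not a tested setup. It assumes the container exposes the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables (set by the SageMaker training toolkit), that the alphabetically first host can act as the Ray head, and that port 6379 is reachable between nodes. `ray_start_command` is a hypothetical helper name, not part of Ray or SageMaker.

```python
import json
import os


def ray_start_command(env=None, port=6379):
    """Return (is_head, argv) describing this node's `ray start` call.

    Assumption: SM_HOSTS is a JSON list of hostnames and SM_CURRENT_HOST
    is this node's hostname, as provided by the SageMaker training toolkit.
    """
    env = os.environ if env is None else env
    hosts = sorted(json.loads(env["SM_HOSTS"]))
    current = env["SM_CURRENT_HOST"]
    head = hosts[0]  # assumption: first host (sorted) acts as the Ray head

    if current == head:
        # Head node: start the Ray head process on the chosen port.
        return True, ["ray", "start", "--head", f"--port={port}"]
    # Worker node: join the head and block so the container stays alive.
    return False, ["ray", "start", f"--address={head}:{port}", "--block"]


if __name__ == "__main__" and "SM_HOSTS" in os.environ:
    is_head, argv = ray_start_command()
    print("head" if is_head else "worker", " ".join(argv))
```

The idea would be to run the returned command with `subprocess` in the entry point on every node, then launch the Ray Train script only on the head. Workers may also need to retry until the head is up, which this sketch does not handle.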