It seems likely that direct, official integration of Megatron-Core’s tensor/pipeline parallelism with Ray Train v2 is not currently supported. Ray Train v2 primarily supports data parallelism (DDP/FSDP) and does not natively handle the multi-dimensional rank and process group requirements of Megatron-Core, nor its custom data loading patterns. There is an open feature request for NeMo Megatron strategy integration, but as of now, this is not implemented in Ray Train v2 (Ray GitHub Issue #51387).
My understanding is that the practical approach is to use Ray for cluster orchestration (resource allocation, job launching, monitoring) while relying on Megatron-Core's own distributed launcher and training scripts when you need full tensor/pipeline parallelism. In this split, Ray manages the cluster and launches jobs, but Megatron handles all intra-job parallelism and process group setup, which Ray Train v2's orchestration is not designed to cover out of the box (Ray GitHub Issue #51387).
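One way this split can look in practice is a small helper that builds the per-node launch command a Ray task would execute. This is a minimal sketch, not an official integration: the script name `pretrain_gpt.py` and the parallelism flags are illustrative Megatron-LM arguments, and the surrounding Ray task that would run this command via `subprocess` is assumed, not shown.

```python
def build_torchrun_cmd(node_rank, nnodes, nproc_per_node,
                       master_addr, master_port, script, script_args=()):
    """Build the per-node command a Ray task would exec via subprocess.

    Megatron's own launcher (torchrun) performs all intra-job
    process-group setup; Ray only decides where the command runs.
    Script name and extra args are illustrative, not prescriptive.
    """
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]
    cmd.extend(script_args)
    return cmd
```

A Ray task pinned to each node (for example via a `STRICT_SPREAD` placement group) would call this with its own `node_rank` and hand the result to `subprocess.run`, leaving Megatron's launcher in full control of the training processes.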
A hybrid approach could involve using Ray Actors or Ray Core APIs to launch Megatron training scripts on each node, passing the correct environment variables and rank information while letting Megatron initialize its own process groups. Ray can still handle pre-processing, data sharding, or post-processing, but the training loop and parallelism logic stay inside Megatron. Keeping that boundary avoids conflicts between Ray's and Megatron's process group setup and data loading (Ray GitHub Issue #51387).
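The rank bookkeeping that such a per-node actor would perform can be sketched as follows. The environment variable names are the standard ones read by `torch.distributed` (and hence by Megatron's initialization); the helper functions themselves are hypothetical, and the actual Ray actor that sets these variables before spawning the worker processes is omitted.

```python
def worker_env(global_rank, world_size, master_addr, master_port, local_rank):
    """Environment a per-node Ray actor would set for one Megatron worker.

    Megatron calls torch.distributed.init_process_group() itself and
    reads these standard variables; Ray only supplies their values.
    """
    return {
        "RANK": str(global_rank),
        "LOCAL_RANK": str(local_rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def all_worker_envs(node_ips, gpus_per_node, master_port=29500):
    """One env dict per worker across the cluster.

    Assumes node_ips[0] hosts global rank 0 and serves as the
    rendezvous master; ranks are assigned node-major.
    """
    world_size = len(node_ips) * gpus_per_node
    envs = []
    for node_idx, _ip in enumerate(node_ips):
        for local_rank in range(gpus_per_node):
            global_rank = node_idx * gpus_per_node + local_rank
            envs.append(worker_env(global_rank, world_size,
                                   node_ips[0], master_port, local_rank))
    return envs
```

Because Ray only computes and injects these values, Megatron's multi-dimensional process groups (tensor, pipeline, data parallel) are still built entirely by Megatron from `RANK` and `WORLD_SIZE`, which is what keeps the two systems from stepping on each other.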