It seems likely that direct, official integration of Megatron-Core’s tensor/pipeline parallelism with Ray Train v2 is not currently supported. Ray Train v2 primarily supports data parallelism (DDP/FSDP) and does not natively handle the multi-dimensional rank and process group requirements of Megatron-Core, nor its custom data loading patterns. There is an open feature request for NeMo Megatron strategy integration, but as of now, this is not implemented in Ray Train v2 (Ray GitHub Issue #51387).
My understanding is that the practical approach is to use Ray for cluster orchestration (resource allocation, job launching, monitoring) while relying on Megatron-Core's own distributed launcher and training scripts when you need full tensor/pipeline parallelism. In this split, Ray manages the cluster and launches jobs, but Megatron handles all intra-job parallelism and process group setup, which Ray Train v2's orchestration is not designed to cover out of the box (Ray GitHub Issue #51387).
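One way this split can look in practice is a small helper that builds the per-node launch command a Ray task would execute. This is a minimal sketch, not an official integration: the script name `pretrain_gpt.py` and the parallelism flags are illustrative Megatron-LM arguments, and the surrounding Ray task that would run this command via `subprocess` is assumed, not shown.

```python
def build_torchrun_cmd(node_rank, nnodes, nproc_per_node,
                       master_addr, master_port, script, script_args=()):
    """Build the per-node command a Ray task would exec via subprocess.

    Megatron's own launcher (torchrun) performs all intra-job
    process-group setup; Ray only decides where the command runs.
    Script name and extra args are illustrative, not prescriptive.
    """
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]
    cmd.extend(script_args)
    return cmd
```

A Ray task pinned to each node (for example via a `STRICT_SPREAD` placement group) would call this with its own `node_rank` and hand the result to `subprocess.run`, leaving Megatron's launcher in full control of the training processes.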
A hybrid approach could involve using Ray Actors or Ray Core APIs to launch Megatron training scripts on each node, passing the correct environment variables and rank information while letting Megatron initialize its own process groups. Ray can still handle pre-processing, data sharding, or post-processing, but the training loop and parallelism logic stay inside Megatron. Keeping that boundary avoids conflicts between Ray's and Megatron's process group setup and data loading (Ray GitHub Issue #51387).
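The rank bookkeeping that such a per-node actor would perform can be sketched as follows. The environment variable names are the standard ones read by `torch.distributed` (and hence by Megatron's initialization); the helper functions themselves are hypothetical, and the actual Ray actor that sets these variables before spawning the worker processes is omitted.

```python
def worker_env(global_rank, world_size, master_addr, master_port, local_rank):
    """Environment a per-node Ray actor would set for one Megatron worker.

    Megatron calls torch.distributed.init_process_group() itself and
    reads these standard variables; Ray only supplies their values.
    """
    return {
        "RANK": str(global_rank),
        "LOCAL_RANK": str(local_rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def all_worker_envs(node_ips, gpus_per_node, master_port=29500):
    """One env dict per worker across the cluster.

    Assumes node_ips[0] hosts global rank 0 and serves as the
    rendezvous master; ranks are assigned node-major.
    """
    world_size = len(node_ips) * gpus_per_node
    envs = []
    for node_idx, _ip in enumerate(node_ips):
        for local_rank in range(gpus_per_node):
            global_rank = node_idx * gpus_per_node + local_rank
            envs.append(worker_env(global_rank, world_size,
                                   node_ips[0], master_port, local_rank))
    return envs
```

Because Ray only computes and injects these values, Megatron's multi-dimensional process groups (tensor, pipeline, data parallel) are still built entirely by Megatron from `RANK` and `WORLD_SIZE`, which is what keeps the two systems from stepping on each other.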