Init device mesh in PyTorch distributed

1. Severity of the issue: (select one)

  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.4
  • Python version: 3.10
  • OS: Ubuntu
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
    I am planning to use a torch device mesh when using Ray Train. However, Ray Train initializes dist.init_process_group() by default (ref). A device mesh is needed because FSDP2 requires one to customize sharding strategies. Is there a workaround for this? A minimal sketch of the setup is included below.
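
A minimal sketch of the setup described above, using the current Ray Train API names (import paths may differ slightly across Ray versions); the assert inside the worker just illustrates that the default process group already exists when the training function starts:

```python
import torch.distributed as dist
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Ray Train's TorchTrainer sets up the default process group before this
    # function runs, so the group already exists here.
    assert dist.is_initialized()
    # ... model setup and training loop; the question is how to also build a
    # DeviceMesh for FSDP2 at this point ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
trainer.fit()
```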

Hi, this is a cool use case. Can you share more about what you want to do with your sharding strategies?

FSDP2 expects a device_mesh to get the device placement and infer the sharding strategy. At the moment I’m looking at hybrid sharding. However, device_mesh doesn’t seem to be an option in Ray Train. It doesn’t seem to be a blocker, since init_device_mesh already checks whether dist.is_initialized() and doesn’t throw an error for duplicate initialization, so it can simply be called inside the training function (a sketch is below).
It would be great to give users the choice between init_device_mesh and init_process_group.
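
A minimal sketch of that workaround, assuming 8 GPU workers spread over 2 nodes with 4 GPUs each and FSDP2's fully_shard (whose import path varies by PyTorch version); the mesh shape, dimension names, and toy model are illustrative assumptions, not part of the Ray Train API:

```python
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
# FSDP2's fully_shard lives under torch.distributed.fsdp in recent releases.
from torch.distributed.fsdp import fully_shard

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Ray Train already initialized the default process group, and
    # init_device_mesh() reuses it instead of re-initializing.
    world_size = dist.get_world_size()
    gpus_per_node = 4  # illustrative assumption: 2 nodes x 4 GPUs

    # 2D mesh for hybrid sharding: replicate across nodes, shard within a node.
    mesh_2d = init_device_mesh(
        "cuda",
        (world_size // gpus_per_node, gpus_per_node),
        mesh_dim_names=("replicate", "shard"),
    )

    model = nn.Linear(1024, 1024).cuda()
    # FSDP2: passing a 2D mesh selects hybrid sharding (HSDP).
    fully_shard(model, mesh=mesh_2d)

    # ... optimizer setup and training loop ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
trainer.fit()
```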