Placement group support in RaySGD?

HuangLED · June 15, 2021, 9:17pm

I’ve read through the code a bit regarding placement groups in RaySGD, but didn’t find an answer.

Let us say I have a cluster with two machines, A and B, and I want to start 10 training processes on each machine. The learning relies on an in-memory data service, so we want to initialize the training processes on A or B with different parameters (something very simple, just a rank as an int).

Is this already supported in RaySGD? I’ve read this PR: [tune/placement group] dist. training placement group support by oliverhu · Pull Request #11934 · ray-project/ray · GitHub, as well as how TorchTrainer and TrainingOperators are implemented, but I cannot tell for sure.

Thanks a lot!

architkulkarni · June 17, 2021, 4:55am

cc’ing @rliaw, who reviewed the linked PR. Any thoughts about this?

rliaw · June 20, 2021, 7:46am

Hmm, this is mainly supported for RaySGD + Ray Tune. However, TorchTrainer doesn’t have support for placement groups yet.

This should go onto our backlog; let me know if you would like us to prioritize it (or would be open to contributing this)! If interested, happy to guide you through the implementation.

HuangLED · June 25, 2021, 5:03pm

Sure thing. I’d be happy to contribute to that if it is something that my bandwidth can handle. Let me first get a rough idea of how much effort it takes.

I am thinking for the first step, we are gonna create something like this:

1. Define the number of workers for each placement group.
2. RaySGD takes this mapping of placement group → num_workers, then allocates workers accordingly.
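
To make the two steps concrete, here is a minimal sketch of the allocation in plain Python. It does not call any Ray or RaySGD API: `WorkerSpec` and `allocate_workers` are hypothetical names made up for illustration, and the keys `machine_A` and `machine_B` are assumed stand-ins for placement groups pinned to each machine.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WorkerSpec:
    """Where one training worker should run (hypothetical, not a RaySGD type)."""
    global_rank: int  # unique across all workers
    group: str        # name of the placement group the worker belongs to
    local_rank: int   # index of the worker within its group


def allocate_workers(workers_per_group: Dict[str, int]) -> List[WorkerSpec]:
    """Expand a {placement group -> num_workers} mapping into per-worker specs."""
    specs: List[WorkerSpec] = []
    for group, num_workers in workers_per_group.items():
        if num_workers < 0:
            raise ValueError(f"num_workers for {group!r} must be >= 0")
        for local_rank in range(num_workers):
            # The global rank is simply the running count of workers so far.
            specs.append(WorkerSpec(len(specs), group, local_rank))
    return specs


# Assumed example: 10 workers on each of the two machines.
specs = allocate_workers({"machine_A": 10, "machine_B": 10})
for spec in specs:
    print(spec.global_rank, spec.group, spec.local_rank)
```

Each worker could then be started inside its group and handed its `global_rank` (or `local_rank`) as the int parameter for the data service. How that maps onto the actual worker startup in TorchTrainer is the part I still need to figure out.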

Does this align with what you guys are already thinking?

I’m not familiar with how RaySGD works under the hood yet, so I will read a bit more into it. Any code pointers (e.g. similar functionality that already exists in other modules, or tests that would be affected) or docs are much appreciated.

rliaw · June 25, 2021, 6:15pm

Actually, I think it makes sense for a Ray-side maintainer to implement this. Will cc you on the review!

Could you help post a GitHub issue so we can keep track of it?

HuangLED · June 25, 2021, 11:44pm

Done, I created the issue: [raysgd] Placement group support in RaySGD · Issue #16682 · ray-project/ray · GitHub

Thanks, Richard!
