Placement group support in RaySGD?

I’ve read through the code a bit looking for placement group support in RaySGD, but didn’t find an answer.

Let us say I have a cluster with two machines, A and B. I want to start 10 training processes on each machine. The training relies on an in-memory data service, so we want to initialize the training processes on A and on B with different parameters (very simple, just a rank as an int).
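To make the setup concrete, here is a minimal sketch in plain Python of the parameters each training process would ideally receive. The names `WorkerParams` and `build_worker_params` are hypothetical and only illustrate the idea; they are not RaySGD APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerParams:
    """Hypothetical per-process parameters; not part of RaySGD."""
    machine: str      # machine the process should run on ("A" or "B")
    local_rank: int   # rank of the process within its machine
    global_rank: int  # rank of the process across the whole job


def build_worker_params(machines: List[str],
                        procs_per_machine: int) -> List[WorkerParams]:
    """Enumerate one WorkerParams per training process, machine by machine."""
    params = []
    for machine_index, machine in enumerate(machines):
        for local_rank in range(procs_per_machine):
            global_rank = machine_index * procs_per_machine + local_rank
            params.append(WorkerParams(machine, local_rank, global_rank))
    return params


# Two machines with 10 training processes each, as described above.
params = build_worker_params(["A", "B"], procs_per_machine=10)
print(len(params))   # 20
print(params[10])    # first process on machine B
```

The open question is how to guarantee that the process holding a given `WorkerParams` actually lands on the intended machine.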

Is this already supported in RaySGD? I’ve read this PR: [tune/placement group] dist. training placement group support by oliverhu · Pull Request #11934 · ray-project/ray · GitHub, and looked at how TorchTrainer and TrainingOperators are implemented, but I cannot tell for sure.

Thanks a lot!

cc’ing @rliaw, who reviewed the linked PR. Any thoughts about this?

Hmm, this is mainly supported for RaySGD + Ray Tune. However, TorchTrainer doesn’t have support for placement groups yet.

This should go onto our backlog; let me know if you would like us to prioritize it (or would be open to contributing this)! If interested, happy to guide you through the implementation.

Sure thing. I’d be happy to contribute to that if it is something my bandwidth can handle. Let me first get a rough idea of how much effort it takes.

For the first step, I am thinking we would build something like this:

  1. Define the number of workers for each placement group.
  2. RaySGD takes this mapping of placement group → num_workers, then allocates workers accordingly.
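The two steps above can be sketched in plain Python as follows. This is only an illustration of the proposed mapping, under the assumption that each worker needs one CPU; `allocate_workers` and `bundles_for` are hypothetical helpers and do not exist in RaySGD. The bundle lists have the shape that `ray.util.placement_group(bundles, strategy=...)` accepts, but no Ray call is made here.

```python
from typing import Dict, List, Tuple


def bundles_for(num_workers: int, cpus_per_worker: int = 1) -> List[Dict[str, int]]:
    """Build one resource bundle per worker for a single placement group.

    Assumption: a list like this would be passed to
    ray.util.placement_group(bundles, strategy="STRICT_PACK") so that all
    workers of the group end up on the same node.
    """
    return [{"CPU": cpus_per_worker} for _ in range(num_workers)]


def allocate_workers(workers_per_group: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Turn a mapping of group name -> num_workers into worker assignments.

    Returns one (global_rank, group_name, bundle_index) tuple per worker.
    """
    assignments = []
    global_rank = 0
    for group_name, num_workers in workers_per_group.items():
        if num_workers < 0:
            raise ValueError(f"num_workers for {group_name!r} must be >= 0")
        for bundle_index in range(num_workers):
            assignments.append((global_rank, group_name, bundle_index))
            global_rank += 1
    return assignments


# Step 1: define the number of workers for each placement group.
workers_per_group = {"group_A": 10, "group_B": 10}

# Step 2: derive bundles and per-worker assignments from that mapping.
group_bundles = {name: bundles_for(n) for name, n in workers_per_group.items()}
assignments = allocate_workers(workers_per_group)

print(len(group_bundles["group_A"]))  # 10
print(assignments[10])                # (10, 'group_B', 0)
```

Each tuple says which placement group and bundle a worker would be scheduled into, and the global rank could double as the int parameter that the in-memory data service needs.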

Does this align with what you are already thinking?

I’m not familiar with how RaySGD works under the hood yet, so I will read a bit more into it. Any code pointers (e.g. similar functionality that already exists in other modules, or tests that would be affected) or docs are much appreciated.

Actually, I think it makes sense for a Ray-side maintainer to implement this. Will cc you on the review!

Could you post a GitHub issue so we can keep track of it?

Done creating the issue. [raysgd] Placement group support in RaySGD · Issue #16682 · ray-project/ray · GitHub

Thanks, Richard!