I’ve read into the code a bit regarding placement groups in RaySGD, but didn’t find an answer.
Let’s say I have a cluster with two machines, A and B, and I want to start 10 training processes on each machine. The training relies on an in-memory data service, so we want to initialize the training processes on A and B with different parameters (something very simple, just a rank as an int).
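Roughly, this is the behavior I’m after, sketched with plain Ray actors and placement groups rather than RaySGD (the actor name `TrainingWorker` is just illustrative, and I’m assuming the Ray ~1.x `placement_group` / `placement_group_bundle_index` actor options):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")


@ray.remote(num_cpus=1)
class TrainingWorker:
    def __init__(self, rank: int):
        # The rank decides which partition of the in-memory data service
        # this worker talks to.
        self.rank = rank

    def get_rank(self) -> int:
        return self.rank


workers = []
for machine_idx in range(2):  # two machines, A and B
    # 10 one-CPU bundles; STRICT_PACK forces all of them onto a single node.
    pg = placement_group([{"CPU": 1}] * 10, strategy="STRICT_PACK")
    ray.get(pg.ready())
    for rank in range(10):
        workers.append(
            TrainingWorker.options(
                placement_group=pg,
                placement_group_bundle_index=rank,
            ).remote(rank)
        )

print(ray.get([w.get_rank.remote() for w in workers]))
```

What I’d like is for TorchTrainer to do this kind of placement for me instead of wiring it up by hand.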
Hmm, placement groups are mainly supported when RaySGD is used through Ray Tune. However, the standalone TorchTrainer doesn’t have support for placement groups yet.
This should go onto our backlog; let me know if you would like us to prioritize it (or would be open to contributing this)! If interested, happy to guide you through the implementation.
Sure thing. I’d be happy to contribute if it’s something my bandwidth can handle. Let me first get a rough idea of how much effort it would take.
I’m thinking that, as a first step, we could create something like this (rough sketch below):

- define the number of workers for each placement group
- RaySGD takes this mapping of placement group → num_workers and allocates workers accordingly
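Here is a minimal sketch of what I have in mind, assuming a new (currently non-existent) TorchTrainer argument, here called `placement_group_workers`, that carries the placement group → num_workers mapping; the argument name and shape are entirely up for discussion:

```python
# Hypothetical API sketch only -- placement_group_workers does NOT exist in
# TorchTrainer today; it is the new argument being proposed here.
from ray.util.placement_group import placement_group
from ray.util.sgd.torch import TorchTrainer, TrainingOperator


class MyTrainingOperator(TrainingOperator):
    # Placeholder; the real operator would set up model/optimizer and read
    # its data partition based on the worker's rank.
    pass


# One STRICT_PACK group per machine so each group's workers share a node.
pg_a = placement_group([{"CPU": 1}] * 10, strategy="STRICT_PACK")
pg_b = placement_group([{"CPU": 1}] * 10, strategy="STRICT_PACK")

trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,
    # Proposed: RaySGD schedules 10 worker actors into each placement group,
    # passing each worker its rank so it can reach the right data partition.
    placement_group_workers={pg_a: 10, pg_b: 10},
)
```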
Does this align with what you are already thinking?
I’m not familiar with how RaySGD works under the hood yet, so I’ll read a bit more into it. Any code pointers (e.g. similar functionality that already exists in other modules, tests that would be affected, etc.) or docs would be much appreciated.