Distributed torch model training with Ray Core APIs

Hi there!
I am currently implementing my own project with PyTorch and Ray, and I wonder how I can distribute my model across different Ray nodes. I want it to work similarly to torch's DistributedDataParallel, which launches multiple processes and distributes the workload over physical hardware in a single-program-multiple-data fashion, for example one CPU and one GPU per process.
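For reference, the DistributedDataParallel pattern I am trying to mimic looks roughly like this (a toy CPU-only example with a stand-in model, just to illustrate the structure):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per device; DDP all-reduces gradients across
    # processes automatically during backward().
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(8, 1))  # stand-in for my real model
    loss = model(torch.randn(4, 8)).sum()
    loss.backward()  # gradient synchronization happens here
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)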

My setting is: say I have Ray actors, each of which can produce its own training data via simulation (already achieved):

@ray.remote(num_cpus=1, num_gpus=0.5)
class MyActor:
        ...

I want to instantiate 10 such actors to do simulation and distribute my model to train on them in parallel. After each actor trains for, say, 5 epochs, I synchronize their gradients and re-distribute the synced model, and the workflow repeats.

I know there is a Ray Train library out there that might help to achieve this goal, but I want to achieve this with pure Ray Core APIs somehow (mainly because I do not want to refactor my code that much). Is there any way to do this? Any suggestion is welcome! Thanks in advance :slight_smile:

@noah822 Have you considered using Ray Train (Ray Train: Scalable Model Training — Ray 2.7.1)?

Yes, you can achieve this with Ray’s Actor API. You can define your model as an actor and then instantiate multiple actors to train your model in parallel. Here is a simplified example:

import ray

@ray.remote(num_cpus=1, num_gpus=1)
class ModelActor:
    def __init__(self, model):
        self.model = model

    def train(self, data):
        # Your training logic here
        return gradients

# Initialize Ray
ray.init()

# Create your model
model = ...

# Create actors
actors = [ModelActor.remote(model) for _ in range(10)]

# Train model on each actor
for _ in range(5):  # 5 epochs
    futures = [actor.train.remote(data) for actor in actors]
    gradients = ray.get(futures)
    # Synchronize gradients and update model

In this example, ModelActor is an actor class that wraps your model. Each actor runs in a separate process and can use one CPU and one GPU, as specified in the ray.remote decorator. The train method is where you put your training logic. After each epoch, you collect the gradients from each actor, synchronize them, and update your model.
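To make the "Synchronize gradients and update model" step concrete, here is one possible sketch that continues the example above. It assumes train returns a list of per-parameter gradient tensors, and set_weights is a hypothetical method you would add to ModelActor for re-distributing the updated weights:

import torch

# Average each parameter's gradient across all actors.
avg_grads = [
    torch.stack([g[i] for g in gradients]).mean(dim=0)
    for i in range(len(gradients[0]))
]

# Apply the averaged gradients to the driver's copy of the model
# (plain SGD shown for illustration).
with torch.no_grad():
    for param, grad in zip(model.parameters(), avg_grads):
        param -= 0.01 * grad

# Re-broadcast the updated weights to every actor; set_weights is
# a hypothetical method on ModelActor.
state = model.state_dict()
ray.get([actor.set_weights.remote(state) for actor in actors])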

Please note that this is a simplified example and you might need to adjust it according to your specific needs. For example, you might need to handle the distribution and collection of training data, and the synchronization and distribution of the updated model.

For more details, you can refer to Ray's Actor API and the Ray Core API guide.

If you find it challenging to implement this with Ray Core APIs, you might want to consider using Ray Train, which provides a simple API for distributed training and handles many of the complexities for you.
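For comparison, a minimal Ray Train setup is quite short. A rough sketch (the training function body is a placeholder; see the Ray Train docs for the full API):

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # Ray Train launches this function on every worker inside a
    # DDP-style process group; write a normal PyTorch training loop
    # here and wrap the model with ray.train.torch.prepare_model.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=10, use_gpu=True),
)
result = trainer.fit()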

Thank you @zhz and @Hasnain_Fareed!
I am trying to implement the approach @Hasnain_Fareed suggested. Just want to make things a little bit clearer:

I notice that there is a master node that periodically collects gradients, aggregates them, and re-broadcasts the new model to the training nodes. Will the gradients reported by the training nodes go through the object store in RAM, or can they take advantage of the GPUs' NVLink directly for interprocess communication?

Also, can I handle the dependencies directly on the training nodes without introducing a master parameter server, and do something like

# store futures from all the training nodes
actor_0:
    ray.get([actor_0_future, actor_1_future, actor_2_future])
actor_1:
    ray.get([actor_0_future, actor_1_future, actor_2_future])
actor_2:
    ray.get([actor_0_future, actor_1_future, actor_2_future])

Specifically, each actor's future here refers to its local gradient. I want Ray to help me schedule the communication between the actors. Is this a valid approach?
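In more concrete terms, I imagine something like the sketch below, where compute_gradients stands in for my local training step and sync averages the peers' gradients (both method names are placeholders I made up):

import ray
import numpy as np

ray.init()

@ray.remote(num_cpus=1)
class TrainingActor:
    def compute_gradients(self):
        # Stand-in for local training: return a fake gradient.
        return np.random.randn(4)

    def sync(self, *all_gradients):
        # Ray resolves the peers' futures and ships the gradient
        # arrays to this actor before sync runs.
        avg = np.mean(all_gradients, axis=0)
        # ... apply avg to the local model ...
        return avg

actors = [TrainingActor.remote() for _ in range(3)]
grad_futures = [a.compute_gradients.remote() for a in actors]

# Hand every actor the full list of futures; Ray takes care of the
# scheduling and data movement between the actors.
ray.get([a.sync.remote(*grad_futures) for a in actors])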