How to divide data freely to worker?

Jack_Chen · November 3, 2022, 11:48am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

How to divide data freely to worker?
I have two models. In distributed training, data transfer is required. How can I ensure that the workers of the two models receive consistent data?

xwjiang2010 · November 3, 2022, 4:18pm

Are you referring to data parallel training?

If you are using torch, you can take a look at ray/torch_trainer.py on the master branch of the ray-project/ray GitHub repository. We have similar examples for other frameworks.

Jack_Chen · November 4, 2022, 2:34am

Can I assign a specific data block to a specific worker?

xwjiang2010 · November 4, 2022, 5:18pm

Are you planning to use Ray Dataset?
There is indeed logic in Ray Dataset that decides which dataset block goes to which worker for distributed training. And if the remaining blocks cannot be divided evenly among workers, some logic makes sure to further divide up some blocks into smaller ones and assign them evenly to workers.
However, this kind of logic is usually not important for end users to know about, so we abstract it away. From their perspective, they have a Ray Dataset and every worker gets an even shard of the data.
Do you mind telling us a bit more about your use case and why you would care about such details?
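
To make the idea of even sharding concrete, here is a minimal sketch in plain Python. It is not Ray's actual implementation: the function name `split_blocks_evenly` is hypothetical, blocks are modeled as plain lists of rows, and dropping rows that do not fit an even split is an assumption of this sketch.

```python
from typing import List, Sequence


def split_blocks_evenly(
    blocks: Sequence[Sequence[int]], num_workers: int
) -> List[List[List[int]]]:
    """Give every worker the same number of rows, splitting blocks when needed.

    Rows left over after an even split are dropped (assumption of this sketch).
    """
    total_rows = sum(len(block) for block in blocks)
    rows_per_worker = total_rows // num_workers
    shards: List[List[List[int]]] = [[] for _ in range(num_workers)]

    worker = 0
    remaining = rows_per_worker  # rows the current worker can still take
    for block in blocks:
        start = 0
        while start < len(block) and worker < num_workers:
            take = min(remaining, len(block) - start)
            if take > 0:
                # A block that straddles two workers is cut at the boundary.
                shards[worker].append(list(block[start:start + take]))
            start += take
            remaining -= take
            if remaining == 0:
                worker += 1
                remaining = rows_per_worker
    return shards


# Three uneven blocks, two workers: the second block is split between them.
example_blocks = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
example_shards = split_blocks_evenly(example_blocks, num_workers=2)
```

In this example each worker ends up with five rows, and only the middle block has to be divided.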

Jack_Chen · November 5, 2022, 2:45am

I want to use Ray Train for vertical federated learning, and the provider and promoter workers need to receive consistent data.

I have another question: can I use TorchTrainer for a model that implements gradient updates manually?

xwjiang2010 · November 7, 2022, 7:38pm

What do you mean by “consistent data”? My understanding is that the participants in VFL don’t need to share the sample space; rather, the feature space is split.
Could you describe the data consistency concern more concretely? For example, how do you plan to map provider/promoter workers to Ray actors, and what are the data requirements? A schematic would be helpful.
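
If “consistent data” means that both parties must process the same sample IDs in the same order while each holds different feature columns, one way to get there is a shared seed that fixes the batch order. The sketch below is plain Python under that assumption; `aligned_batches`, the feature dictionaries, and the sample IDs are all hypothetical and not part of any Ray API.

```python
import random
from typing import Iterator, List


def aligned_batches(
    sample_ids: List[str], batch_size: int, seed: int, epoch: int
) -> Iterator[List[str]]:
    """Yield batches of sample IDs in an order every party can reproduce."""
    ids = sorted(sample_ids)  # canonical order, independent of local storage order
    rng = random.Random(f"{seed}-{epoch}")  # same seed and epoch give the same shuffle
    rng.shuffle(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Hypothetical data: the two parties hold different features for the same samples.
provider_features = {"u3": [0.3], "u1": [0.1], "u2": [0.2], "u4": [0.4]}
promoter_features = {"u1": [10, 11], "u2": [20, 21], "u4": [40, 41], "u3": [30, 31]}

# Only samples known to both parties can be trained on.
shared_ids = sorted(set(provider_features) & set(promoter_features))

# Each party runs the same call independently and gets identical batches of IDs.
provider_batches = list(aligned_batches(shared_ids, batch_size=2, seed=42, epoch=0))
promoter_batches = list(aligned_batches(shared_ids, batch_size=2, seed=42, epoch=0))

# Each party then looks up its own feature columns for the agreed IDs.
provider_first = [provider_features[i] for i in provider_batches[0]]
promoter_first = [promoter_features[i] for i in promoter_batches[0]]
```

Changing `epoch` reshuffles the order for both parties at once, so they stay aligned across epochs.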

As for gradient updates, Ray Train doesn’t do any of the gradient/weight syncing; that is all handled by torch. Again, I would be curious to learn why you want to do that manually.
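
For reference, this is roughly what gradient syncing amounts to in data-parallel training: each worker computes a gradient on its own shard, the gradients are averaged, and every worker applies the same update. The sketch below is framework-free Python on a toy linear model, not torch or Ray Train code; the names `local_gradient` and `train` and the toy data are assumptions for illustration.

```python
from typing import List, Tuple

Shard = List[Tuple[float, float]]  # (x, y) pairs held by one worker


def local_gradient(w: float, b: float, shard: Shard) -> Tuple[float, float]:
    """Gradient of the mean squared error of y = w * x + b on one shard."""
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b


def train(shards: List[Shard], lr: float = 0.05, steps: int = 1000) -> Tuple[float, float]:
    """Manual update loop: average per-worker gradients, then take an SGD step."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grads = [local_gradient(w, b, shard) for shard in shards]
        # Averaging stands in for the all-reduce a distributed backend would run.
        grad_w = sum(g[0] for g in grads) / len(grads)
        grad_b = sum(g[1] for g in grads) / len(grads)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data from y = 3x + 1, split across two simulated workers.
data = [(x, 3.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
w_fit, b_fit = train([data[:2], data[2:]])
```

With equal-sized shards, the averaged gradient equals the gradient over the full dataset, so the result matches single-worker training.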

jovany-wang · March 11, 2023, 5:14am

Hi @Jack_Chen, I understand your pain points in the federated learning scenario, such as splitting your models across different parties.

We’re working on supporting federated learning on top of Ray, in the RayFed repo. Unfortunately, RayFed hasn’t ported the Ray Train APIs.
If you have requirements for it or you’re interested in it, please let me know.

Thanks
Qing

Jules_Damji · March 14, 2023, 5:56pm

@Jack_Chen Do you still have unresolved questions? As @xwjiang2010 suggested, Ray Data and Ray Train are both higher-level abstractions, so you don’t have to do every bit manually.

121onto · April 11, 2024, 3:13am

I would like to have control over how data is sent to workers. For example, if I want to train an autoregressive sequence model (not an LLM), it would be very memory inefficient to create a dataset where each row represents a record for training.
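
To illustrate the memory concern, here is a minimal sketch in plain Python: each sequence is stored once, whole sequences are assigned to workers, and the (context, target) training records are generated lazily instead of being materialized as rows. The names `sequences_for_worker` and `iter_windows`, the round-robin assignment, and the toy data are assumptions for illustration, not Ray APIs.

```python
from typing import Iterator, List, Sequence, Tuple


def sequences_for_worker(
    sequences: Sequence[List[int]], rank: int, world_size: int
) -> List[List[int]]:
    """Assign whole sequences to workers round-robin (assumption of this sketch)."""
    return list(sequences[rank::world_size])


def iter_windows(
    sequences: Sequence[List[int]], context_len: int
) -> Iterator[Tuple[List[int], int]]:
    """Lazily yield (context, next_value) pairs without storing one row per record."""
    for seq in sequences:
        for end in range(context_len, len(seq)):
            yield seq[end - context_len:end], seq[end]


# Toy sequences; a real dataset would hold far longer ones.
all_sequences = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12]]

# Worker 0 of 2 keeps the first and third sequences and expands them on the fly.
worker0_sequences = sequences_for_worker(all_sequences, rank=0, world_size=2)
worker0_records = list(iter_windows(worker0_sequences, context_len=2))
```

A sequence of length n with context length k produces n - k records, so storing the sequence once avoids keeping roughly k copies of every value.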
