How can I divide data freely among workers?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

There are two models. In distributed training, data has to be transferred to the workers. How can I ensure that the data received by the workers of the two models is consistent?

Are you referring to data parallel training?

If you are using torch, you can take a look at ray/torch_trainer.py at master · ray-project/ray · GitHub. Similarly we have examples for other frameworks.
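
Roughly, a data-parallel setup with TorchTrainer looks like the sketch below (Ray 2.x-style APIs; the model, data, and hyperparameters are just placeholders):

```python
# Minimal data-parallel sketch with Ray Train's TorchTrainer.
# The model, batches, and hyperparameters are placeholders.
import torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    # Each worker runs this function; prepare_model wraps the model in DDP
    # so gradients are synced across workers automatically.
    model = prepare_model(nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        x = torch.randn(32, 8)          # placeholder batch
        y = torch.randn(32, 1)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                  # DDP all-reduces the gradients here
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-2, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```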

Can I assign a specific data block to a specific worker?

Are you planning to use Ray Dataset?
There is indeed logic in Ray Dataset that decides which dataset block goes to which worker for distributed training. If the blocks cannot be divided evenly among the workers, additional logic splits some blocks into smaller ones so that they can still be assigned to the workers evenly.
However, this kind of logic usually isn't important for end users to know about, and we abstract it away from them. From their perspective, they have a Ray Dataset and every worker gets an even shard of the data.
Do you mind telling us a bit more about your use case and why you would care about such details?
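
That said, if you do want explicit control, you can split a Dataset yourself and hand each shard to a specific actor. A rough sketch (the Worker actor and its consume() method are made up for illustration):

```python
# Manually splitting a Ray Dataset into per-worker shards.
import ray

ray.init()

@ray.remote
class Worker:
    def consume(self, shard):
        # Iterate over (here: just count) this worker's shard only.
        return shard.count()

workers = [Worker.remote() for _ in range(4)]
ds = ray.data.range(1000)

# equal=True forces equally sized shards; locality_hints asks Ray Data to
# place each shard's blocks near the corresponding actor when possible.
shards = ds.split(len(workers), equal=True, locality_hints=workers)

counts = ray.get([w.consume.remote(s) for w, s in zip(workers, shards)])
print(counts)  # -> [250, 250, 250, 250]
```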

I want to use Ray Train for vertical federated learning, and the provider and promoter workers need to receive consistent data.

I have another question: can I use TorchTrainer with a model that implements its gradient updates manually?

What do you mean by “consistent data”? My understanding is that in VFL the participants don’t split the sample space; rather, the feature space is split among them.
Could you speak more concretely about the concern around data consistency? For example, how do you plan to map the provider/promoter workers to Ray actors, and what are the data requirements? A schematic would be helpful.

As for gradient updates, Ray Train doesn’t do any of the gradient/weight syncing itself; that is all handled by torch. Again, I would be curious to learn why you want to do that manually.
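
To be concrete: since you write train_loop_per_worker yourself, nothing stops you from doing the update by hand instead of using torch.optim. A sketch, with the same placeholder model and data as above:

```python
# Hand-rolled gradient updates inside a Ray Train training loop.
# DDP (added by prepare_model) still averages gradients across workers
# during backward(); only the optimizer step is done manually here.
import torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    model = prepare_model(nn.Linear(8, 1))
    lr = config["lr"]
    for _ in range(config["steps"]):
        x, y = torch.randn(32, 8), torch.randn(32, 1)  # placeholder batch
        loss = nn.functional.mse_loss(model(x), y)
        model.zero_grad()
        loss.backward()                    # gradients all-reduced here by DDP
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad           # manual SGD step, no torch.optim

TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-2, "steps": 10},
    scaling_config=ScalingConfig(num_workers=2),
).fit()
```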

Hi @Jack_Chen, I can understand your pain points in the federated learning scenario, such as splitting your models across different parties.

We’re working on supporting federated learning on top of Ray, in the RayFed repo. Unfortunately, RayFed hasn’t ported the Ray Train APIs yet.
If you have requirements for it or are interested in it, please let me know.

Thanks
Qing

@Jack_Chen Do you still have questions that have not been resolved? As @xwjiang2010 suggested, Ray Data and Ray Train are both higher-level abstractions, so you don’t have to handle all of the bits manually.

I would like to have control over how data is sent to workers. For example, if I want to train an autoregressive sequence model (not an LLM), it would be very memory-inefficient to create a dataset where each row represents a single training record.
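
Concretely, something like the following sketch is what I have in mind: keep whole sequences and cut training windows on the fly, instead of materializing one row per window (the class name and shapes are just illustrative):

```python
# Store whole sequences and slice (context, target) windows lazily, rather
# than materializing every overlapping window as its own dataset row.
import torch
from torch.utils.data import Dataset

class WindowedSequences(Dataset):
    """Yields (context, target) windows cut on the fly from long sequences."""

    def __init__(self, sequences, window):
        self.sequences = sequences        # list of 1-D tensors, one per series
        self.window = window
        # Index of (sequence id, start offset) pairs; cheap compared to
        # storing every window as a separate row.
        self.index = [
            (i, s)
            for i, seq in enumerate(sequences)
            for s in range(len(seq) - window)
        ]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, k):
        i, s = self.index[k]
        seq = self.sequences[i]
        return seq[s : s + self.window], seq[s + self.window]

ds = WindowedSequences([torch.arange(100.0), torch.arange(50.0)], window=8)
x, y = ds[0]   # first 8 steps of the first series, with the 9th as the target
```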