Running batches of data by multiple work process

pratap123 · April 4, 2022, 4:11pm

Hi,
I am new to Ray and have just started evaluating it for use in our project.

What would be the best way to process data in parallel, in batches, using remote worker processes?
For example, if I have a few hundred records read from a CSV file, they should be split into n batches and processed (by calling a remote function) in parallel by n workers.
I have come across Datasets, especially map_batches.
https://docs.ray.io/en/latest/data/key-concepts.html?highlight=map_batches#dataset-transforms

Would you please suggest whether this is the best approach, or whether there is an alternative?

Thanks in advance

Chen_Shen · April 5, 2022, 6:41am

hi @pratap123 yup Ray Datasets should be the idiomatic solution.
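
For comparison with the Datasets approach, the manual alternative asked about above (split the records into n batches and call a remote function on each) could be sketched with plain Ray tasks. This is only a sketch under assumptions: `split_into_batches`, `process_batch`, and `run_in_parallel` are hypothetical names, the per-batch work is a placeholder, and records are assumed to be plain Python dicts already loaded from the CSV file.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def split_into_batches(records: List[Record], n: int) -> List[List[Record]]:
    """Split records into at most n contiguous batches of near-equal size."""
    if n <= 0:
        raise ValueError("n must be positive")
    size, remainder = divmod(len(records), n)
    batches = []
    start = 0
    for i in range(n):
        # The first `remainder` batches get one extra record.
        end = start + size + (1 if i < remainder else 0)
        if end > start:  # skip empty batches when n > len(records)
            batches.append(records[start:end])
        start = end
    return batches


def process_batch(batch: List[Record]) -> int:
    """Placeholder per-batch work (hypothetical): count the records."""
    return len(batch)


def run_in_parallel(records: List[Record], n: int) -> List[int]:
    """Submit one Ray task per batch and wait for all results."""
    import ray  # imported here so the helpers above work without Ray

    ray.init(ignore_reinit_error=True)
    remote_process = ray.remote(process_batch)
    futures = [remote_process.remote(b) for b in split_into_batches(records, n)]
    return ray.get(futures)
```

Calling `run_in_parallel(records, n)` submits up to n tasks, which Ray schedules on available worker processes. The Datasets API handles the splitting and scheduling itself, which is why it is usually the more convenient choice.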

pratap123 · April 5, 2022, 11:11am

Thanks @Chen_Shen for the response. I tried it out, but I see it's running locally in a single process and not spawning ‘n’ worker processes.

Here is some sample code:

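class Test:
    # Callable class applied to each batch (placeholder body).
    def __init__(self):
        pass

    def __call__(self, x):
        return []

# ds is a Ray Dataset created earlier from the CSV data.
ds.map_batches(Test, batch_format="pandas", batch_size=2, compute="tasks")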

pratap123 · April 5, 2022, 3:07pm

@Chen_Shen I have also tried the iter_batches() and split() APIs.
I see that both of these APIs spawn worker processes according to the specified batch size. However, does the split batch data reside locally or remotely? Can you please clarify?
Would you please suggest which is the best approach for my use case?
Thanks in advance.

Chen_Shen · April 5, 2022, 6:20pm

ah @pratap123 you can use ds.repartition(n).map_batches(...), where n controls how many map_batches tasks can run concurrently.
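
A minimal sketch of that suggestion follows. The column name `value`, the function names `double_value` and `run_pipeline`, and the doubling transform are assumptions for illustration only. The idea is that `repartition(n)` splits the dataset into n blocks and `map_batches` processes blocks in separate tasks, so up to n tasks can run at once, subject to the CPUs available.

```python
def double_value(batch):
    """Per-batch transform; `batch` is a pandas DataFrame because
    batch_format="pandas" is passed to map_batches below."""
    batch = batch.copy()
    batch["value"] = batch["value"] * 2  # "value" is a hypothetical column
    return batch


def run_pipeline(csv_path, n):
    """Read a CSV, split it into n blocks, and transform the blocks in parallel."""
    import ray  # imported here so double_value can be tried without Ray

    ds = ray.data.read_csv(csv_path)
    # n blocks allow up to n concurrent map_batches tasks.
    ds = ds.repartition(n)
    return ds.map_batches(double_value, batch_format="pandas")
```

Something like `run_pipeline("records.csv", 4).take(5)` (with a hypothetical file name) would then return the first few transformed rows.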

pratap123 · April 6, 2022, 8:15am

Thanks @Chen_Shen
With ds.repartition(n).map_batches(...) I see that n processes are spawned.
Would you please clarify the differences between the iter_batches(), split() and map_batches() APIs?
