Running batches of data with multiple worker processes

Hi,
I am new to Ray and just started with the evaluation to include in our project.

What would be the best way to process data in parallel batches using remote worker processes?
For example: if I have a few hundred records read from a CSV file, they should be split into n batches and processed (by calling a remote function) in parallel by n workers.
I have come across Ray Datasets, especially map_batches:
https://docs.ray.io/en/latest/data/key-concepts.html?highlight=map_batches#dataset-transforms
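
For context, the plain remote-task approach I had in mind looks roughly like this (the file name, batch count, and per-batch logic below are just placeholders):

import ray
import numpy as np
import pandas as pd

@ray.remote
def process_batch(batch):
    # placeholder for the real per-batch processing
    return len(batch)

ray.init()
df = pd.read_csv("data.csv")      # placeholder path
n = 4                             # number of batches / workers
batches = np.array_split(df, n)   # split into n roughly equal chunks
results = ray.get([process_batch.remote(b) for b in batches])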

Would you please suggest whether this is the best approach, or whether there is a better alternative?

Thanks in advance

Hi @pratap123, yup, Ray Datasets should be the idiomatic solution.
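Something like this (just a sketch; the path and the transform are placeholders you would swap for your own):

import ray

ray.init()
ds = ray.data.read_csv("data.csv")  # placeholder path

def transform(batch):
    # batch is a pandas DataFrame when batch_format="pandas"
    return batch  # replace with your real per-batch logic

ds = ds.map_batches(transform, batch_format="pandas")
ds.show(5)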

Thanks @Chen_Shen for the response. I tried it out, but I see it is running locally in a single process rather than spawning 'n' worker processes.

Here is a code sample:

import ray

# read the CSV into a Dataset (placeholder path)
ds = ray.data.read_csv("data.csv")

class Test:
    def __init__(self):
        pass

    def __call__(self, x):
        # x is a pandas DataFrame batch; pass it through unchanged for now
        return x

ds.map_batches(Test, batch_format="pandas", batch_size=2, compute="tasks")

@Chen_Shen I have also tried the iter_batches() and split() APIs.
I see both of these APIs spawn worker processes according to the batch size specified; however, does the split batch data reside locally or remotely? Can you please clarify? Roughly what I tried is sketched below.
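
Here is a sketch of how I invoked them, reusing the ds from above (process() is just a stand-in defined inline):

def process(batch):
    # stand-in for the real per-batch work
    return len(batch)

# iter_batches(): batches are pulled back into the calling (driver) process
for batch in ds.iter_batches(batch_size=2, batch_format="pandas"):
    process(batch)

# split(): the dataset is divided up front into disjoint shards,
# each of which can then be consumed by its own remote worker
shards = ds.split(4)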
Also, which approach would you suggest as the best fit for my use case?
Thanks in advance.

Ah @pratap123, you can use ds.repartition(n).map_batches(...), where n controls the parallelism of the concurrent map_batches execution; see the sketch below.
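
For example, reusing your Test class and ds from above (just a sketch; 8 is an arbitrary choice):

# 8 partitions -> up to 8 concurrent map_batches tasks
ds.repartition(8).map_batches(Test, batch_format="pandas", batch_size=2, compute="tasks")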

Thanks @Chen_Shen
With ds.repartition(n).map_batches(...), I see n processes being spawned.
Would you please clarify the differences between the iter_batches(), split(), and map_batches() APIs?