Hi, I am using a cluster with 3 worker nodes of 48 cores each plus a 48-core head node. My CSV data contains around 10 crore records, and I am using the map_batches function to parallelize PDF generation with the fillpdf library. Can you please guide me on optimization techniques: how many batches, how many partitions, and anything else I need?
For optimizing Ray Data with 10 crore (100 million) records and a 192-core cluster (48 cores × 4 nodes), you should:
- Set the number of partitions (blocks) to 2–4× the number of available cores, so try 400–800 partitions for good parallelism. Use `ds.repartition(num_blocks=400)` before `map_batches` (source).
- Choose a `batch_size` that balances memory usage and throughput. Start with 10,000–50,000 rows per batch, and monitor memory; adjust as needed (source).
- Use `map_batches` for vectorized operations, and set `num_cpus` per task if each PDF generation is CPU-intensive (source).
- Monitor with the Ray Dashboard and adjust `num_blocks` and `batch_size` to avoid OOM and maximize CPU utilization (source).
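To make the tuning advice above concrete, here is a minimal sketch of how the pieces could fit together. The CSV path, the `name` column, the output paths, and the `generate_pdfs`/`run_pipeline` names are all illustrative assumptions, and the actual fillpdf call is left as a comment because form field names depend on your template:

```python
"""Sketch: parallel PDF generation with Ray Data's map_batches.

Assumptions (not from the thread): the CSV has a "name" column and a
fillable "template.pdf" exists. Adapt paths and fields to your data.
"""

def generate_pdfs(batch):
    # `batch` arrives as a dict of column-name -> array of rows.
    # Fill one PDF per row, e.g. with the fillpdf library:
    #   from fillpdf import fillpdfs
    #   fillpdfs.write_fillable_pdf("template.pdf", path, {"name": name})
    paths = [f"/tmp/{name}_{i}.pdf" for i, name in enumerate(batch["name"])]
    batch["pdf_path"] = paths  # record where each PDF was written
    return batch

def run_pipeline():
    # Driver code; requires a running Ray cluster (ray.init() attaches to it).
    import ray
    ds = ray.data.read_csv("records.csv")
    ds = ds.repartition(400)  # ~2-4x the 192 cores for scheduling granularity
    ds.map_batches(
        generate_pdfs,
        batch_size=10_000,  # starting point; lower it if workers approach OOM
        num_cpus=1,         # CPUs reserved per task; raise if filling is heavy
    ).materialize()         # triggers execution of the lazy pipeline
```

The per-batch function is kept free of Ray imports so it can be unit-tested on a plain dict before launching the full 10-crore run.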
Would you like a sample code snippet or more detailed tuning advice?
Sources:
- Ray Data Performance Tips
- Unstructured Data Ingestion Example
- https://github.com/ray-project/ray/blob/master/doc/source/data/examples/llm_batch_inference_text/content/batch_inference_text.py
Hint: Mention @RunLLM in the post for followups.
@RunLLM yes, can you please share a sample code snippet or more detailed tuning advice?