Hi, I am using a cluster with 3 worker nodes of 48 cores each plus a 48-core head node. My CSV data contains around 10 crore records, and I am using the map_batches function to parallelize PDF generation with the fillpdf library. Can you please guide me on optimization techniques: how many batches, how many partitions, and anything else I need?
For optimizing Ray Data with 10 crore (100 million) records and a 192-core cluster (48 cores × 4 nodes), you should:
- Set the number of partitions (blocks) to 2–4× the number of available cores, so try 400–800 partitions for good parallelism. Use `ds.repartition(num_blocks=400)` before `map_batches` (source).
- Choose a `batch_size` that balances memory usage and throughput. Start with 10,000–50,000 rows per batch, and monitor memory; adjust as needed (source).
- Use `map_batches` for vectorized operations, and set `num_cpus` per task if each PDF generation is CPU-intensive (source).
- Monitor with the Ray Dashboard and adjust `num_blocks` and `batch_size` to avoid OOM and maximize CPU utilization (source).
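To make the tuning advice above concrete, here is a minimal sketch of how the pieces could fit together. The CSV path, the `name` column, the output paths, and the `generate_pdfs`/`run_pipeline` names are all illustrative assumptions, and the actual fillpdf call is left as a comment because form field names depend on your template:

```python
"""Sketch: parallel PDF generation with Ray Data's map_batches.

Assumptions (not from the thread): the CSV has a "name" column and a
fillable "template.pdf" exists. Adapt paths and fields to your data.
"""

def generate_pdfs(batch):
    # `batch` arrives as a dict of column-name -> array of rows.
    # Fill one PDF per row, e.g. with the fillpdf library:
    #   from fillpdf import fillpdfs
    #   fillpdfs.write_fillable_pdf("template.pdf", path, {"name": name})
    paths = [f"/tmp/{name}_{i}.pdf" for i, name in enumerate(batch["name"])]
    batch["pdf_path"] = paths  # record where each PDF was written
    return batch

def run_pipeline():
    # Driver code; requires a running Ray cluster (ray.init() attaches to it).
    import ray
    ds = ray.data.read_csv("records.csv")
    ds = ds.repartition(400)  # ~2-4x the 192 cores for scheduling granularity
    ds.map_batches(
        generate_pdfs,
        batch_size=10_000,  # starting point; lower it if workers approach OOM
        num_cpus=1,         # CPUs reserved per task; raise if filling is heavy
    ).materialize()         # triggers execution of the lazy pipeline
```

The per-batch function is kept free of Ray imports so it can be unit-tested on a plain dict before launching the full 10-crore run.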
Would you like a sample code snippet or more detailed tuning advice?
Sources:
- Ray Data Performance Tips
- Unstructured Data Ingestion Example
- https://github.com/ray-project/ray/blob/master/doc/source/data/examples/llm_batch_inference_text/content/batch_inference_text.py
Hint: Mention @RunLLM in the post for followups.
@RunLLM yes, can you please share a sample code snippet or more detailed tuning advice?