Ray optimization techniques

Hi, I am using a cluster with 3 workers of 48 cores each plus a 48-core head node. My CSV data contains around 10 crore (100 million) records, and I am using the map_batches function to parallelize PDF generation with the fillpdf library. Can you please guide me on optimization techniques: how many batches, how many partitions, and anything else I should configure?

For optimizing Ray Data with 10 crore (100 million) records and a 192-core cluster (48 cores × 4 nodes), you should:

  • Set the number of partitions (blocks) to 2–4× the number of available cores; with 192 cores, that is roughly 400–800 partitions for good parallelism. Call ds.repartition(400) before map_batches.
  • Choose a batch_size that balances memory usage and throughput. Start with 10,000–50,000 rows per batch, monitor memory, and adjust as needed.
  • Use map_batches for vectorized operations, and set num_cpus per task if each PDF-generation task is CPU-intensive.
  • Monitor with the Ray Dashboard and adjust the number of blocks and batch_size to avoid OOM errors and maximize CPU utilization.

Would you like a sample code snippet or more detailed tuning advice?

Hint: Mention @RunLLM in the post for followups.

@RunLLM yes, can you please share a sample code snippet or more detailed tuning advice?