Hi, I am using ds.map_batches(generate_pdf_batch, batch_size=1000, batch_format="pandas"), but it is running on a single core of the head node while all the other workers and cores are idle. Can you please help?
Your code is only using the head node and a single core because, by default, Ray Data will use as many parallel tasks as there are blocks in your dataset. If your DataFrame is small or not partitioned, it will create only one block, resulting in a single task running on the head node. To utilize all workers and cores, repartition your dataset into more blocks before calling map_batches:
import ray

ds = ray.data.from_pandas(df)  # df is your pandas DataFrame
ds = ds.repartition(8)  # set to roughly the total number of CPUs across the cluster
results = ds.map_batches(generate_pdf_batch, batch_size=1000, batch_format="pandas")
This will split your dataset into multiple blocks, allowing Ray to schedule tasks across all available resources. For more details, see the discussion and solution in Ray Discuss: Only head node is processing task.
Would you like more detail on how to choose the number of partitions or how Ray schedules tasks?
My dataset contains 1 crore (10 million) records, and I am using 3 workers with 32 cores each plus a 16-core head node. How many partitions do I need, and what else can I do to optimize?
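A common rule of thumb (a general heuristic, not something stated in the thread above) is to use roughly 2-4x as many partitions as total CPUs, so each core processes several blocks and stragglers even out. Applied to the cluster described in the question, the arithmetic looks like this; counting only worker cores (leaving the head node for the driver) is an assumption:

```python
# Rough partition-count estimate for the cluster described above.
# Assumptions: 1 crore = 10,000,000 rows; head-node cores are left
# for the driver, so only worker cores are counted.

total_rows = 10_000_000          # 1 crore records
worker_cpus = 3 * 32             # 3 workers x 32 cores = 96
partitions = worker_cpus * 2     # ~2x CPUs so stragglers even out -> 192
rows_per_partition = total_rows // partitions

print(partitions)          # 192
print(rows_per_partition)  # 52083
```

You would then call ds.repartition(192) (or similar) before map_batches; the exact multiplier is worth benchmarking, since very small blocks add per-task scheduling overhead.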