Hi, I am using ds.map_batches(generate_pdf_batch, batch_size=1000, batch_format="pandas"), but it is running on a single core of the head node while all the other workers and cores are idle. Can you please help?
Your code is only using the head node and a single core because, by default, Ray Data will use as many parallel tasks as there are blocks in your dataset. If your DataFrame is small or not partitioned, it will create only one block, resulting in a single task running on the head node. To utilize all workers and cores, repartition your dataset into more blocks before calling map_batches:
import ray

ds = ray.data.from_pandas(df)  # df is your pandas DataFrame
ds = ds.repartition(8)  # set to roughly the total number of CPUs across the cluster
results = ds.map_batches(generate_pdf_batch, batch_size=1000, batch_format="pandas")
This will split your dataset into multiple blocks, allowing Ray to schedule tasks across all available resources. For more details, see the discussion and solution in Ray Discuss: Only head node is processing task.
Would you like more detail on how to choose the number of partitions or how Ray schedules tasks?
My dataset contains 1 crore (10 million) records, and I am using 3 workers with 32 cores each plus a 16-core head node. How many partitions do I need, and what else can I do to optimize?
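A common rule of thumb (a general heuristic, not something stated in the thread above) is to use roughly 2-4x as many partitions as total CPUs, so each core processes several blocks and stragglers even out. Applied to the cluster described in the question, the arithmetic looks like this; counting only worker cores (leaving the head node for the driver) is an assumption:

```python
# Rough partition-count estimate for the cluster described above.
# Assumptions: 1 crore = 10,000,000 rows; head-node cores are left
# for the driver, so only worker cores are counted.

total_rows = 10_000_000          # 1 crore records
worker_cpus = 3 * 32             # 3 workers x 32 cores = 96
partitions = worker_cpus * 2     # ~2x CPUs so stragglers even out -> 192
rows_per_partition = total_rows // partitions

print(partitions)          # 192
print(rows_per_partition)  # 52083
```

You would then call ds.repartition(192) (or similar) before map_batches; the exact multiplier is worth benchmarking, since very small blocks add per-task scheduling overhead.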