Why is Ray Data running out of memory?

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.52.1
  • Python version: 3.12.3
  • OS: Ubuntu
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant): PyIceberg 0.10.0

3. What happened vs. what you expected:

  • Expected: Ray Data should print the first few rows quickly
  • Actual: OOM, even on a machine with 900 GB of RAM

I’m trying to do something really simple.

    from ray.data import read_iceberg
    ds = read_iceberg(
        table_identifier=TABLE,
        catalog_kwargs=CATALOG_ARGS,
        snapshot_id=3554537921535773348,
    )
    ds.select_columns(["id", "vehicle_name", "city", "datasets"]).show(2)

But Ray runs out of memory right away, even on a machine with 900 GB of RAM.

All the selected columns are short strings.

Update: I tried override_num_blocks=1 as suggested by RunLLM, but it still ran out of memory.

Ray Data can run out of memory even on a machine with plenty of RAM if individual blocks are too large. By default, Ray Data tries to fit each block in memory, so if the dataset is split into too few blocks (e.g., few blocks over large files), a single block can be large enough to cause OOM regardless of total RAM, even when you select only a few columns of short strings.

This can be mitigated by increasing the number of output blocks with the override_num_blocks parameter of read_iceberg, which splits the data into smaller, more manageable chunks. See the Ray Data performance tips and the read_iceberg API docs.

Try:

    ds = read_iceberg(
        table_identifier=TABLE,
        catalog_kwargs=CATALOG_ARGS,
        snapshot_id=3554537921535773348,
        override_num_blocks=1000,  # or higher, depending on table size
    )
    ds.select_columns(["id", "vehicle_name", "city", "datasets"]).show(2)

Would you like more detail on how to choose the right number of blocks or further memory tuning?
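As a rough way to pick a value for override_num_blocks (a back-of-envelope sketch, not a Ray API), divide the table's in-memory size by the per-block size you want to stay under; Ray Data's default target block size is on the order of 128 MiB:

```python
import math

# Illustrative sizing heuristic: choose enough output blocks that each
# block's in-memory size stays under a target budget (default here
# mirrors Ray Data's ~128 MiB default target block size).
def estimate_num_blocks(
    table_bytes: int,
    target_block_bytes: int = 128 * 1024 * 1024,
) -> int:
    """Minimum block count that keeps each block under the target size."""
    return max(1, math.ceil(table_bytes / target_block_bytes))

# Example: a 1 TiB table with a 128 MiB target needs ~8192 blocks.
print(estimate_num_blocks(1024**4))  # -> 8192
```

In practice you would round up generously, since the in-memory (decoded) size of a column can be much larger than its compressed on-disk size.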

The solution is to set override_num_blocks to a large number. I set it to 100_000 and no longer get OOM.