Why is Ray Data running out of memory?

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.52.1
  • Python version: 3.12.3
  • OS: Ubuntu
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant): PyIceberg 0.10.0

3. What happened vs. what you expected:

  • Expected: Ray Data should print the first few rows quickly
  • Actual: OOM, even on a machine with 900 GB of RAM

I’m trying to do something really simple.

    from ray.data import read_iceberg
    ds = read_iceberg(
        table_identifier=TABLE,
        catalog_kwargs=CATALOG_ARGS,
        snapshot_id=3554537921535773348,
    )
    ds.select_columns(["id", "vehicle_name", "city", "datasets"]).show(2)

But Ray runs out of memory right away, even on a machine with 900 GB of RAM.

All the selected columns are short strings.

Update: I tried override_num_blocks=1 as suggested by RunLLM, but it still ran out of memory.

Ray Data can run out of memory even on a machine with plenty of RAM if individual blocks are too large. By default, Ray Data tries to fit each block in memory, so if the dataset is split into too few blocks (e.g., few blocks over large files), a single block can be large enough to cause OOM regardless of total RAM, even when you select only a few columns of short strings.

This can be mitigated by increasing the number of output blocks with the override_num_blocks parameter of read_iceberg, which splits the data into smaller, more manageable chunks. See the Ray Data performance tips and the read_iceberg API docs.

Try:

    ds = read_iceberg(
        table_identifier=TABLE,
        catalog_kwargs=CATALOG_ARGS,
        snapshot_id=3554537921535773348,
        override_num_blocks=1000,  # or higher, depending on table size
    )
    ds.select_columns(["id", "vehicle_name", "city", "datasets"]).show(2)

Would you like more detail on how to choose the right number of blocks or further memory tuning?
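As a rough way to pick a value for override_num_blocks (a back-of-envelope sketch, not a Ray API), divide the table's in-memory size by the per-block size you want to stay under; Ray Data's default target block size is on the order of 128 MiB:

```python
import math

# Illustrative sizing heuristic: choose enough output blocks that each
# block's in-memory size stays under a target budget (default here
# mirrors Ray Data's ~128 MiB default target block size).
def estimate_num_blocks(
    table_bytes: int,
    target_block_bytes: int = 128 * 1024 * 1024,
) -> int:
    """Minimum block count that keeps each block under the target size."""
    return max(1, math.ceil(table_bytes / target_block_bytes))

# Example: a 1 TiB table with a 128 MiB target needs ~8192 blocks.
print(estimate_num_blocks(1024**4))  # -> 8192
```

In practice you would round up generously, since the in-memory (decoded) size of a column can be much larger than its compressed on-disk size.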

The solution is to set override_num_blocks to a large number. I set it to 100_000 and no longer get OOM.