Ray.data.filter() much slower than without filter

zhh210 · August 22, 2024, 6:54pm

Trying to filter on partitioned parquet files (700 files) of 70M rows and 35 columns (numeric) but it becomes much slower than without filtering on a machine with 8 cpus:

## without filter
%%time
import ray
dt = ray.data.read_parquet('df_data/date_key=2024-02-29')
dt.count()

which takes about 12 minutes which is fine.

CPU times: user 38.4 s, sys: 4.53 s, total: 43 s
Wall time: 12min 38s

However, it will become much slower if I do a filtering:

%%time
dt.filter(lambda x: False if pd.isnull(x['account_created_date']) else (pendulum.parse(x['date_key']) - pendulum.parse(x['account_created_date'])).days <= 30, concurrency=12, num_cpus=0.2).count()

It takes hours without any sign of finishing. It is also extremely slow when I’m trying to do a limit after filter (dt.filter(*).limit(3).show()). Any solutions for this?

sjl · August 22, 2024, 8:12pm

The reason for the time difference between the two is because for purely Parquet read Datasets (i.e. just read_parquet()), this does not trigger execution of the full dataset, and gets the row count from the file metadata. In comparison, read_parquet().filter() will require execution of the dataset in order to get row count, as it is no longer possible to get the row count purely from file metadata.

On the other hand, I would expect that dt.filter(*).limit(3).show() should run relatively fast, since this would only require executing the dataset just enough to get 3 rows. Do you observe the same slowness if you omit the filter (i.e. just dt.limit(3).show())?

zhh210 · August 22, 2024, 8:40pm

I was also expecting dt.filter(*).limit(3).show()to be fast but it’s also taking hours not returning the output. It seems ray==2.34.0 is not aware of the limit of 3 while running filter() . Yesdt.limit(3).show() is fast only takes a few seconds.

sjl · August 22, 2024, 10:24pm

Thanks for reporting, I have opened a new GH issue here: [Data] `Limit` operation does not early-stop dataset execution · Issue #47287 · ray-project/ray · GitHub

zhh210 · August 23, 2024, 2:57pm

Thanks @sjl , I did rerun the dt.filter(*).limit(3).show() in a clean environment and it is able to finish in a few minutes. Last time it was taking hours because all cpus were occupied by the other ray job. So I think ray is still doing optimization when doing a limit after filter but just not as fast as expected.

sjl · August 23, 2024, 5:17pm

Ah got it, so the original issue seems to be resolved?

I think the difference in time that I observed in the reproducible example in my GH issue is due to the metadata fetching time (one vs all files). I’ll go ahead and close off the issue for now, please feel free to follow up or re-open if you run into it again. Thanks!

zhh210 · August 23, 2024, 6:08pm

The original issue is still there:dt.filter(*).count() is taking forever to run even dt.count() takes 12 minutes to return the count.

zhh210 · February 20, 2025, 8:36pm

An update on this: I also did a comparison of iteration rather than counting and see the same behavior on 6M rows.

%%time
import torch
for batch in df.iter_torch_batches(batch_size=1600, dtypes=torch.float32):
    pass

It’s quite fast:

CPU times: user 17.7 s, sys: 2.87 s, total: 20.5 s
Wall time: 1min 54s

Adding a filter significantly slows down the iteration despite the underlying data has smaller population:

%%time
import torch
for batch in df.filter(lambda x: x['feature1'] > 0).iter_torch_batches(batch_size=1600, dtypes=torch.float32):
    pass

Much slower:

CPU times: user 10min 21s, sys: 35.3 s, total: 10min 57s
Wall time: 3h 21min 30s

rliaw · March 6, 2025, 10:42pm

Hey @zhh210, can you try instead pushing the filter into read_parquet like:

import pyarrow as pa
df = ray.data.read_parquet(..., filter=pa.dataset.field("feature1") > 0)

Topic		Replies	Views
[Ray Data] Apparent count() bug Ray Data	4	365	March 27, 2023
Ray data creating multiple datasets and repeating map operations on ray dashboard Ray Train	2	166	November 21, 2024
Ray data.read_csv keeps pausing Ray Data	3	412	September 28, 2023
Data loading of parquet files is very memory consuming Ray Data	2	1428	June 21, 2022
[Dataset] Ray Dataset reading multiple parquet files with different columns crashes due to TProtocolException: Exceeded size limit Ray Data	14	1900	November 17, 2022

Ray.data.filter() much slower than without filter

Related topics