How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
Using the Parquet Row Pruning example, it appears that calling count() on the resulting dataset returns the number of rows in the input dataset, not in the filtered dataset. Here's a program illustrating the behavior.
```python
import ray
import pyarrow as pa
import pyarrow.dataset  # needed so that pa.dataset.field is available

all = ray.data.read_parquet("example://iris.parquet")
filtered = ray.data.read_parquet(
    "example://iris.parquet",
    filter=pa.dataset.field("sepal.length") > 5.0,
)

print(all.count())       # prints 150
print(filtered.count())  # prints 150, but shouldn't it be smaller??

num_rows = 0
for _ in filtered.iter_rows():
    num_rows += 1
print(num_rows)          # prints 118

print(filtered.count())  # still prints 150, but shouldn't it be smaller??
```
If I change that last `count()` call to:

```python
filtered.fully_executed()
print(filtered.count())  # now prints 118
```
So, it makes sense that count() returns the row count from the Parquet metadata the first time, but it is confusing. I believe Spark DataFrames would execute the logic for the equivalent count(), if I'm not mistaken. I kind of assumed the same would happen here, but now that I look more closely at the output, count() doesn't execute anything.
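For reference, here is a minimal sketch of another way to force the real, post-filter count to be computed without pulling every row to the driver: count rows per batch with map_batches and sum the per-batch counts. The helper name `exact_count` is mine (not part of Ray's API), and I'm assuming the `batch_format="pandas"` path; details may vary slightly across Ray versions.

```python
import pandas as pd
import ray
import pyarrow as pa
import pyarrow.dataset


def exact_count(ds) -> int:
    # Hypothetical helper: map each batch to a one-row DataFrame holding that
    # batch's length, then sum the per-batch counts. This forces the filter
    # to actually run instead of relying on the Parquet metadata.
    counts = ds.map_batches(
        lambda batch: pd.DataFrame({"n": [len(batch)]}),
        batch_format="pandas",
    )
    return sum(row["n"] for row in counts.iter_rows())


filtered = ray.data.read_parquet(
    "example://iris.parquet",
    filter=pa.dataset.field("sepal.length") > 5.0,
)
print(exact_count(filtered))  # 118
```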
Should I create a PR for the docs to say that count() has this behavior?