[Ray Data] Apparent count() bug

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Using the Parquet Row Pruning example, it appears that calling count() on the resulting dataset returns the number of rows in the input dataset, not in the filtered dataset. Here’s a program illustrating the behavior.

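import ray
import pyarrow.dataset as ds  # import the submodule explicitly so ds.field is available

# Read the full file, then the same file with a row filter pushed down to the Parquet reader.
full = ray.data.read_parquet("example://iris.parquet")
filtered = ray.data.read_parquet("example://iris.parquet", filter=ds.field("sepal.length") > 5.0)

print(full.count())               # prints 150
print(filtered.count())           # prints 150, but shouldn't it be smaller??

# Count the rows by actually iterating over the filtered dataset.
num_rows = 0
for _ in filtered.iter_rows():
    num_rows += 1
print(num_rows)                   # prints 118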

If this is a bug, I’ll open a ticket for it.

It would be worth trying fully_executed().

My guess is that count() only read the Parquet metadata for that file.


That worked! After adding fully_executed() to the filtered dataset, the line

print(filtered.count())           # prints 150, but shouldn't it be smaller??

now behaves like this:

print(filtered.count())           # now prints 118

So it makes sense that count() returns the row count from the Parquet metadata the first time, but it is confusing. I believe Spark DataFrames would execute the filter logic for the equivalent count(), if I’m not mistaken. I assumed the same would happen here, but now that I look more closely at the output, nothing is executed for count().

Should I create a PR for the docs to say that count() has this behavior?

The documentation for fully_executed() describes this behavior.

I agree that, coming from Spark, it breaks the mental model.

Hi all, thank you for reporting this bug. I have opened an issue summarizing the discussion in this thread: [Dataset] Filtered Parquet `dataset.count()` returns incorrect row count based on input Dataset metadata · Issue #33766 · ray-project/ray · GitHub

Let’s move further discussion to this issue, and thanks again for reporting!