How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
Using the Parquet Row Pruning example, it appears that calling count() on the resulting dataset returns the number of rows in the input dataset, not in the filtered dataset. Here's a program illustrating the behavior.
```python
import ray
import pyarrow as pa
import pyarrow.dataset  # needed so that pa.dataset.field is available

all = ray.data.read_parquet("example://iris.parquet")
filtered = ray.data.read_parquet(
    "example://iris.parquet",
    filter=pa.dataset.field("sepal.length") > 5.0,
)

print(all.count())       # prints 150
print(filtered.count())  # prints 150, but shouldn't it be smaller??

num_rows = 0
for _ in filtered.iter_rows():
    num_rows += 1
print(num_rows)          # prints 118

print(filtered.count())  # still prints 150, but shouldn't it be smaller??
```
If I change that last `count()` call to:

```python
filtered.fully_executed()
print(filtered.count())  # now prints 118
```
So, it makes sense that count() returns the row count from the Parquet metadata the first time, but it is confusing. I believe Spark DataFrames would execute the logic for the equivalent count(), if I'm not mistaken. I kind of assumed the same would happen here, but now that I look more closely at the output, count() doesn't execute anything.
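For reference, here is a minimal sketch of another way to force the real, post-filter count to be computed without pulling every row to the driver: count rows per batch with map_batches and sum the per-batch counts. The helper name `exact_count` is mine (not part of Ray's API), and I'm assuming the `batch_format="pandas"` path; details may vary slightly across Ray versions.

```python
import pandas as pd
import ray
import pyarrow as pa
import pyarrow.dataset


def exact_count(ds) -> int:
    # Hypothetical helper: map each batch to a one-row DataFrame holding that
    # batch's length, then sum the per-batch counts. This forces the filter
    # to actually run instead of relying on the Parquet metadata.
    counts = ds.map_batches(
        lambda batch: pd.DataFrame({"n": [len(batch)]}),
        batch_format="pandas",
    )
    return sum(row["n"] for row in counts.iter_rows())


filtered = ray.data.read_parquet(
    "example://iris.parquet",
    filter=pa.dataset.field("sepal.length") > 5.0,
)
print(exact_count(filtered))  # 118
```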
Should I create a PR for the docs to say that count() has this behavior?