How severely does this issue affect your experience of using Ray?
High
- None: Just asking a question out of curiosity.
- Low: It annoys or frustrates me for a moment.
- Medium: It contributes significant difficulty to completing my task, but I can work around it.
- High: It blocks me from completing my task.
We are trying to scale up our Ray Data processing pipeline. However, the `read_iceberg` function (which reads from our AWS Glue Data Catalog backed by S3) doesn't seem to scale with either the table size or the cluster size.
When reading a small table (100k rows or fewer), everything works as expected. However, when we point it at our production table, the job freezes completely, with no log output or progress updates.
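To try to narrow this down, we are planning to run the scan planning step with pyiceberg directly, outside of Ray, to see whether walking the table's manifests is already the slow part. This is a rough sketch; the catalog name and table identifier below are placeholders, and we assume the Glue catalog picks up AWS credentials from the environment:

```python
# Standalone planning check, outside of Ray.
from pyiceberg.catalog import load_catalog

# "glue_catalog" and "analytics.events" are placeholders for our setup.
catalog = load_catalog("glue_catalog", **{"type": "glue"})
table = catalog.load_table("analytics.events")

# plan_files() walks the table's manifests; if this alone hangs or returns a huge
# number of tasks, the bottleneck is in metadata planning rather than in Ray.
tasks = list(table.scan().plan_files())
print(f"{len(tasks)} file scan tasks planned")
```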
Here is how we call `read_iceberg`:

```python
import ray

ds = ray.data.read_iceberg(
    table_identifier=table_identifier,
    catalog_kwargs={"name": catalog_name, "type": "glue"},
)
```
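If the planning step looks fine, our next idea is to scope the read down and grow it gradually, assuming the `row_filter` and `selected_fields` arguments available in recent Ray versions; the partition column, predicate, and projected column below are placeholders for our actual schema:

```python
import ray

# Same table_identifier / catalog_name variables as in the call above.
ds = ray.data.read_iceberg(
    table_identifier=table_identifier,
    row_filter="event_date >= '2024-01-01'",  # placeholder partition predicate to prune files
    selected_fields=("event_id",),            # placeholder: project a single column to start
    catalog_kwargs={"name": catalog_name, "type": "glue"},
)
print(ds.count())  # forces execution so we can see whether the scoped read makes progress
```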
Any ideas or pointers to help debug this would be appreciated.