Loading pyarrow.Table directly into ObjectStore from Parquet

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m currently running Ray as a single-node cluster. My primary use case is Ray’s ObjectStore: sharing immutable datasets across processes without duplicating objects in memory, with ray.remote running those parallel processes (it’s fantastic, thanks!).

I start with an initial Parquet file (~8 GB), which I load into a pyarrow.Table and then send to the object store with ray.put in the main process. Later I filter it down with some parallel ray.remote functions, after which I can discard the original table.
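For concreteness, a minimal sketch of that workflow (the file path, column name, and filter thresholds are made-up placeholders, not from my actual code):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
import ray

ray.init()

# Load the full table in the driver process (~8 GB resident here)...
table = pq.read_table("data.parquet")  # placeholder path

# ...then copy it into the object store; the driver-side copy still exists.
table_ref = ray.put(table)

@ray.remote
def filter_table(table: pa.Table, threshold: float) -> pa.Table:
    # Workers receive the table deserialized out of shared object-store memory.
    mask = pc.greater(table["value"], threshold)  # "value" is a made-up column
    return table.filter(mask)

# Fan out the filtering across tasks; thresholds are arbitrary examples.
filtered = ray.get([filter_table.remote(table_ref, t) for t in (0.5, 0.9)])

# Peak memory stays doubled until this reference is dropped.
del table
```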

This process works great, except for the initial step, where I’ve loaded the pyarrow.Table into the main process’s memory before sending it off to the object store. This doubles peak memory consumption until I can remove references to the non-shared copy of the table.

I would love a simple way to send pyarrow.Table objects directly into the object store without first loading them into the main process’s memory. I’ve explored ray.data.Dataset, but it appears to do essentially the same thing.
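For reference, the ray.data exploration looked roughly like this (the path and predicate are placeholders), which for this use case seemed to end up in the same place:

```python
import ray

# read_parquet builds a Dataset whose blocks end up in the object store.
ds = ray.data.read_parquet("data.parquet")

# Filtering is expressed on the Dataset rather than on a single pyarrow.Table.
filtered = ds.filter(lambda row: row["value"] > 0.5)
```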

I would love any recommendations.

@akoumjian

I would love a simple way to send pyarrow.Table objects directly into the object store without first loading them into the main process’s memory. I’ve explored ray.data.Dataset, but it appears to do essentially the same thing.

I’m not sure there’s a way to load an object directly into the Ray object store without explicitly calling ray.put(var), where the object already has to exist in memory. There are ways to read a structured file as a pyarrow table and then put that table into the Ray object store.
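One possible pattern along those lines, as a sketch rather than a dedicated API (the path and function name are placeholders): read the Parquet file inside a ray.remote task, so the table is materialized in a worker process and the large return value is stored in the object store, leaving the driver with only an ObjectRef instead of a second full copy.

```python
import pyarrow.parquet as pq
import ray

@ray.remote
def load_table(path: str):
    # The table is built in a worker; Ray stores the large return value
    # in the object store, so the driver never holds its own full
    # in-process copy of the table.
    return pq.read_table(path)

table_ref = load_table.remote("data.parquet")  # placeholder path
# table_ref can then be passed to the downstream filtering tasks.
```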

cc: @chengsu Is there a way to directly load or read a PyArrow table into the Ray object store, without first reading it as a Ray Data object?