Loading pyarrow.Table directly into ObjectStore from Parquet

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m currently running Ray as a single-node cluster. My primary use case is Ray’s ObjectStore: sharing immutable datasets across processes without duplicating objects in memory, with ray.remote running those parallel processes (it’s fantastic, thanks!).

I start with an initial Parquet file (~8 GB), which I load into a pyarrow.Table and then send to the object store with ray.put in the main process. Later I filter it down with some parallel ray.remote functions, after which I can discard the original table.
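For concreteness, a minimal sketch of that workflow (the file path, column name, and filter thresholds are made-up placeholders, not from my actual code):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
import ray

ray.init()

# Load the full table in the driver process (~8 GB resident here)...
table = pq.read_table("data.parquet")  # placeholder path

# ...then copy it into the object store; the driver-side copy still exists.
table_ref = ray.put(table)

@ray.remote
def filter_table(table: pa.Table, threshold: float) -> pa.Table:
    # Workers receive the table deserialized out of shared object-store memory.
    mask = pc.greater(table["value"], threshold)  # "value" is a made-up column
    return table.filter(mask)

# Fan out the filtering across tasks; thresholds are arbitrary examples.
filtered = ray.get([filter_table.remote(table_ref, t) for t in (0.5, 0.9)])

# Peak memory stays doubled until this reference is dropped.
del table
```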

This process works great, except for the initial step, where I’ve loaded the pyarrow.Table into the main process’s memory before sending it off to the object store. This doubles peak memory consumption until I can remove references to the non-shared copy of the table.

I would love a simple way to send pyarrow.Table objects directly into the object store without first loading them into the main process’s memory. I’ve explored ray.data.Dataset, but it appears to do essentially the same thing.
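For reference, the ray.data exploration looked roughly like this (the path and predicate are placeholders), which for this use case seemed to end up in the same place:

```python
import ray

# read_parquet builds a Dataset whose blocks end up in the object store.
ds = ray.data.read_parquet("data.parquet")

# Filtering is expressed on the Dataset rather than on a single pyarrow.Table.
filtered = ds.filter(lambda row: row["value"] > 0.5)
```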

I would love any recommendations.

@akoumjian

I would love a simple way to send pyarrow.Table objects directly into the object store without first loading them into the main process’s memory. I’ve explored ray.data.Dataset, but it appears to do essentially the same thing.

I’m not sure there’s a way to load an object directly into the Ray object store without explicitly calling ray.put(var), where the object already has to exist in memory. There are ways to read a structured file as a pyarrow table and then put that table into the Ray object store.
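One possible pattern along those lines, as a sketch rather than a dedicated API (the path and function name are placeholders): read the Parquet file inside a ray.remote task, so the table is materialized in a worker process and the large return value is stored in the object store, leaving the driver with only an ObjectRef instead of a second full copy.

```python
import pyarrow.parquet as pq
import ray

@ray.remote
def load_table(path: str):
    # The table is built in a worker; Ray stores the large return value
    # in the object store, so the driver never holds its own full
    # in-process copy of the table.
    return pq.read_table(path)

table_ref = load_table.remote("data.parquet")  # placeholder path
# table_ref can then be passed to the downstream filtering tasks.
```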

cc: @chengsu Is there a way to directly load or read a PyArrow table into the Ray object store, without first reading it as a Ray Data object?