How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I’m currently running Ray as a single-node cluster. I’m primarily interested in using Ray’s object store to share immutable datasets across processes without duplicating them in memory, and of course in using `ray.remote` to run those processes in parallel (it’s fantastic, thanks!).
I start with an initial Parquet file (~8 GB), which I load into a `pyarrow.Table` and then send to the object store with `ray.put` in the main process. Later, I filter it down with some parallel `ray.remote` functions, after which I can discard the original table.
This process works great, except that the initial step loads the `pyarrow.Table` into the main process’s memory before sending it to the object store. That doubles peak memory consumption until I can drop all references to the non-shared copy of the table.
I’m looking for a simple way to send `pyarrow.Table` objects directly into the object store without first materializing them in the main process’s memory. I’ve explored `ray.data.Dataset`, but it appears to do basically the same thing.
I would love any recommendations.