How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I’m currently running Ray as a single-node cluster, where I am primarily concerned with using Ray’s ObjectStore to share immutable datasets across different processes without duplicating objects in memory, and of course using ray.remote to run those parallel processes (it’s fantastic, thanks!).
I start with an initial Parquet file (~8 GB), which I load into a pyarrow.Table and then send into the object store with ray.put in the main process. I then filter it down using some parallel ray.remote functions, after which I can discard the original table.
This process works great, except for the initial step, where I’ve loaded the pyarrow.Table into the main process’s memory before sending it off to the object store. This doubles peak memory consumption until I can remove references to the non-shared copy of the table.
I would love a simple way to send pyarrow.Table objects directly into the object store, without first loading them into the main process’s memory. I’ve explored ray.data.Dataset, but it looks like it’s basically doing the same thing.
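For reference, this is the sort of thing I tried with Ray Data (path and predicate are again placeholders):

```python
import ray

ds = ray.data.read_parquet("/path/to/data.parquet")  # placeholder path
filtered = ds.filter(lambda row: row["value"] > 0)    # placeholder predicate
```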
I would love any recommendations.