When using a single-node cluster, how to efficiently share a dataframe (for read-only access) between ray actors/tasks?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have to calculate N metrics on a Pandas dataframe (accuracy, roc_auc, PR curve, etc.). Ideally, I would like each metric to be calculated in parallel using a different Ray actor/task. The dataframe itself can be large (~10 million rows).

What is the best way to achieve this via Ray actors or tasks? Do I copy the dataframe to the object store via ray.put()?

At the moment, I am only using a single-node cluster, but hope to scale to multi-node clusters soon.

Thanks,
Abhishek

Yes, using ray.put() to store the dataframe in the object store, or passing the dataframe as an argument in .remote() calls, should allow sharing it between different Ray tasks/actors with zero copy. The only caveat is that the dataframe cannot contain Python objects.
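For the metric use case, a minimal sketch might look like this (the dataframe, the y_true / score columns, and the toy metrics below are placeholders for illustration, not from the original post):

import numpy as np
import pandas as pd
import ray

ray.init()

# Placeholder dataframe: ground-truth labels and model scores.
df = pd.DataFrame({
    "y_true": np.random.randint(0, 2, 1_000_000),
    "score": np.random.rand(1_000_000),
})

# Store the dataframe in the object store once; tasks on the same node
# read it without a per-task copy (provided it contains no Python objects).
df_ref = ray.put(df)

@ray.remote
def accuracy(df, threshold=0.5):
    preds = (df["score"] >= threshold).astype(int)
    return (preds == df["y_true"]).mean()

@ray.remote
def positive_rate(df, threshold=0.5):
    return (df["score"] >= threshold).mean()

# Each task receives the same ObjectRef; Ray resolves it to the shared copy.
print(ray.get([accuracy.remote(df_ref), positive_rate.remote(df_ref)]))

On a single node the main cost is the one-time ray.put(); the metric tasks then run in parallel against the shared copy.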

Consider using Ray Datasets, which can load dataframes via ray.data.from_pandas(pandas_df). For example, to load a dataframe from pandas and split it into 100 blocks for parallel computation:

import pandas as pd
import ray

def batch_fn(batch: pd.DataFrame) -> pd.DataFrame:
    # Applied to each block in parallel.
    return batch * 2

ds = ray.data.from_pandas(df)  # df is the pandas dataframe from the question
ds = ds.repartition(100)       # split into 100 blocks
ds = ds.map_batches(batch_fn)
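As a rough sketch of how this could map onto the metric use case (the y_true / score columns and the squared-error metric are made up for illustration, and this assumes a Ray version where batch_format="pandas" and Dataset.mean(column) are available), each metric can be written as a per-block transform followed by an aggregation:

def squared_error(batch: pd.DataFrame) -> pd.DataFrame:
    # Runs on each block in parallel; returns one value per row.
    return pd.DataFrame({"sq_err": (batch["y_true"] - batch["score"]) ** 2})

metrics_ds = ray.data.from_pandas(df).repartition(100)
mse = metrics_ds.map_batches(squared_error, batch_format="pandas").mean("sq_err")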

Are Python strings in a Pandas dataframe considered “Python objects”?
If yes, which other string format could be chosen to speed up transfers?

Hi @anatolix, thanks for raising the question!

Have you considered what @ericl proposed above? Indeed, Ray Datasets is our recommended solution in this case. But I’m curious if you have specific requirements such that Ray Datasets wouldn’t work.