When using a single-node cluster, how to efficiently share a dataframe (for read-only access) between ray actors/tasks?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have to calculate N metrics on a Pandas dataframe (accuracy, roc_auc, PR curve, etc.). Ideally, I would like each metric to be calculated in parallel using a different Ray actor/task. The dataframe itself can be large (~10 million rows).

What is the best way to achieve this via Ray actors or tasks? Do I copy the dataframe to the object store via ray.put()?

At the moment, I am only using a single-node cluster, but hope to scale to multi-node clusters soon.

Thanks,
Abhishek

Yes, using ray.put() to store the dataframe in the object store, or passing the dataframe as an argument in .remote() calls, should allow sharing it between different Ray tasks/actors with zero copy. The only caveat is that the dataframe cannot contain Python objects.
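For the metric use case, a minimal sketch might look like this (the dataframe, the y_true / score columns, and the toy metrics below are placeholders for illustration, not from the original post):

import numpy as np
import pandas as pd
import ray

ray.init()

# Placeholder dataframe: ground-truth labels and model scores.
df = pd.DataFrame({
    "y_true": np.random.randint(0, 2, 1_000_000),
    "score": np.random.rand(1_000_000),
})

# Store the dataframe in the object store once; tasks on the same node
# read it without a per-task copy (provided it contains no Python objects).
df_ref = ray.put(df)

@ray.remote
def accuracy(df, threshold=0.5):
    preds = (df["score"] >= threshold).astype(int)
    return (preds == df["y_true"]).mean()

@ray.remote
def positive_rate(df, threshold=0.5):
    return (df["score"] >= threshold).mean()

# Each task receives the same ObjectRef; Ray resolves it to the shared copy.
print(ray.get([accuracy.remote(df_ref), positive_rate.remote(df_ref)]))

On a single node the main cost is the one-time ray.put(); the metric tasks then run in parallel against the shared copy.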

Consider using Ray Datasets, which can load dataframes via ray.data.from_pandas(pandas_df). For example, to load a dataframe from pandas and split it into 100 blocks for parallel computation:

import pandas as pd
import ray

def batch_fn(batch: pd.DataFrame) -> pd.DataFrame:
    # Applied to each block in parallel.
    return batch * 2

ds = ray.data.from_pandas(df)  # df is the pandas dataframe from the question
ds = ds.repartition(100)       # split into 100 blocks
ds = ds.map_batches(batch_fn)
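As a rough sketch of how this could map onto the metric use case (the y_true / score columns and the squared-error metric are made up for illustration, and this assumes a Ray version where batch_format="pandas" and Dataset.mean(column) are available), each metric can be written as a per-block transform followed by an aggregation:

def squared_error(batch: pd.DataFrame) -> pd.DataFrame:
    # Runs on each block in parallel; returns one value per row.
    return pd.DataFrame({"sq_err": (batch["y_true"] - batch["score"]) ** 2})

metrics_ds = ray.data.from_pandas(df).repartition(100)
mse = metrics_ds.map_batches(squared_error, batch_format="pandas").mean("sq_err")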

Are Python strings in a Pandas dataframe considered “Python objects”?
If yes, which other string format could be chosen to speed up transfers?

Hi @anatolix, thanks for raising the question!

Have you considered what @ericl proposed above? Indeed, Ray Datasets is our recommended solution in this case. But I’m curious if you have specific requirements such that Ray Datasets wouldn’t work.