How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I have to calculate N metrics on a Pandas dataframe (accuracy, roc_auc, PR curve, etc.). Ideally, each metric would be calculated in parallel by a different Ray actor/task. The dataframe itself can be large (10MM rows).
What is the best way to achieve this via Ray actors or tasks? Do I copy the dataframe to the object store via ray.put()? At the moment, I am only using a single-node cluster, but I hope to scale to multi-node clusters soon.

Using ray.put() to store the dataframe in the object store, or passing the dataframe as an argument in .remote() calls, should allow sharing the dataframe between different Ray tasks / actors with zero copy. The only caveat is that the dataframe cannot contain Python objects.
Consider using Datasets, which can load dataframes via
ray.data.from_pandas(pandas_df). For example, to load a dataframe from pandas and split it into 100 blocks for parallel computation, do:
```python
import pandas as pd
import ray

def batch_fn(batch: pd.DataFrame) -> pd.DataFrame:
    return batch * 2

ds = ray.data.from_pandas(df)
ds = ds.repartition(100)
ds = ds.map_batches(batch_fn, batch_format="pandas")  # runs blocks in parallel
```
Are Python strings in a Pandas dataframe considered "Python objects"?
If yes, which other string format could be chosen to speed up transfers?