How to use Ray to parallelize splitting a cell?

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi Ray community,

I am learning Ray for a parallel data processing project at university. I have a large dataframe (40GB), and I want to split each cell in one column into two cells. My cluster has two connected nodes.
At the split stage, my computer cannot finish processing the task; the kernel always dies at some point. Is there any way to modify the code so that it takes advantage of Ray’s parallel processing?

Thank you very much!

My code is as follows:

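import ray
import modin.pandas as pd

# Connect to the existing Ray cluster.
ray.init(address="10.203.81.23:6379")

# Read both CSV files as Modin dataframes.
df1 = pd.read_csv('./transactions_1.csv')
df2 = pd.read_csv('./transactions_2.csv')

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df = pd.concat([df1, df2])

# Split on the first space only (assumes "date time" values), giving two columns.
df[['date', 'time']] = df['timestamp'].str.split(' ', n=1, expand=True)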

It seems like Modin has a way to read multiple CSVs in parallel; see the Modin Discuss thread "Read multiple CSV files" (post #4 by EvanZ, General Questions).

You could also consider using ray.data.read_csv(['file1.csv', 'file2.csv', ...]) instead if that doesn’t work, followed by ds.map_batches(splitter_fn).
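A minimal sketch of that Ray Data route, under a few assumptions: the `timestamp` column holds "date time" strings separated by a single space, and the names `split_timestamp`, `split_csvs_with_ray`, and `out_dir` are hypothetical, made up for illustration. Each batch is handed to the splitter as a regular pandas DataFrame, so the split runs on many blocks in parallel and the full 40GB never has to fit in one process.

```python
import pandas as pd


def split_timestamp(batch: pd.DataFrame) -> pd.DataFrame:
    """Split 'timestamp' into 'date' and 'time' columns for one batch."""
    # n=1 splits on the first space only, so at most two parts come back.
    parts = batch["timestamp"].str.split(" ", n=1, expand=True)
    # Guarantee both columns exist even if no value in this batch has a space.
    parts = parts.reindex(columns=[0, 1])
    batch = batch.copy()  # leave the input batch untouched
    batch["date"] = parts[0]
    batch["time"] = parts[1]
    return batch


def split_csvs_with_ray(paths, out_dir):
    """Hypothetical driver: read CSVs, split each batch, write the result."""
    # Imported here so split_timestamp can be tried without Ray installed.
    # Assumes ray.init(...) has already connected to the cluster.
    import ray

    ds = ray.data.read_csv(paths)
    # batch_format="pandas" makes each batch a pandas DataFrame.
    ds = ds.map_batches(split_timestamp, batch_format="pandas")
    # Write the output as CSV files instead of collecting it on the driver.
    ds.write_csv(out_dir)
```

For example, `split_csvs_with_ray(['./transactions_1.csv', './transactions_2.csv'], './split_out')` would process both files. On a multi-node cluster, the input and output paths generally need to be on storage that every node can reach.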

Thank you @ericl for your reply.
For the split step, is there any way to speed it up by running it in parallel?

df[['date', 'time']] = df['timestamp'].str.split(' ', expand=True)