How to use Ray to parallelize splitting a cell?

sh2022515 · May 28, 2022, 7:20am

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

Hi Ray community,

I am learning Ray for a parallel data processing project at university. I have a large dataframe (40GB), and I want to split the cells in one column into two cells. My Ray cluster connects two nodes.
At the split stage, my computer cannot finish processing the task; the kernel always dies at some point. Is there any way to modify the code so that it takes advantage of Ray’s parallel processing?

Thank you very much!

My code is as follows:

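import ray
import modin.pandas as pd

# Connect to the existing Ray cluster (head node address).
ray.init(address="10.203.81.23:6379")

# Read the two input files as Modin dataframes.
df1 = pd.read_csv('./transactions_1.csv')
df2 = pd.read_csv('./transactions_2.csv')

# Combine them; pd.concat replaces the deprecated DataFrame.append.
df = pd.concat([df1, df2])

# Split each timestamp into separate date and time columns.
df[['date', 'time']] = df['timestamp'].str.split(' ', expand=True)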

ericl · May 31, 2022, 9:31pm

It seems like Modin has a way to support reading multiple csvs in parallel: see the Modin Discuss thread "Read multiple CSV files" (post #4 by EvanZ, General Questions).

You could also consider using ray.data.read_csv(['file1.csv', 'file2.csv', ...]) instead if that doesn’t work, followed by ds.map_batches(splitter_fn).
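A minimal sketch of that suggestion, not a tested solution for the 40GB case. It assumes every `timestamp` value has the form "date time" separated by a single space, that the CSV paths are readable from every node, and that Ray Data is installed. `splitter_fn` is the placeholder name from the post above; `split_with_ray_data`, `out_dir`, and the `address="auto"` default are made up for illustration.

```python
import pandas as pd


def splitter_fn(batch: pd.DataFrame) -> pd.DataFrame:
    """Split 'timestamp' into 'date' and 'time' for one batch of rows."""
    # n=1 splits on the first space only, so at most two columns come back.
    parts = batch["timestamp"].str.split(" ", n=1, expand=True)
    # reindex guarantees both columns exist even if no row contains a space.
    parts = parts.reindex(columns=[0, 1])
    batch = batch.copy()  # leave the caller's batch untouched
    batch["date"] = parts[0]
    batch["time"] = parts[1]
    return batch


def split_with_ray_data(paths, out_dir, address="auto"):
    """Hypothetical driver: read CSVs, split per batch, write results out."""
    # Imported here so splitter_fn can be tried without a Ray installation.
    import ray

    ray.init(address=address, ignore_reinit_error=True)
    ds = ray.data.read_csv(paths)
    # Each batch arrives as a pandas DataFrame and is processed independently.
    ds = ds.map_batches(splitter_fn, batch_format="pandas")
    # Writing from the workers avoids collecting everything on the driver.
    ds.write_csv(out_dir)
```

Because each batch is handled on its own, no single process has to hold the whole dataframe, which is the part that seems to kill the kernel in the original code.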

sh2022515 · June 1, 2022, 8:42am

Thank you @ericl for your reply.
For the split step, is there any way to speed it up by running it in parallel?

df[['date', 'time']] = df['timestamp'].str.split(' ', expand = True)
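One possible way to parallelize this step with plain Ray tasks is sketched below, since the split is independent for every row. This is only an illustration under assumptions: one task per input file, files reachable from the node that runs the task, and timestamps of the form "date time". The names `split_timestamp`, `process_file`, `process_files_with_ray`, and the `chunksize` default are hypothetical and not from the thread.

```python
import pandas as pd


def split_timestamp(chunk: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of chunk with 'date' and 'time' columns added."""
    parts = chunk["timestamp"].str.split(" ", n=1, expand=True)
    parts = parts.reindex(columns=[0, 1])  # both columns always present
    return chunk.assign(date=parts[0], time=parts[1])


def process_file(in_path, out_path, chunksize=1_000_000):
    """Stream one CSV through the split in bounded chunks; return rows written."""
    rows = 0
    for i, chunk in enumerate(pd.read_csv(in_path, chunksize=chunksize)):
        # The first chunk creates the file with a header, later chunks append.
        split_timestamp(chunk).to_csv(
            out_path, mode="w" if i == 0 else "a", header=(i == 0), index=False
        )
        rows += len(chunk)
    return rows


def process_files_with_ray(jobs):
    """jobs is a list of (in_path, out_path) pairs; one Ray task per file."""
    # Imported here so the pandas-only functions work without Ray installed.
    import ray

    remote_process = ray.remote(process_file)
    futures = [remote_process.remote(src, dst) for src, dst in jobs]
    return ray.get(futures)  # row counts, in the same order as jobs
```

With only two input files this gives at most two parallel tasks; splitting the CSVs into more files beforehand would let more workers participate. Reading in chunks keeps each worker's memory use roughly bounded by `chunksize` rows.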
