Hi, I’m using Ray with pandarallel to process multiple data.
Here’s the example:
import math
import ray
import numpy as np
import pandas as pd
from pandarallel import pandarallel
def calc(x):
# import math
return math.sin(x.a**2) + math.sin(x.b**2)
@ray.remote
def func(df):
# pandarallel.initialize(nb_workers=2)
return df.parallel_apply(calc, axis=1)
def main():
df_size = int(5e3)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
b=np.random.rand(df_size)))
df_list = [df, df]
futures = [func.remote(df) for df in df_list]
res = ray.get(futures)
print(res)
if __name__ == '__main__':
ray.init(num_cpus=2)
main()
It would raise this error:
Traceback (most recent call last):
File "/home/xin/Documents/github/check_s5p_lnox/test_pandarallel.py", line 29, in <module>
main()
File "/home/xin/Documents/github/check_s5p_lnox/test_pandarallel.py", line 22, in main
res = ray.get(futures)
File "/home/xin/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/xin/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1713, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::func() (pid=44305, ip=10.120.12.126)
File "/home/xin/Documents/github/check_s5p_lnox/test_pandarallel.py", line 14, in func
return df.parallel_apply(calc, axis=1)
File "/home/xin/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'parallel_apply'
After adding pandarallel.initialize(nb_workers=2)
and import math
in the functions, it works. But I suppose multiple initializations would take much more memory. Is it possible to initialize in the main function?
Any idea about the correct way?
Thanks!