I would like to use a scikit-learn pipeline with a Ray cluster to parallelize the computation. I found the example "Distributed Scikit-learn / Joblib — Ray 2.0.0.dev0".
I tried the code below, but it does not run in parallel:
import joblib
import pandas as pd
from ray.util.joblib import register_ray

register_ray()
with joblib.parallel_backend('ray'):
    df = pd.read_csv(filepath, sep=sep, encoding=encoding, on_bad_lines='skip', low_memory=False)
    y = df.pop('target')
    X = df.copy()
    out = pipe.fit_transform(X, y)
If I use import modin.pandas as pd instead,
the fit method fails with an error saying that X and y are not pandas DataFrame types.
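
A minimal sketch of where a joblib backend can take effect in a pipeline, offered as an illustration rather than a confirmed diagnosis. A joblib backend only receives work that an estimator dispatches through joblib (for example via an n_jobs parameter); steps that do not use joblib run in the calling process regardless of the backend. The synthetic data, the step names, the choice of estimators, and the helper fit_pipeline are all assumptions, not taken from the code above.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_pipeline(backend="loky", n_jobs=2):
    """Fit a small pipeline whose final step dispatches work through joblib."""
    # Synthetic data stands in for the CSV from the post (assumption).
    X, y = make_classification(n_samples=200, n_features=8, random_state=0)
    pipe = Pipeline([
        # StandardScaler does not use joblib, so the backend has no effect here.
        ("scale", StandardScaler()),
        # RandomForestClassifier builds its trees through joblib when n_jobs is set.
        ("clf", RandomForestClassifier(n_estimators=20, n_jobs=n_jobs, random_state=0)),
    ])
    # Joblib calls made during fit are routed to the selected backend.
    with joblib.parallel_backend(backend):
        pipe.fit(X, y)
    return pipe, pipe.score(X, y)


if __name__ == "__main__":
    _, accuracy = fit_pipeline()
    print(accuracy)
```

To target Ray with this sketch, one would call register_ray() from ray.util.joblib first and pass backend="ray". If none of the steps in pipe use joblib internally, that may explain why no parallelism is observed.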