Is sklearn distributed pipeline fit possible in Ray?

Peter_Pirog · March 15, 2022, 7:45pm

I would like to use an sklearn pipeline with a Ray cluster to run the computation in parallel. I found the example "Distributed Scikit-learn / Joblib" in the Ray 2.0.0.dev0 documentation.

I tried the code below, but it doesn't run in parallel:

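import joblib
import pandas as pd
from ray.util.joblib import register_ray

register_ray()  # register Ray as a joblib backend

# filepath, sep, encoding and pipe (the sklearn pipeline) are defined elsewhere
with joblib.parallel_backend('ray'):
    df = pd.read_csv(filepath, sep=sep, encoding=encoding, on_bad_lines='skip', low_memory=False)
    y = df.pop('target')
    X = df.copy()
    out = pipe.fit_transform(X, y)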

If I use import modin.pandas as pd instead, the fit method fails with an error saying that X and y are not pandas DataFrame types.
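For comparison, here is a minimal sketch of the kind of workload a joblib backend can actually spread out. As far as I understand, joblib.parallel_backend only affects code that goes through joblib internally (for example estimators or searches given n_jobs), while a plain Pipeline.fit_transform runs its steps one after another. The synthetic data, the step names, the parameter grid and the helper fit_search are all made up for illustration and are not taken from the post. The backend is a parameter so the sketch can also be tried locally with joblib's built-in 'threading' backend.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_search(backend="ray"):
    """Fit a small grid search over a pipeline under the given joblib backend.

    Hypothetical helper for illustration only.
    """
    if backend == "ray":
        # Imported lazily so the sketch also runs where Ray is not installed.
        from ray.util.joblib import register_ray
        register_ray()

    # Synthetic stand-in for the CSV data in the post (assumption).
    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ])

    # n_jobs=-1 makes GridSearchCV dispatch its fits through joblib,
    # which is the part the selected backend can distribute.
    search = GridSearchCV(pipe, {"clf__max_depth": [2, 4]}, cv=3, n_jobs=-1)

    with joblib.parallel_backend(backend):
        search.fit(X, y)
    return search
```

Calling fit_search("ray") on a machine with Ray installed should send the cross-validation fits to Ray workers, while fit_search("threading") runs the same code with local threads.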
