Scikit Learn Distributed support for Ray Train

Sameer_Memon · May 8, 2023, 5:50am

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

I am fairly new to the Ray Train library and am looking into it for a time series forecasting task. While researching, I found that it doesn’t support distributed computing for scikit-learn. I was wondering what the reasoning is and whether support is planned for the future.

Any help would be appreciated.

Yard1 · May 8, 2023, 4:51pm

Can you explain what you mean by “Scikit Learn libraries for distributed computing”? What libraries in particular are you talking about?

Base scikit-learn is supported in Train through SklearnTrainer; see “Training a model with Sklearn” in the Ray 2.4.0 documentation.

Sameer_Memon · May 9, 2023, 6:36am

Hi @Yard1! I’m attaching an image below from the Ray AIR documentation which states that the Scikit Learn Trainer isn’t distributed. I’m also attaching the link for the same. If you could please explain that to me it would be really helpful.

https://docs.ray.io/en/latest/ray-air/trainers.html#air-trainers-other

Yard1 · May 9, 2023, 6:53am

The trainer does not conduct distributed training because scikit-learn does not contain any implementations that support distributed training. The only form of parallelism in scikit-learn is limited to a single node and depends on the algorithm in question. While it is technically possible to distribute joblib (the parallel backend some scikit-learn algorithms use, e.g. random forest) with the Ray backend, this usually brings little if any performance benefit and often causes dramatically higher memory usage.
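For reference, a minimal sketch of what running joblib on the Ray backend looks like. The helper name `fit_with_backend` is hypothetical and only for illustration; `register_ray` from `ray.util.joblib` and `joblib.parallel_backend` are the pieces that select the backend. This assumes Ray, joblib and scikit-learn are installed, and it does not change the caveat above: it may not be faster and can use considerably more memory.

```python
def fit_with_backend(estimator, X, y, backend=None):
    """Fit a scikit-learn style estimator, optionally under a joblib backend.

    `fit_with_backend` is a hypothetical helper, not part of Ray or scikit-learn.
    backend=None fits directly; backend="ray" sends joblib work to a Ray cluster.
    """
    if backend is None:
        return estimator.fit(X, y)

    # Imported lazily so the plain path works without joblib or Ray installed.
    import joblib

    if backend == "ray":
        from ray.util.joblib import register_ray

        register_ray()  # makes "ray" available as a joblib backend name

    # Only estimators that parallelize through joblib (n_jobs) are affected.
    with joblib.parallel_backend(backend):
        return estimator.fit(X, y)
```

For example, `fit_with_backend(RandomForestClassifier(n_jobs=-1), X, y, backend="ray")` would build the trees as Ray tasks, while an estimator that has no joblib parallelism would simply run as before.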

In other words, scikit-learn would need to implement data-parallel strategies (as seen with e.g. distributed XGBoost or PyTorch) or model-parallel strategies (e.g. PyTorch) for us to actually be able to distribute it.

Sameer_Memon · May 9, 2023, 7:09am

Thank you @Yard1, that makes a lot of sense and puts things in perspective. I’m actually working on a time series and demand forecasting problem and trying to create a pipeline that is as scalable as possible with Ray. Given my limited knowledge of Ray, I’m still figuring out the best ways to use the available AIR libraries. I’m looking to use the ARIMA model from statsmodels. Is there any way I can distribute the training for that? Any other suggestions on the problem statement would also be greatly appreciated. TIA

Yard1 · May 15, 2023, 5:17pm

I’d check out Nixtla/statsforecast on GitHub (“Lightning ⚡️ fast forecasting with statistical and econometric models”); it has a Ray integration. That being said, the ARIMA algorithm itself cannot be distributed (I think there was a paper about a distributed implementation, but I haven’t seen any library implement it), so the only way to parallelize it is to simply fit multiple time series in parallel.
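A rough sketch of that "one model per series" approach with plain Ray tasks and statsmodels, in case it helps. The function names, the `use_ray` switch and the `(1, 1, 1)` order are assumptions made for illustration, not anything from statsforecast or Ray AIR. Each series is fitted independently, so the parallelism is across series rather than inside a single ARIMA fit.

```python
from typing import Callable, Dict, List, Sequence


def arima_forecast(values: Sequence[float], steps: int) -> List[float]:
    """Fit one ARIMA model to one series and forecast `steps` points ahead."""
    # Imported lazily so that only the processes that fit models need statsmodels.
    from statsmodels.tsa.arima.model import ARIMA

    # The (1, 1, 1) order is an arbitrary placeholder; choose it per dataset.
    fitted = ARIMA(list(values), order=(1, 1, 1)).fit()
    return [float(x) for x in fitted.forecast(steps=steps)]


def forecast_many(
    series_by_id: Dict[str, Sequence[float]],
    steps: int,
    fit_fn: Callable[[Sequence[float], int], List[float]] = arima_forecast,
    use_ray: bool = True,
) -> Dict[str, List[float]]:
    """Run `fit_fn` once per series, as Ray tasks or sequentially."""
    if not use_ray:
        # Sequential path, handy for debugging on a laptop.
        return {key: fit_fn(vals, steps) for key, vals in series_by_id.items()}

    import ray

    ray.init(ignore_reinit_error=True)  # connect to or start a Ray runtime
    remote_fit = ray.remote(fit_fn)     # wrap the function as a Ray task

    keys = list(series_by_id)
    # One task per series; Ray schedules them across the available CPUs/nodes.
    refs = [remote_fit.remote(series_by_id[key], steps) for key in keys]
    return dict(zip(keys, ray.get(refs)))
```

Calling `forecast_many({"store_1": sales_1, "store_2": sales_2}, steps=7)` would return seven forecast values per series. With many short series, batching several series into each task would likely reduce scheduling overhead.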
