Scikit Learn Distributed support for Ray Train

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I am fairly new to the Ray Train library and am looking into it for a time series forecasting task. While researching, I found that it doesn’t support the Scikit Learn libraries for distributed computing. I was wondering what the reasoning is and whether support is planned for the future.

Any help would be appreciated.

Can you explain what you mean by “Scikit Learn libraries for distributed computing”? What libraries in particular are you talking about?

Base scikit-learn is supported in Train through SklearnTrainer (see “Training a model with Sklearn” in the Ray 2.4.0 documentation).

Hi @Yard1! I’m attaching an image below from the Ray AIR documentation which states that the Scikit Learn Trainer isn’t distributed. I’m also attaching the link for the same. If you could please explain that to me it would be really helpful.


https://docs.ray.io/en/latest/ray-air/trainers.html#air-trainers-other

The trainer does not conduct distributed training because scikit-learn does not contain any implementations that support it. The only form of parallelism in scikit-learn is limited to a single node and depends on the algorithm in question. While it is technically possible to distribute joblib (the parallel backend some scikit-learn algorithms use, e.g. random forest) with the Ray backend, this usually brings little if any performance benefit and often causes dramatically higher memory usage.

In other words, scikit-learn would need to implement data-parallel strategies (as seen with e.g. distributed XGBoost or PyTorch) or model-parallel strategies (e.g. PyTorch) for us to actually be able to distribute it.

Thank you @Yard1, that makes a lot of sense and puts things in perspective. I’m actually working on a time series and demand forecasting problem and trying to create a pipeline that is as scalable as possible with Ray. Given my limited knowledge of Ray, I’m still figuring out the best ways to use the available AIR libraries. I’m looking to use the ARIMA model from statsmodels. Is there any way I can distribute the training for it? Any other suggestions on the problem statement would also be greatly appreciated. TIA

I’d check out Nixtla/statsforecast on GitHub (“Lightning ⚡️ fast forecasting with statistical and econometric models”); it has a Ray integration. That being said, the ARIMA algorithm itself cannot be distributed (I think there was a paper about a distributed implementation, but I haven’t seen any library implement it), so the only way to parallelize it is to fit multiple time series in parallel.