What are the benefits of integrating Ray and Spark?


I am new to Ray; I have previously done some work based on Apache Spark. Recently, I found that some researchers are integrating Spark and Ray for distributed machine learning, and I am interested in this.

I want to know whether the integration of Spark and Ray is realized by Spark launching multiple independent Ray processes inside each executor, as in RayOnSpark developed by Intel. And what is the benefit of these projects?

Can anyone please provide some thoughts on this?

The main benefit of these projects is that they allow you to easily use Spark as part of a larger Ray program. For example, you could use Spark to load and featurize data, put it into the Ray object store, and then feed it into your ML training pipeline natively with Python code, instead of materializing it to external storage…

@sangcho or @ericl may also like to chime in here with some more details!


There are different options for integrating Ray and Spark. RayOnSpark allows you to run Ray programs on your existing Spark cluster. Another option is RayDP, which runs Spark on Ray: RayDP uses Ray as the substrate and runs Spark and other ML/DL frameworks on top of it. This makes it simple to build distributed end-to-end data analytics and AI pipelines, using Spark for data preprocessing and integrating with the ML/DL frameworks available on Ray (RaySGD, Horovod, XGBoost, etc.), with efficient data exchange through Ray's object store.