What is the difference between Ray and Spark?

What are the differences between Ray and Spark in terms of performance, ease of use, and applicability?
Which one should I use (or is suggested to use) for a machine learning task (based on Isolation Forest) on a very large number of samples?

1 Like

Even though I only use Ray for RLlib, I'll try to help. I suggest using Spark, mainly because it is well optimized for database-style operations and can optimize queries by reordering operations. Because your use case is Isolation Forest on a large amount of data, Spark can and should apply many of these optimizations.
You could probably do similar optimizations in Ray, but I suspect it would take you more effort.

I hope that helps. If someone else has additions,…

1 Like

You can think of Ray as a lower-level distributed execution engine than Spark. For example, it is possible to run Spark on top of Ray: RayDP (Spark on Ray) — Ray v1.2.0

That said, Ray gives you more flexibility to implement various kinds of distributed systems code. Its programming style is very similar to that of a single-process program (functions and classes, which correspond to tasks and actors).

Usually, Ray achieves ease of use through libraries developed on top of it. For example, RLlib, Tune, and Ray Serve are all implemented on top of Ray, and each provides its own high-level API.

Spark is a more specialized system. As @Sertingolix mentioned, it has higher-level APIs.

Also, @rliaw, do you know the best way to achieve what he wants to do using Ray?

2 Likes

But then we miss the opportunity to use scikit-learn, Dask, NetworkX, and other Python libs, am I right?

You can use PySpark, and depending on your use case you can use the MLlib library provided by Spark. Other Python modules can also be used, but they may not take advantage of the cluster setup.

As a postprocessing step you can definitely use all of them. If you want to experiment a lot and want fine-grained access, you should use Ray, as @sangcho proposed. Especially if each sample is relatively small and the samples do not depend on each other, you can parallelize more easily, but obtaining global properties becomes harder.

Do you need global properties (over all samples)?

1 Like

I am new to Ray and Spark. Actually, I want to extract some features (such as clustering coefficient, PageRank, etc.) from the graph's nodes and edges for each edge in a big graph, and then train a machine learning model on the feature vectors corresponding to each edge.
Both my graph-embedding task and my machine learning task involve big data, since the graph has many edges, on the order of billions.
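For reference, this is roughly what such per-edge features look like on a single machine with NetworkX. It is only a small-scale sketch: the toy graph and the feature layout in `edge_features` are assumptions for illustration, and plain NetworkX will not handle billions of edges.

```python
import networkx as nx

# Tiny toy graph standing in for the real edge list.
graph = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# Node-level features, computed once for the whole graph.
clustering = nx.clustering(graph)  # local clustering coefficient per node
pagerank = nx.pagerank(graph)      # PageRank score per node


def edge_features(u, v):
    # Hypothetical layout: the node features of both endpoints, concatenated.
    return [clustering[u], pagerank[u], clustering[v], pagerank[v]]


features = {(u, v): edge_features(u, v) for u, v in graph.edges()}
```

Note that PageRank is a global property of the graph, so it cannot be computed edge by edge in isolation.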

1 Like

(As a reference)

Ray is Python-native, and all Python libraries are usable. For Dask and scikit-learn, there are already ways to use them with Ray:

Dask on Ray: Dask on Ray — Ray v1.2.0
Scikit-learn on Ray: Distributed Scikit-learn / Joblib — Ray v2.0.0.dev0

There’s no API to distribute NetworkX, but you can definitely use it with Ray's abstractions.

Please keep the conversation going, by the way :)

2 Likes

Here’s a YouTube video talking about some differences.

3 Likes

In a nutshell, Ray is the asynchronous-execution alternative to Spark's synchronous distributed-execution engine.

3 Likes