What is the difference between Ray and Spark?

What are the differences between Ray and Spark in terms of performance, ease of use, and applicability?
Which one should I use (or is recommended) for a machine learning task (based on Isolation Forest) on a very large number of samples?

1 Like

Even though I use Ray for RLlib, I’ll try to help. I suggest using Spark, mainly because it is well optimized for database operations and can optimize database queries by reordering operations. Because your use case is Isolation Forest and you have a large amount of data, Spark can and should apply many optimizations.
You could probably do similar optimizations in Ray, but I suspect it would take more effort.

I hope that helps. If someone else has additions,…

1 Like

You can think of Ray as a lower-level distributed execution engine than Spark. For example, it is possible to run Spark on top of Ray: RayDP (Spark on Ray) — Ray v1.2.0

That said, Ray has more flexibility to implement various kinds of distributed systems code. Its programming style is very similar to a single-process program: functions and classes correspond to tasks and actors.
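To make that concrete, here is a minimal, self-contained sketch (the names are illustrative, not from any particular library):

```python
import ray

ray.init()

# A plain function becomes a distributed task with one decorator.
@ray.remote
def square(x):
    return x * x

# A plain class becomes a long-lived actor the same way.
@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.incr.remote()))  # 1
```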

Usually, Ray achieves ease of use through libraries developed on top of it. For example, RLlib, Tune, and Ray Serve are all implemented on top of Ray and provide their own high-level APIs.

Spark is a more specialized system. As @Sertingolix mentioned, it has higher-level APIs.

Also, @rliaw do you know what’s the best way to achieve what he wants to do using Ray?

2 Likes

But then we miss the opportunity to use scikit-learn, Dask, NetworkX, and other Python libraries, am I right?

You can use PySpark and, depending on your use case, the MLlib library that Spark provides. Other Python modules can also be used but may not take advantage of the cluster setup.
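For concreteness, a minimal PySpark sketch of that pattern. Note that MLlib does not ship an Isolation Forest, so KMeans stands in here, and the file path and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# Hypothetical dataset with numeric feature columns f1, f2, f3.
df = spark.read.parquet("s3://my-bucket/samples.parquet")

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
model = KMeans(k=10, featuresCol="features").fit(assembler.transform(df))
```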

As a postprocessing step you can definitely use all of them. If you want to experiment a lot and want fine-grained access, you should use Ray, as @sangcho proposed. Especially if each sample is relatively small and samples do not depend on each other, you can parallelize more easily, but obtaining global properties becomes harder.

Do you need global properties (over all samples)?

1 Like

I am new to Ray and Spark. I want to extract some features (such as clustering coefficient, PageRank, etc.) from the graph’s nodes and edges for each edge in a big graph, and then train a machine learning model on the extracted vectors corresponding to each edge.
Both my graph-embedding task and my machine learning task involve big data, since the graph has many edges, on the order of billions.

1 Like

(As a reference)

Ray is Python native, and all Python libraries are usable. For Dask and scikit-learn, there are already ways to use them with Ray:

Dask on Ray: Dask on Ray — Ray v1.2.0
scikit-learn on Ray: Distributed Scikit-learn / Joblib — Ray v2.0.0.dev0
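Following the scikit-learn / Joblib link above, a minimal sketch of how the original Isolation Forest question could be wired to Ray. The data here is a random stand-in and the sizes are only illustrative:

```python
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest
from ray.util.joblib import register_ray

register_ray()  # registers "ray" as a joblib backend

X = np.random.randn(100_000, 16)  # stand-in for the real samples

# scikit-learn's own parallelism (n_jobs) is fanned out over the Ray cluster.
with joblib.parallel_backend("ray"):
    clf = IsolationForest(n_estimators=200, n_jobs=-1).fit(X)
    scores = clf.decision_function(X)
```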

There’s no API to distribute NetworkX, but you can definitely use it within Ray’s abstractions.
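For example, a rough sketch of plain NetworkX calls running inside Ray tasks. This is illustrative only (the random graph and the partitioning scheme are placeholders) and does not by itself solve the billion-edge scale problem:

```python
import ray
import networkx as nx

ray.init()

@ray.remote
def node_metrics(graph):
    # Whole-graph NetworkX computations inside a Ray task.
    return nx.pagerank(graph), nx.clustering(graph)

@ray.remote
def edge_vectors(edges, pr, cc):
    # Per-edge feature assembly, parallelized across edge partitions.
    return [(u, v, pr[u], pr[v], cc[u], cc[v]) for u, v in edges]

G = nx.fast_gnp_random_graph(1_000, 0.01)  # stand-in for the real graph
pr, cc = ray.get(node_metrics.remote(G))

edges = list(G.edges())
chunks = [edges[i::8] for i in range(8)]   # 8 partitions of edges
features = ray.get([edge_vectors.remote(c, pr, cc) for c in chunks])
```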

Please keep the conversation going, btw :)!

2 Likes

Here’s a YouTube video talking about some differences.

3 Likes

In a nutshell, Ray is the asynchronous-execution alternative to Spark’s synchronous distributed-execution engine.

3 Likes

This question was asked in 2021, and lots has changed since then.

Now that Ray has the “Ray Data” library built on top of Ray, perhaps the more relevant comparison is how Spark compares to Ray Data.

At a high level, I would say:

  • Spark is more for tabular data, Ray is more for unstructured / multimodal data
  • Spark is built for CPUs, Ray is built for heterogeneous compute (especially mixed CPUs / GPUs)
  • Spark is designed for SQL / analytics, Ray is designed for AI

So, if you are running a SQL-like workload or tabular data analytics on CPUs, then Spark is a great choice. If you are doing batch inference or AI on multimodal data (video, images, text, etc) on CPUs & GPUs, then Ray will be a great choice.

At a technical level, Ray adopts a “streaming batch” approach (see this paper), where batches are the unit of processing and are streamed through the different stages of the pipeline. This avoids the need to materialize all the data in memory between stages (as Spark does) and allows for pipelining between stages that use different resource types (e.g., CPUs and GPUs). There are all sorts of interesting subtleties here, e.g., the appropriate batch sizes for a CPU stage and a GPU stage may be different (perhaps the CPU stage is memory bound and needs a small batch size, while the GPU stage is compute bound and requires a larger batch size).
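As a rough illustration of that pattern with Ray Data: the bucket paths are made up, TorchPredictor is a placeholder for a real model wrapper, and the batch sizes and concurrency are only meant to show that each stage can be tuned independently (newer Ray versions accept `concurrency=` for actor-based stages):

```python
import numpy as np
import ray

# Hypothetical input location; Ray Data reads it lazily as a stream of blocks.
ds = ray.data.read_images("s3://my-bucket/images/", size=(224, 224))

# CPU-bound preprocessing stage: a smaller batch size keeps memory in check.
def preprocess(batch: dict) -> dict:
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch

# GPU-bound inference stage: a larger batch size keeps the GPU busy.
class TorchPredictor:
    def __init__(self):
        import torch
        self.model = torch.nn.Identity().cuda()  # placeholder for a real model

    def __call__(self, batch: dict) -> dict:
        import torch
        with torch.no_grad():
            x = torch.as_tensor(batch["image"], device="cuda")
            batch["pred"] = self.model(x).cpu().numpy()
        return batch

preds = (
    ds.map_batches(preprocess, batch_size=64)                      # CPU stage
      .map_batches(TorchPredictor, batch_size=256, num_gpus=1,
                   concurrency=2)                                  # GPU actor pool
)
preds.write_parquet("s3://my-bucket/predictions/")
```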