Ray is not meant as a general ETL tool

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

In the documentation and the FAQ I see comments like the subject line: that Ray is not meant for general ETL, not meant to replace Spark, and so on. After reading through most of the documentation and starting to use the library, I don't understand why this is repeated so often. These statements never come with reasons, and the lack of rationale makes me think they should either not be in technical documentation or the underlying reasons should be provided. Is it because Datasets only recently gained the ability to spill to disk? Is it the lack of connectors to databases and easy loading into a DataFrame? Or is it just that this is not a use case the library was designed to handle, so the statement has no fundamental basis in the architecture per se, but is simply a statement of intent about why Ray was built?

Thanks,
Patrick


In the documentation and the FAQ I see comments like the subject line: that Ray is not meant for general ETL, not meant to replace Spark, and so on. After reading through most of the documentation and starting to use the library, I don't understand why this is repeated so often.

Ray is a general-purpose framework and is not tied to a specific logical data structure such as a DataFrame (Spark, pandas, Dask). It does not have a high-level domain-specific language (DSL) suited to, and preferable for, doing EDA and ETL on such a data structure. Spark was built to address that workload (SQL and streaming) using the DataFrame (built atop RDDs); Ray is not aimed at that workload.
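For concreteness, here is a minimal sketch of what that general-purpose surface looks like in practice: plain Python functions and classes become distributed tasks and actors, and you compose the computation yourself rather than expressing it against a DataFrame. This is my own illustration, not code from the Ray docs:

```python
import ray

ray.init()  # connect to (or start) a Ray cluster

@ray.remote
def square(x):
    # a stateless task, scheduled on any worker in the cluster
    return x * x

@ray.remote
class Counter:
    # an actor: stateful computation pinned to one worker process
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
```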

These statements never come with reasons, and the lack of rationale makes me think they should either not be in technical documentation or the underlying reasons should be provided.

Building scalable data pipelines (ETL workloads that require SQL and streaming) falls into the realm of Spark. Scalable machine learning workloads (distributed training, many-model training/tuning, batch inference at scale, many-model serving, reinforcement learning) fall into the realm of Ray.

Is it because Datasets only recently gained the ability to spill to disk? Is it the lack of connectors to databases and easy loading into a DataFrame?

Ray Data is meant for last-mile ML preprocessing and data ingestion into your downstream ML training, tuning, or serving. You can write basic preprocessors or transformations (think of how PyTorch provides transforms for your tensors).
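Concretely, that last-mile pattern looks roughly like this (a sketch assuming Ray 2.x Data APIs; the S3 path, column name, and normalize() transform are placeholders of mine):

```python
import ray

# read already-cleansed data from your data store
ds = ray.data.read_parquet("s3://my-bucket/curated/train/")  # placeholder path

def normalize(batch):
    # batch arrives as a dict of NumPy arrays, one entry per column
    batch["feature"] = (batch["feature"] - batch["feature"].mean()) / batch["feature"].std()
    return batch

ds = ds.map_batches(normalize, batch_format="numpy")

# hand batches to the downstream training loop
for batch in ds.iter_batches(batch_size=1024):
    pass  # train on the batch here
```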

More connectors for Ray Data are on the roadmap, to extract or read data from these data stores. But it does not have the full-fledged DSL API that Spark DataFrames or Dask DataFrames afford. You can do basic map, iter_batches, filter, groupby, etc. operations on Ray Data, not the full-blown API accorded to a DataFrame. For that, you can use Modin on Ray to scale a large pandas DataFrame across the cluster, as sketched below.
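If you do want the full pandas API spread over the cluster, the Modin route looks roughly like this (again a sketch; it assumes `pip install "modin[ray]"`, and the path is a placeholder):

```python
import modin.pandas as pd  # drop-in replacement for pandas, partitioned on Ray

df = pd.read_csv("s3://my-bucket/big.csv")      # placeholder path
print(df.groupby("key")["value"].sum().head())  # familiar pandas semantics, at scale
```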

Or is it just that this is not a use case the library was designed to handle, so the statement has no fundamental basis in the architecture per se, but is simply a statement of intent about why Ray was built?

Ray was built as a general-purpose library, with primitive abstractions for writing distributed applications. It was not meant to do ETL at scale, i.e., to build scalable data pipelines.

I think of a Ray developer's journey as beginning once all your data has been cleansed, put into a data store (a lakehouse or Snowflake), and is ready to be used for training. Fundamentally, ETL (and EDA) is a different workload, and Ray was never meant for it.


Thanks, great answer!

You're welcome. Shall I close this as resolved?

Yes, this is resolved.

Hi @Jules_Damji

I've noticed you did not mention random shuffle of a dataset, which is also "traditionally" done with Spark. Here, Ray offers a (specialized) alternative, and with push-based shuffle, even more so.

A penny for your thoughts on that, at least for the sake of completeness of this thread.

Thanks

fyi –

@Jules_Damji

@harelwa Ray Data allows you to do a global shuffle as well as a per-epoch shuffle, which is more catered to training.
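Roughly, and assuming Ray 2.x Data APIs, the two flavors look like this (my sketch, not official example code):

```python
import ray

ds = ray.data.range(100_000)

# global shuffle: a full shuffle of all blocks across the cluster
# (the docs describe enabling push-based shuffle via the
# RAY_DATA_PUSH_BASED_SHUFFLE=1 environment variable)
ds = ds.random_shuffle()

for epoch in range(2):
    # cheaper approximate per-epoch shuffle, re-randomized at iteration time
    for batch in ds.iter_batches(batch_size=256, local_shuffle_buffer_size=10_000):
        pass  # feed the batch to the training loop
```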

I'm the Quokka author. Thanks for the shoutout @harelwa! I need to hang out here more often.

A couple of points, @Patrick_S:

  • Quokka is built on Ray. It uses Polars/Arrow/DuckDB for single-node compute.
  • It gets competitive performance with SparkSQL on EMR on all of TPC-H except query 21, where I run out of memory. In fact, for most queries I am around 3x faster. The GitHub README is not up to date, since the performance is improving every day.
  • To the above point, I don't yet spill to disk, though work on that is definitely ongoing.
  • Unfortunately I only support the DataFrame API right now. SQL is supported but not advertised, because it's easy to shoot yourself in the foot with weird syntax that my query optimizer might fail on.
  • We are exploring integrating Velox to provide even faster SQL support.

If you have any questions, please let me know!


@marsupialtail I'd love to hear about your Ray experience. Perhaps a blog on how you use Ray, or a meetup talk? Let me know. Join our Ray Meetup and see how we ask community members to share their use of Ray, whether use cases or frameworks they build atop Ray.

I believe I am speaking at Ray Summit later this year.