Ray is not meant as a general ETL tool

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

In the documentation and the FAQ I see comments like the subject line: that Ray is not meant for general ETL, not meant to replace Spark, and so on. After reading through most of the documentation and starting to use the library, I don't understand why this is repeated so often. These statements never come with reasons, and the lack of rationale makes me think they should either not be in technical documentation or the underlying reasons should be provided. Is it because Datasets only recently gained the ability to spill to disk? Is it the lack of connectors to databases and easy loading into a DataFrame? Or is it just that this is not a use case the library was designed to handle, so the statement has no fundamental basis in the architecture per se, but is simply a statement of intent about why Ray was built?

Thanks,
Patrick


In the documentation and the FAQ I see comments like the subject line: that Ray is not meant for general ETL, not meant to replace Spark, and so on. After reading through most of the documentation and starting to use the library, I don't understand why this is repeated so often.

Ray is a general-purpose framework and is not tied to a specific logical data structure such as a DataFrame (Spark, pandas, Dask). It does not have a high-level domain-specific language (DSL) suited to, and preferable for, doing EDA and ETL on such a data structure. Spark was built to address that workload (SQL and streaming) using the DataFrame (built atop RDDs); Ray is not aimed at that workload.
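For concreteness, here is a minimal sketch of what that general-purpose surface looks like in practice: plain Python functions and classes become distributed tasks and actors, and you compose the computation yourself rather than expressing it against a DataFrame. This is my own illustration, not code from the Ray docs:

```python
import ray

ray.init()  # connect to (or start) a Ray cluster

@ray.remote
def square(x):
    # a stateless task, scheduled on any worker in the cluster
    return x * x

@ray.remote
class Counter:
    # an actor: stateful computation pinned to one worker process
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
```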

These statements never come with reasons, and the lack of rationale makes me think they should either not be in technical documentation or the underlying reasons should be provided.

Building scalable data pipelines (ETL workloads that require SQL and streaming) falls into the realm of Spark. Scalable machine learning workloads (distributed training, many-model training/tuning, batch inference at scale, many-model serving, reinforcement learning) fall into the realm of Ray.

Is it because Datasets only recently gained the ability to spill to disk? Is it the lack of connectors to databases and easy loading into a DataFrame?

Ray Data is meant for last-mile ML preprocessing and data ingestion into your downstream ML training, tuning, or serving. You can write basic preprocessors or transformations (think of how PyTorch provides transforms for your tensors).
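Concretely, that last-mile pattern looks roughly like this (a sketch assuming Ray 2.x Data APIs; the S3 path, column name, and normalize() transform are placeholders of mine):

```python
import ray

# read already-cleansed data from your data store
ds = ray.data.read_parquet("s3://my-bucket/curated/train/")  # placeholder path

def normalize(batch):
    # batch arrives as a dict of NumPy arrays, one entry per column
    batch["feature"] = (batch["feature"] - batch["feature"].mean()) / batch["feature"].std()
    return batch

ds = ds.map_batches(normalize, batch_format="numpy")

# hand batches to the downstream training loop
for batch in ds.iter_batches(batch_size=1024):
    pass  # train on the batch here
```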

More connectors for Ray Data are on the roadmap, to extract or read data from these data stores. But it does not have the full-fledged DSL API that Spark DataFrames or Dask DataFrames afford. You can do basic map, iter_batches, filter, groupby, etc. operations on Ray Data, not the full-blown API accorded to a DataFrame. For that, you can use Modin on Ray to scale a large pandas DataFrame across the cluster, as sketched below.
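If you do want the full pandas API spread over the cluster, the Modin route looks roughly like this (again a sketch; it assumes `pip install "modin[ray]"`, and the path is a placeholder):

```python
import modin.pandas as pd  # drop-in replacement for pandas, partitioned on Ray

df = pd.read_csv("s3://my-bucket/big.csv")      # placeholder path
print(df.groupby("key")["value"].sum().head())  # familiar pandas semantics, at scale
```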

Or is it just that this is not a use case the library was designed to handle, so the statement has no fundamental basis in the architecture per se, but is simply a statement of intent about why Ray was built?

Ray was built as a general-purpose library, with primitive abstractions for writing distributed applications. It was not meant to do ETL at scale, i.e., to build scalable data pipelines.

I think of a Ray developer's journey as beginning once all your data has been cleansed, put into a data store (a lakehouse or Snowflake), and is ready to be used for training. Fundamentally, ETL (and EDA) is a different workload, and Ray was never meant for it.


Thanks, great answer!

You're welcome. Shall I close this as resolved?

Yes, this is resolved.

Hi @Jules_Damji

I've noticed you did not mention random shuffle of a dataset, which is also "traditionally" done with Spark. Here, Ray offers a (specialized) alternative, and with push-based shuffle, even more so.

A penny for your thoughts on that, at least for the sake of completeness of this thread.

Thanks

fyi –

@Jules_Damji

@harelwa Ray Data allows you to do a global shuffle as well as a per-epoch shuffle, which is more catered to training.
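Roughly, and assuming Ray 2.x Data APIs, the two flavors look like this (my sketch, not official example code):

```python
import ray

ds = ray.data.range(100_000)

# global shuffle: a full shuffle of all blocks across the cluster
# (the docs describe enabling push-based shuffle via the
# RAY_DATA_PUSH_BASED_SHUFFLE=1 environment variable)
ds = ds.random_shuffle()

for epoch in range(2):
    # cheaper approximate per-epoch shuffle, re-randomized at iteration time
    for batch in ds.iter_batches(batch_size=256, local_shuffle_buffer_size=10_000):
        pass  # feed the batch to the training loop
```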

I'm the Quokka author. Thanks for the shoutout @harelwa! I need to hang out here more often.

A couple of points, @Patrick_S:

  • Quokka is built on Ray. It uses Polars/Arrow/DuckDB for single-node compute.
  • It gets competitive performance with SparkSQL on EMR on all of TPC-H except query 21, where I run out of memory. In fact, for most queries I am around 3x faster. The GitHub README is not up to date, since the performance is improving every day.
  • To the above point, I don't yet spill to disk, though work on that is definitely ongoing.
  • Unfortunately I only support the DataFrame API right now. SQL is supported but not advertised, because it's easy to shoot yourself in the foot with weird syntax that my query optimizer might fail on.
  • We are exploring integrating Velox to provide even faster SQL support.

If you have any questions, please let me know!


@marsupialtail I'd love to hear about your Ray experience. Perhaps a blog on how you use Ray, or a meetup talk? Let me know. Join our Ray Meetup and see how we ask community members to share their use of Ray, whether use cases or frameworks they build atop Ray.

I believe I am speaking at Ray Summit later this year.