In the documentation and the FAQ I see comments like the subject line (that Ray is not meant for general ETL, not meant to replace Spark, etc.), but after reading through most of the documentation and starting to use the library, I don't understand why this is repeated so often.
Ray is a general-purpose framework and is not tied to a specific logical data structure such as a DataFrame (Spark, pandas, Dask). It does not have a high-level domain-specific language (DSL) suited to doing EDA and ETL on such a structure. Spark was built to address that workload (SQL and streaming) using its DataFrame (built atop the RDD); Ray is not suitable for this workload.
There are never any reasons given for these statements, and the lack of rationale makes me think they should either not be in technical documentation or the underlying reasons should be provided.
Building scalable data pipelines (ETL workloads that require SQL and streaming) falls into the realm of Spark. Scalable machine learning workloads (distributed training, many-model training/tuning, batch inference at scale, many-model serving, reinforcement learning) fall into the realm of Ray.
Is it because Datasets only recently could spill to disk? Is it the lack of connectors to databases and of easy loading into a DataFrame?
Ray Data is meant for last-mile ML preprocessing and data ingestion into your downstream ML training, tuning, or serving. You can write basic preprocessors or transformations (think of how PyTorch provides transforms for your tensors).
More and more connectors for Ray Data are on the roadmap, to extract or read data from these data stores. But it does not have the full-fledged DSL API that a Spark or Dask DataFrame affords. You can do basic map, batch_iter, filter, groupby, etc. operations on Ray Data, not the full-blown API accorded to a DataFrame. For that you can use Modin on Ray, which extends a large pandas DataFrame over the cluster.
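To make the distinction concrete, here is a plain-Python stand-in (not Ray code, and not Ray's API) that only mirrors the *shape* of the basic operations named above, map, filter, and groupby, over an in-memory list of records. The function names are illustrative; the point is that these row/batch-level operations cover last-mile preprocessing, while a DataFrame DSL layers joins, window functions, and SQL on top, which Ray Data does not attempt.

```python
from collections import defaultdict

def map_records(rows, fn):
    # Apply a row-wise transformation -- a basic preprocessing step.
    return [fn(r) for r in rows]

def filter_records(rows, pred):
    # Keep only the rows that satisfy the predicate.
    return [r for r in rows if pred(r)]

def group_records(rows, key):
    # Bucket rows by a key. A full DataFrame DSL would add joins,
    # window functions, SQL, etc. on top of primitives like this.
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return dict(groups)

rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 0},
    {"user": "a", "clicks": 5},
]

# A toy "last-mile" pipeline: normalize, drop empty rows, group by user.
normalized = map_records(rows, lambda r: {**r, "clicks": r["clicks"] / 5})
active = filter_records(normalized, lambda r: r["clicks"] > 0)
by_user = group_records(active, "user")
```

For anything past this level of expressiveness (SQL, complex joins, a pandas-style API), the answer above points you to Spark or to Modin on Ray rather than Ray Data itself.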
Or is it just that this is not the use case the library was designed to handle, so the statement has no fundamental basis in the architecture per se, but is simply a statement of fact about the intention and motivation for building Ray?
Ray was built as a general-purpose library, with primitive abstractions for writing distributed applications. It was not meant for ETL at scale, i.e., building scalable data pipelines.
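The "primitive abstractions" mentioned above boil down to a submit-a-function, get-a-future pattern. The stdlib sketch below illustrates that pattern with a thread pool; it is only an analogy, not Ray's API (in Ray the equivalent would be decorating `square` with `@ray.remote` and collecting results with `ray.get`, with the work scheduled across a cluster instead of local threads).

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submitting work returns futures immediately -- analogous to
    # calling a @ray.remote function and receiving ObjectRefs.
    futures = [pool.submit(square, i) for i in range(4)]
    # Blocking on the results -- analogous to ray.get(refs).
    results = [f.result() for f in futures]

print(results)  # -> [0, 1, 4, 9]
```

Nothing in this primitive knows about tables, schemas, or SQL, which is why a DataFrame DSL has to be built as a separate layer (as Spark did atop RDDs) rather than falling out of Ray's core for free.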
I think of a Ray developer's journey as beginning after all your data has been cleansed, put into a data store (a lakehouse or Snowflake), and made ready for training. Fundamentally, ETL (and EDA) is a different workload, and it was never meant for Ray.