Benchmarks for Ray Data?

Similar to my other question regarding benchmarks for Ray Serve, are there benchmarks, either published or in the works, that compare Ray Data to TF Transform, Dataflow, or other preprocessing solutions?

Hi @jinnovation, great question, and good timing. Benchmarking is one of the key focus areas for Ray AIR with Ray Dataset in Ray 2.1 and 2.2, and we’re actively testing and improving it on a weekly basis.

You should hear from us soon :slight_smile:


Hi @Jiao_Dong, are there any benchmark results for Ray Dataset now?

Hi @jinnovation @loneystar1983, thanks for your interest in Ray Data!

We are actively working on this. For now, we have some results from the two most recent Ray Enhancement Proposals:


Thanks @zhz! These proposals are very helpful; I can preprocess my data faster now. Also, is there a comparison between Ray Data and Spark DataFrame?

Some benchmarks are available here: Developer Preview: Ray Data Streaming Execution (Google Docs)

We are working on a benchmark study that involves Spark DataFrames and will share the results in 3 to 5 weeks.

@jinnovation Stay tuned. We will be publishing these benchmarks as blog posts in the coming weeks. Can you close this issue, since we are in the process of sharing our findings with the community?