How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I want to use Ray Data to run inference, and my data source is in TFRecord format. The test file contains 600,000+ records.
My test script is as follows:
import ray
ray.init()
ds = ray.data.read_tfrecords("local://xxx-tfrecord.gz", arrow_open_stream_args={"compression": "gzip"})
print(ds.count())
I found that it takes more than 2 hours to finish. After I added tfx_read_options, it still needs 40-50 minutes to finish:
import ray
from ray.data import TFXReadOptions  # import path assumed; may vary by Ray version

ray.init()
ds = ray.data.read_tfrecords("local://xxx-tfrecord.gz", arrow_open_stream_args={"compression": "gzip"}, tfx_read_options=TFXReadOptions())
print(ds.count())
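For reference, appending a stats dump to the script above prints a per-operator timing breakdown of where the count() execution spends its time (a small sketch, assuming Dataset.stats() is available in this Ray version):

# Appended to the script above: per-operator execution stats for the count() call.
print(ds.stats())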
However, I wrote a simple TensorFlow program like the one below:
import tensorflow as tf
records = tf.data.TFRecordDataset("./xxx-tfrecord.gz", compression_type="GZIP")
count = 0
for record in records:
    count = count + 1
print(count)
It was very fast, maybe 5-10 seconds. So I'm wondering: why does Ray Data take so long?
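For completeness, here is a minimal side-by-side timing sketch I would use to compare the two paths with wall-clock numbers (same placeholder paths as above; the TFXReadOptions import path is assumed):

import time
import ray
import tensorflow as tf
from ray.data import TFXReadOptions  # import path assumed; may vary by Ray version

ray.init()

# Time the Ray Data path (gzipped TFRecords, TFX read options enabled).
start = time.perf_counter()
ds = ray.data.read_tfrecords(
    "local://xxx-tfrecord.gz",
    arrow_open_stream_args={"compression": "gzip"},
    tfx_read_options=TFXReadOptions(),
)
print("Ray Data count:", ds.count())
print("Ray Data seconds:", time.perf_counter() - start)

# Time the plain TensorFlow baseline on the same file.
start = time.perf_counter()
records = tf.data.TFRecordDataset("./xxx-tfrecord.gz", compression_type="GZIP")
print("TFRecordDataset count:", sum(1 for _ in records))
print("TFRecordDataset seconds:", time.perf_counter() - start)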