Why is Ray Data so slow at reading TFRecords?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I want to use Ray Data for inference, and my data source is in TFRecord format. The test file contains more than 600,000 records.

My test script is as follows:

import ray

ray.init()

ds = ray.data.read_tfrecords("local://xxx-tfrecord.gz", arrow_open_stream_args={"compression": "gzip"})
print(ds.count())

I found that it takes more than 2 hours to finish. After I added tfx_read_options, it still needs 40-50 minutes to finish.

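import ray
# Import path may differ between Ray versions.
from ray.data.datasource.tfrecords_datasource import TFXReadOptions

ray.init()

ds = ray.data.read_tfrecords("local://xxx-tfrecord.gz", arrow_open_stream_args={"compression": "gzip"}, tfx_read_options=TFXReadOptions())
print(ds.count())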

However, I wrote a standalone program like the one below:

import tensorflow as tf

records = tf.data.TFRecordDataset("./xxx-tfrecord.gz", compression_type="GZIP")
count = 0
for record in records:
    count = count + 1
print(count)

It was very fast, maybe 5-10 seconds. So I'm wondering why Ray Data takes so long?
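
To help separate gzip decompression cost from record parsing cost, the same count can also be taken with neither Ray nor TensorFlow. The following is only a minimal sketch, not how either library reads the file: count_tfrecords is a hypothetical helper name, it relies on the TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC), and it deliberately skips CRC verification and payload parsing.

```python
import gzip
import struct


def count_tfrecords(path: str) -> int:
    """Count records in a gzip-compressed TFRecord file.

    Hypothetical helper for baseline timing only: CRCs are NOT verified
    and payloads are NOT parsed.
    """
    count = 0
    with gzip.open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            # Skip the 4-byte length CRC, the payload, and the 4-byte payload CRC.
            skipped = f.read(4 + length + 4)
            if len(skipped) != length + 8:
                raise ValueError("truncated record body")
            count += 1
    return count
```

Calling count_tfrecords("./xxx-tfrecord.gz") and timing it gives a rough lower bound for a single-threaded read of the file. If that finishes in seconds, the extra time in Ray Data is probably spent somewhere other than decompression.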

Hello, have you found the cause of this problem? I ran into a similar issue when using ray.data.read_tfrecords and training with TensorflowTrainer, with train_data_tf = train_data.to_tf. According to the monitoring display, ReadTFRecord uses a lot of CPU for several hours, even though the input is only a few gigabytes of gzip files.