How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I want to use Ray Data to run inference, and my data source is in TFRecord format. The test file contains 600,000+ records.
My test script is as follows:
import ray
ray.init()
ds = ray.data.read_tfrecords("local://xxx-tfrecord.gz", arrow_open_stream_args={"compression": "gzip"})
print(ds.count())
I found that it takes more than 2 hours to finish. After I added tfx_read_options, it still needs 40-50 minutes to finish:
import ray
from ray.data import TFXReadOptions  # import path assumed; may vary by Ray version

ray.init()
ds = ray.data.read_tfrecords("local://xxx-tfrecord.gz", arrow_open_stream_args={"compression": "gzip"}, tfx_read_options=TFXReadOptions())
print(ds.count())
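For reference, appending a stats dump to the script above prints a per-operator timing breakdown of where the count() execution spends its time (a small sketch, assuming Dataset.stats() is available in this Ray version):

# Appended to the script above: per-operator execution stats for the count() call.
print(ds.stats())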
However, I wrote a simple TensorFlow program like the one below:
import tensorflow as tf
records = tf.data.TFRecordDataset("./xxx-tfrecord.gz", compression_type="GZIP")
count = 0
for record in records:
    count = count + 1
print(count)
It was very fast, maybe 5-10 seconds. So I'm wondering: why does Ray Data take so long?
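For completeness, here is a minimal side-by-side timing sketch I would use to compare the two paths with wall-clock numbers (same placeholder paths as above; the TFXReadOptions import path is assumed):

import time
import ray
import tensorflow as tf
from ray.data import TFXReadOptions  # import path assumed; may vary by Ray version

ray.init()

# Time the Ray Data path (gzipped TFRecords, TFX read options enabled).
start = time.perf_counter()
ds = ray.data.read_tfrecords(
    "local://xxx-tfrecord.gz",
    arrow_open_stream_args={"compression": "gzip"},
    tfx_read_options=TFXReadOptions(),
)
print("Ray Data count:", ds.count())
print("Ray Data seconds:", time.perf_counter() - start)

# Time the plain TensorFlow baseline on the same file.
start = time.perf_counter()
records = tf.data.TFRecordDataset("./xxx-tfrecord.gz", compression_type="GZIP")
print("TFRecordDataset count:", sum(1 for _ in records))
print("TFRecordDataset seconds:", time.perf_counter() - start)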