Cannot read parquet from S3

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a small script to test loading parquet data from S3. I wrote the data to S3 using Ray’s write_parquet function.

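import ray
from pyarrow import fs as arrow_fs

# source_s3_cfg and cfg come from my own config (S3 credentials and dataset path).
src_fs = arrow_fs.S3FileSystem(**vars(source_s3_cfg))
DATASET_PATH = cfg.dataset.s3_from_path
ds = ray.data.read_parquet(paths=DATASET_PATH, filesystem=src_fs)
# Iterate over the whole dataset as torch batches.
for batch in ds.iter_torch_batches():
    pass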

But this leads to the error:

(_fetch_metadata_serialization_wrapper pid=9390) 2022-10-17 10:39:56,188        INFO worker.py:763 -- Task failed with retryable exception: TaskID(ba47b759362fe699ffffffffffffffffffffffff01000000).
(_fetch_metadata_serialization_wrapper pid=9390) Traceback (most recent call last):
(_fetch_metadata_serialization_wrapper pid=9390)   File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
(_fetch_metadata_serialization_wrapper pid=9390)   File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
(_fetch_metadata_serialization_wrapper pid=9390)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 410, in _fetch_metadata_serialization_wrapper
(_fetch_metadata_serialization_wrapper pid=9390)     return _fetch_metadata(pieces)
(_fetch_metadata_serialization_wrapper pid=9390)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 419, in _fetch_metadata
(_fetch_metadata_serialization_wrapper pid=9390)     piece_metadata.append(p.metadata)
(_fetch_metadata_serialization_wrapper pid=9390)   File "pyarrow/_dataset.pyx", line 1315, in pyarrow._dataset.ParquetFileFragment.metadata.__get__
(_fetch_metadata_serialization_wrapper pid=9390)   File "pyarrow/_dataset.pyx", line 1304, in pyarrow._dataset.ParquetFileFragment.ensure_complete_metadata
(_fetch_metadata_serialization_wrapper pid=9390)   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
(_fetch_metadata_serialization_wrapper pid=9390) OSError: Could not open Parquet input source 'datasets-v2/150GB/61c485076b6c49df85f2fabfe4e2bacb_001466.parquet': AWS Error [code 99]: curlCode: 18, Transferred a partial file

Has anyone experienced this?
If a Ray employee DMs me, I'm happy to provide a reproduction script and the relevant credentials.

But I’m mostly hoping that this is a common error that people here have encountered before.

Hi @Vedant_Roy, can you PM me on the Ray Slack so we can reproduce this? I haven’t seen this error before, and it doesn’t seem related to Ray Datasets; it looks much more like a corrupted-file error to me.

By the way, we are Ray OSS engineers :)

Hi Jiao,

I think the issue has something to do with pyarrow interacting with Cloudflare’s R2, because when I use actual AWS S3 there is no issue. So I suspect the bug is in either pyarrow or Cloudflare R2, and not in Ray core.

I filed an issue in pyarrow.