Cannot read parquet from S3

Vedant_Roy · October 17, 2022, 10:44am

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

I have a small script to test loading parquet data from S3. I wrote the data to S3 using Ray’s write_parquet function.

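# Assumed imports; the original post did not show them.
import ray
from pyarrow import fs as arrow_fs

# source_s3_cfg and cfg come from the script's own config (not shown).
src_fs = arrow_fs.S3FileSystem(**vars(source_s3_cfg))
DATASET_PATH = cfg.dataset.s3_from_path
ds = ray.data.read_parquet(paths=DATASET_PATH, filesystem=src_fs)
for batch in ds.iter_torch_batches():
    pass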

But this leads to the error:

(_fetch_metadata_serialization_wrapper pid=9390) 2022-10-17 10:39:56,188        INFO worker.py:763 -- Task failed with retryable exception: TaskID(ba47b759362fe699ffffffffffffffffffffffff01000000).
(_fetch_metadata_serialization_wrapper pid=9390) Traceback (most recent call last):
(_fetch_metadata_serialization_wrapper pid=9390)   File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
(_fetch_metadata_serialization_wrapper pid=9390)   File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
(_fetch_metadata_serialization_wrapper pid=9390)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 410, in _fetch_metadata_serialization_wrapper
(_fetch_metadata_serialization_wrapper pid=9390)     return _fetch_metadata(pieces)
(_fetch_metadata_serialization_wrapper pid=9390)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 419, in _fetch_metadata
(_fetch_metadata_serialization_wrapper pid=9390)     piece_metadata.append(p.metadata)
(_fetch_metadata_serialization_wrapper pid=9390)   File "pyarrow/_dataset.pyx", line 1315, in pyarrow._dataset.ParquetFileFragment.metadata.__get__
(_fetch_metadata_serialization_wrapper pid=9390)   File "pyarrow/_dataset.pyx", line 1304, in pyarrow._dataset.ParquetFileFragment.ensure_complete_metadata
(_fetch_metadata_serialization_wrapper pid=9390)   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
(_fetch_metadata_serialization_wrapper pid=9390) OSError: Could not open Parquet input source 'datasets-v2/150GB/61c485076b6c49df85f2fabfe4e2bacb_001466.parquet': AWS Error [code 99]: curlCode: 18, Transferred a partial file

Has anyone experienced this?
If a Ray employee DMs me, I'm happy to provide a reproduction script and the relevant credentials.

But I’m mostly hoping that this is a common error that people here have encountered before.

Jiao_Dong · October 18, 2022, 4:49pm

Hi @Vedant_Roy, can you PM me on the Ray Slack so we can reproduce this? I haven’t seen this error before, and it doesn’t seem related to Ray Datasets; it looks much more like a corrupted-file error to me.

By the way, we are Ray OSS engineers.

Vedant_Roy · October 20, 2022, 6:00am

Hi Jiao,

I think the issue has to do with pyarrow interacting with Cloudflare’s R2, because when I use actual AWS S3, the error doesn’t occur. So I suspect the bug is in either pyarrow or Cloudflare R2, and not in Ray core.

I filed an issue in pyarrow.
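
For anyone who wants to check the same thing outside of Ray, here is a minimal sketch that opens each Parquet file's footer metadata directly through a pyarrow filesystem, which is roughly the step that fails in the traceback above. The helper names `find_unreadable` and `make_pyarrow_reader` are hypothetical and are not part of Ray or pyarrow; the sketch assumes the object keys are already known.

```python
def find_unreadable(paths, read_metadata):
    """Try to read metadata for each path and collect the ones that fail.

    read_metadata is any callable that takes a path and raises OSError
    on failure (pyarrow surfaces the S3 error above as an OSError).
    """
    failures = {}
    for path in paths:
        try:
            read_metadata(path)
        except OSError as exc:
            failures[path] = str(exc)
    return failures


def make_pyarrow_reader(filesystem):
    """Build a metadata reader on top of a pyarrow filesystem object.

    Hypothetical helper: filesystem is expected to be a pyarrow.fs
    filesystem, such as the S3FileSystem from the original script.
    """
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    def read_metadata(path):
        # Opening a ParquetFile parses the footer, without going through Ray.
        with filesystem.open_input_file(path) as f:
            return pq.ParquetFile(f).metadata

    return read_metadata
```

With the setup from the first post, this could be called as `find_unreadable(paths, make_pyarrow_reader(src_fs))`, where `paths` is a list of object keys under the dataset prefix. If the same `OSError` shows up here, the problem can be reproduced without Ray at all.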
