How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I have a small script that tests loading Parquet data from S3. I wrote the data to S3 using Ray's write_parquet function.
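import ray
from pyarrow import fs as arrow_fs  # assuming arrow_fs is an alias for pyarrow.fs

# source_s3_cfg and cfg are defined elsewhere (S3 settings and dataset path).
src_fs = arrow_fs.S3FileSystem(**vars(source_s3_cfg))
DATASET_PATH = cfg.dataset.s3_from_path
ds = ray.data.read_parquet(paths=DATASET_PATH, filesystem=src_fs)
# Iterate over the dataset once; the loop body is intentionally empty.
for batch in ds.iter_torch_batches():
    pass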
But this leads to the error:
(_fetch_metadata_serialization_wrapper pid=9390) 2022-10-17 10:39:56,188 INFO worker.py:763 -- Task failed with retryable exception: TaskID(ba47b759362fe699ffffffffffffffffffffffff01000000).
(_fetch_metadata_serialization_wrapper pid=9390) Traceback (most recent call last):
(_fetch_metadata_serialization_wrapper pid=9390) File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
(_fetch_metadata_serialization_wrapper pid=9390) File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
(_fetch_metadata_serialization_wrapper pid=9390) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 410, in _fetch_metadata_serialization_wrapper
(_fetch_metadata_serialization_wrapper pid=9390) return _fetch_metadata(pieces)
(_fetch_metadata_serialization_wrapper pid=9390) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 419, in _fetch_metadata
(_fetch_metadata_serialization_wrapper pid=9390) piece_metadata.append(p.metadata)
(_fetch_metadata_serialization_wrapper pid=9390) File "pyarrow/_dataset.pyx", line 1315, in pyarrow._dataset.ParquetFileFragment.metadata.__get__
(_fetch_metadata_serialization_wrapper pid=9390) File "pyarrow/_dataset.pyx", line 1304, in pyarrow._dataset.ParquetFileFragment.ensure_complete_metadata
(_fetch_metadata_serialization_wrapper pid=9390) File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
(_fetch_metadata_serialization_wrapper pid=9390) OSError: Could not open Parquet input source 'datasets-v2/150GB/61c485076b6c49df85f2fabfe4e2bacb_001466.parquet': AWS Error [code 99]: curlCode: 18, Transferred a partial file
Has anyone experienced this?
If a Ray employee DMs me, I'm happy to provide a reproduction script and the relevant credentials.
But I'm mostly hoping this is a common error that people here have run into before.
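For anyone trying to narrow this down: the log calls the failure a "retryable exception", so one thing worth checking is whether the same files fail every time or only now and then. Below is a minimal sketch of such a probe, not a fix. The helper name `probe_files` and its parameters are hypothetical, and the metadata reader is passed in as a callable so the helper itself has no S3 or pyarrow dependency.

```python
import time
from typing import Callable, Dict, Iterable, List


def probe_files(
    paths: Iterable[str],
    read_metadata: Callable[[str], object],
    attempts: int = 3,
    delay_s: float = 1.0,
) -> Dict[str, List[str]]:
    """Call read_metadata(path) for each path, retrying on OSError.

    Returns a dict with three lists of paths:
      "ok"     - succeeded on the first attempt
      "flaky"  - failed at least once, then succeeded
      "failed" - raised OSError on every attempt
    """
    result: Dict[str, List[str]] = {"ok": [], "flaky": [], "failed": []}
    for path in paths:
        for attempt in range(1, attempts + 1):
            try:
                read_metadata(path)
            except OSError:
                if attempt == attempts:
                    result["failed"].append(path)
                else:
                    time.sleep(delay_s)  # brief pause before retrying
                continue
            result["ok" if attempt == 1 else "flaky"].append(path)
            break
    return result


# Hypothetical usage with pyarrow; src_fs and file_paths come from your own setup:
#   import pyarrow.parquet as pq
#   report = probe_files(
#       file_paths,
#       lambda p: pq.ParquetFile(src_fs.open_input_file(p)).metadata,
#   )
```

If most paths end up under "flaky", the problem is probably transient on the network or S3 side. If the same paths always land under "failed", those files are worth inspecting individually.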