How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I'm currently testing Ray Data's Parquet functionality (Ray version 2.2.0) for converting CSV datasets that I have stored in a private S3 bucket I created. My environment is a remote Ray cluster deployed on Kubernetes, and I used ray job submit to submit the following Python script:
import ray
from pyarrow import fs

ray.init()
s3 = fs.S3FileSystem(access_key="ACCESS_KEY", secret_key="SECRET_KEY", region="us-east-2")

if ray.is_initialized():
    print("Start reading csv")
    ds = ray.data.read_csv("s3://test-bucket/test-dataset1/276mb-dataset.csv", filesystem=s3).lazy().repartition(150)
    print("Done reading csv")
    ds.write_parquet("s3://test-bucket/test-dataset1/ray/", filesystem=s3, try_create_dir=True)
    print("Done writing to S3")

ray.shutdown()
The job reads the CSV file from S3 fine, but the write_parquet call fails with the following error:
Traceback (most recent call last):
File "ray-test-job.py", line 12, in <module>
ds.write_parquet("s3://test-bucket/test-dataset1/ray/", filesystem=s3, try_create_dir=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 2205, in write_parquet
**arrow_parquet_args,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 2566, in write_datasource
_wrap_arrow_serialization_workaround(write_args),
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2309, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_do_write() (pid=1555, ip=16.0.222.204)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 4297, in _do_write
return ds.do_write(blocks, meta, ray_remote_args=ray_remote_args, **write_args)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/file_based_datasource.py", line 282, in do_write
filesystem.create_dir(tmp, recursive=True)
File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.create_dir
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: When creating key 'test-dataset1/' in bucket 'test-bucket': AWS Error [code 100]: Unable to parse ExceptionName: InvalidRequest Message: Content-MD5 OR x-amz-checksum- HTTP header is required for Put Object requests with Object Lock parameters
---------------------------------------
Job 'raysubmit_GSTUn5DrAFgy2ePs' failed
---------------------------------------
Status message: Job failed due to an application error.
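Reading the error message itself, my understanding is that the bucket has S3 Object Lock enabled, and S3 requires a Content-MD5 (or x-amz-checksum-*) header on every PutObject to such buckets. The zero-byte "directory marker" object (key 'test-dataset1/') that filesystem.create_dir PUTs is apparently sent without that header. The header value S3 expects is just the base64-encoded MD5 digest of the request body, which for an empty directory marker is the digest of empty bytes:

```python
# Illustration of the header S3 expects: for Object Lock buckets every
# PutObject must carry Content-MD5 (base64 of the body's MD5 digest)
# or an x-amz-checksum-* header. create_dir PUTs an empty object, so
# the body is b"".
import base64
import hashlib

def content_md5(body: bytes) -> str:
    """Base64-encoded MD5 digest, the value S3 expects in Content-MD5."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

print(content_md5(b""))  # → 1B2M2Y8AsgTpgAmY7PhCfg==
```

So the request is well-defined and cheap to satisfy; the SDK issuing the PUT just has to attach the header, which the one used here evidently does not.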
I am not entirely sure why this error occurs: I passed my AWS credentials when creating the S3FileSystem object, and I verified that I can access the bucket through the AWS CLI. Can anyone give me an idea of why this might be happening, and how to resolve it? Thank you for your time!