How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I'm currently testing Ray Data's Parquet functionality (Ray version 2.2.0) for converting CSV datasets that I have stored in a private S3 bucket I created. My environment is a remote Ray cluster deployed on Kubernetes, and I used ray job submit to submit the following Python script:
import ray
from pyarrow import fs

ray.init()
s3 = fs.S3FileSystem(access_key="ACCESS_KEY", secret_key="SECRET_KEY", region="us-east-2")

if ray.is_initialized():
    print("Start reading csv")
    ds = ray.data.read_csv("s3://test-bucket/test-dataset1/276mb-dataset.csv", filesystem=s3).lazy().repartition(150)
    print("Done reading csv")
    ds.write_parquet("s3://test-bucket/test-dataset1/ray/", filesystem=s3, try_create_dir=True)
    print("Done writing to S3")

ray.shutdown()
The job reads the CSV file from S3 fine, but the write_parquet call fails with the following error:
Traceback (most recent call last):
File "ray-test-job.py", line 12, in <module>
ds.write_parquet("s3://test-bucket/test-dataset1/ray/", filesystem=s3, try_create_dir=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 2205, in write_parquet
**arrow_parquet_args,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 2566, in write_datasource
_wrap_arrow_serialization_workaround(write_args),
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2309, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_do_write() (pid=1555, ip=16.0.222.204)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 4297, in _do_write
return ds.do_write(blocks, meta, ray_remote_args=ray_remote_args, **write_args)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/file_based_datasource.py", line 282, in do_write
filesystem.create_dir(tmp, recursive=True)
File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.create_dir
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: When creating key 'test-dataset1/' in bucket 'test-bucket': AWS Error [code 100]: Unable to parse ExceptionName: InvalidRequest Message: Content-MD5 OR x-amz-checksum- HTTP header is required for Put Object requests with Object Lock parameters
---------------------------------------
Job 'raysubmit_GSTUn5DrAFgy2ePs' failed
---------------------------------------
Status message: Job failed due to an application error.
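Reading the error message itself, my understanding is that the bucket has S3 Object Lock enabled, and S3 requires a Content-MD5 (or x-amz-checksum-*) header on every PutObject to such buckets. The zero-byte "directory marker" object (key 'test-dataset1/') that filesystem.create_dir PUTs is apparently sent without that header. The header value S3 expects is just the base64-encoded MD5 digest of the request body, which for an empty directory marker is the digest of empty bytes:

```python
# Illustration of the header S3 expects: for Object Lock buckets every
# PutObject must carry Content-MD5 (base64 of the body's MD5 digest)
# or an x-amz-checksum-* header. create_dir PUTs an empty object, so
# the body is b"".
import base64
import hashlib

def content_md5(body: bytes) -> str:
    """Base64-encoded MD5 digest, the value S3 expects in Content-MD5."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

print(content_md5(b""))  # → 1B2M2Y8AsgTpgAmY7PhCfg==
```

So the request is well-defined and cheap to satisfy; the SDK issuing the PUT just has to attach the header, which the one used here evidently does not.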
I am not entirely sure why this error occurs: I passed my AWS credentials when creating the S3FileSystem object, and I verified that I can access the bucket through the AWS CLI. Can anyone give me an idea of why this might be happening, and how to resolve it? Thank you for your time!