InvalidRequest Error when writing parquet to private S3 bucket

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi, I’m currently testing Ray Data (Ray 2.2.0) Parquet functionality for working with large datasets (>= 7 GB) that I have stored in a private S3 bucket I created. For my environment, I started an Anyscale cluster and connected to it with ray.init. To access my private bucket, I followed the Anyscale documentation (Accessing a Private S3 Bucket | Anyscale Docs). Then I read the CSV file from S3 as follows:

ds = ray.data.read_csv("s3://my-private-bucket/test-dataset/test.csv").repartition(400)
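For context, I connect to the cluster roughly like this before running the read (the cluster name below is just a placeholder):

import ray

# Placeholder cluster name; credentials for S3 come from the cluster's IAM role.
ray.init("anyscale://my-cluster")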

Afterwards, I planned on converting this partitioned dataset to Parquet files and writing it to a specific prefix in my S3 bucket:

ds.write_parquet("s3://my-private-bucket/test-dataset/ray-partitions/", try_create_dir=False)

However, when I try to run write_parquet, I get the following error:

(write_block pid=3229) OSError: When uploading part for key 'test-dataset/ray-partitions/ef1b17dcb0d94cf09eec81122cea91d1_000002.parquet' in bucket 'my-private-bucket': AWS Error [code 100]: Unable to parse ExceptionName: InvalidRequest Message: Content-MD5 OR x-amz-checksum- HTTP header is required for Put Part requests with Object Lock parameters

I am not entirely sure why this error occurs; the cluster is configured with the role I specified in the bucket policy. Can anyone give me an idea of why this could be happening and how to resolve it?
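In case it is relevant: I know write_parquet also accepts an explicit pyarrow filesystem, so I could construct one myself along these lines (the region below is a placeholder, and the bucket path is the same as above). Would configuring the filesystem explicitly change anything about the multipart-upload headers?

import pyarrow.fs as pafs

# Placeholder region; credentials again come from the cluster's IAM role.
s3 = pafs.S3FileSystem(region="us-west-2")

# Path is given without the s3:// scheme since the filesystem is passed directly.
ds.write_parquet(
    "my-private-bucket/test-dataset/ray-partitions/",
    filesystem=s3,
    try_create_dir=False,
)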