Not able to batch processes that include S3 operations

Harshal_Mittal · June 19, 2023, 6:37am

How severely does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have a pipeline in which files are downloaded from a remote server and processed on an AWS cluster to produce the output. The cluster has 4 cores and 32 GB of RAM. The whole system runs in batches, with 3 workflows running in parallel.

Use Case

Each workflow downloads 2 files from the remote server, processes them, and stores the result on S3.

The whole pipeline works fine with a batch size of 3, meaning 6 files can be downloaded in parallel, processed in parallel, and stored on S3.
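To make the shape of a batch concrete, here is a minimal sketch of the setup described above. It uses only the Python standard library, and every name in it (`download`, `process`, `store`, `run_workflow`, `run_batch`) is a hypothetical placeholder; the real pipeline runs on Ray Workflows and talks to a remote server and S3.

```python
from concurrent.futures import ThreadPoolExecutor

FILES_PER_WORKFLOW = 2  # each workflow downloads 2 files


def download(name):
    # Placeholder for fetching one file from the remote server.
    return f"raw:{name}"


def process(raw_files):
    # Placeholder for the processing step.
    return "+".join(raw_files)


def store(result):
    # Placeholder for the S3 upload; returns a fake object key.
    return f"stored:{result}"


def run_workflow(workflow_id):
    # For brevity the two downloads run one after another here.
    names = [f"wf{workflow_id}-file{i}" for i in range(FILES_PER_WORKFLOW)]
    raw = [download(n) for n in names]
    return store(process(raw))


def run_batch(workflow_ids):
    # Run every workflow in the batch concurrently, one thread each.
    with ThreadPoolExecutor(max_workers=len(workflow_ids)) as pool:
        return list(pool.map(run_workflow, workflow_ids))


if __name__ == "__main__":
    for key in run_batch([0, 1, 2]):  # batch size 3
        print(key)
```

This only illustrates the structure of a batch (3 workflows, 2 files each); it does not reproduce the error, which appears to depend on the S3-backed storage.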

The problem starts when we increase the batch size to 4: it throws a RaySystemError. The workflow fails during execution with the message:

S3 subsystem not initialized;

I also want to draw your attention to the fact that the workflow storage path is an S3 bucket:

ray.init(storage=<path-to-s3>)

Error Tail Logs

  File "pyarrow/_s3fs.pyx", line 214, in pyarrow._s3fs.S3FileSystem._reconstruct
  File "pyarrow/_s3fs.pyx", line 204, in pyarrow._s3fs.S3FileSystem.__init__
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: S3 subsystem not initialized; please call InitializeS3() before carrying out any S3-related operation

I can work around this by limiting the batch size to 3, but I am looking for a concrete explanation of why it works for a batch size of 3 and not for anything larger. It might help me fix the issue.
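For reference, the workaround amounts to something like the sketch below: split the pending workflows into groups of at most 3 and run one group at a time. The `chunked` helper, `MAX_BATCH_SIZE`, and the workflow IDs are illustrative names, not part of Ray.

```python
MAX_BATCH_SIZE = 3  # the largest batch size that currently works


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    pending = [f"workflow-{i}" for i in range(7)]  # hypothetical IDs
    for batch in chunked(pending, MAX_BATCH_SIZE):
        # In the real pipeline, submit this batch and wait for it
        # to finish before starting the next one.
        print(batch)
```

This keeps concurrency at the level that works, but it does not explain why the fourth parallel workflow triggers the S3 initialization error.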

Thanks in advance.
