Access aws s3 for vllm v0.9+

cnmdestroyer · June 27, 2025, 9:30pm

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

Ray version: latest
Python version: 3.12
OS: linux
Cloud/Infrastructure: AWS p4d instances
Other libs/tools (if relevant): vllm v0.9.1

3. What happened vs. what you expected:

Expected: I am able to read/write to S3 if my role is correct
Actual: Has access issue for s3 if use assumed role

this is code structure (all paths are S3 path s3://<bucket>/folder/)：

ds = ray.data.read_parquet(cfg.data_params.input_data_path)
ds = input_data.map_batches(VLLMPreditor, other_args)
ds.write_parquet(cfg.data_params.output_data_path)

it working perfectly using previous vllm version (v0.7/v0.8) and container(python 3.11 based)

after I change to latest container “vllm/vllm-openai:v0.9.1” (torch 2.7 and python 3.12), it complain the access denied error:

OSError: When getting information for key 'xxx/model/input' in bucket 'my-xxx-bucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

the role is exactly same compare to previous run, it has enough permissions, I have print the log like

[2025-06-16 13:47:50,114][__main__][INFO] - Verifying AWS role...
[2025-06-16 13:47:50,585][__main__][INFO] - Current role ARN: arn:aws:sts::<my account id>:assumed-role/myRole/xxx

if I manually pass the filesystem, the access issue is gone, code like:

def create_s3_filesystem(region='us-east-1'):
    try:
        import pyarrow as pa
        import pyarrow.fs as pafs
        import boto3
        session = boto3.Session()
        credentials = session.get_credentials()
        
        if not credentials:
            raise Exception("Unable to retrieve AWS credentials")
        
        s3_fs = pafs.S3FileSystem(
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            session_token=credentials.token,
            region=region
        )
        
        return s3_fs
        
    except Exception as e:
        raise Exception(f"Failed to create S3 filesystem: {str(e)}")

ds = ray.data.read_parquet(cfg.data_params.input_data_path,filesystem=create_s3_filesystem(region=region))
ds.write_parquet(cfg.data_params.output_data_path,filesystem=create_s3_filesystem(region=region))

but the job will failed after a hour for TokenExpire issue:

OSError: When initiating multiple part upload for key 'xxx/fff.parquet' in bucket 'my-xxx-bucket': AWS Error UNKNOWN (HTTP status 400) during CreateMultipartUpload operation: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.

my questions:

Why the ray.data behavior changed for python 3.12 and requires we have to manually pass filesystem, can we avoid it?
Since I am using assumed role, how to always use the latest refreshed token to avoid the token expired error, my job is remote multi node job and need run days.

Any help is appreciated!

raulchen · July 10, 2025, 9:24pm

Ray data uses Pyarrow for IO operations.
PyArrow should automatically detects the credential files or env vars.
I’m not aware of any issue specific to python 3.12. It might also be related to how you set credentials in your container.
I’d suggest try upgrading pyarrow version first.
If it’s still not working. Try using pyarrow directly without ray for easier debugging.

cnmdestroyer · July 10, 2025, 11:29pm

yes, I checked pyarrow version (20.0.0), it’s exactly same in both container.
Not sure if pyarrow’s behavior will different in different python version.

I have tested some code directly using pyarrow:

            logger.info("Testing PyArrow S3FileSystem creation...")
            try:
                import pyarrow.fs as pafs
                # Test default filesystem
                default_fs = pafs.S3FileSystem()
                logger.info(f"Default S3FileSystem created: {type(default_fs)}")
                
                # Test with explicit region
                region_fs = pafs.S3FileSystem(region='us-east-1')
                logger.info(f"Regional S3FileSystem created: {type(region_fs)}")
                
                # Test file info access
                test_path = "my-bucket/my-apth"
                try:
                    file_info = default_fs.get_file_info(test_path)
                    logger.info(f"File info access with default FS: SUCCESS")
                except Exception as e:
                    logger.error(f"File info access with default FS: FAILED - {e}")
                    
                try:
                    file_info = region_fs.get_file_info(test_path)
                    logger.info(f"File info access with regional FS: SUCCESS")
                except Exception as e:

here is the result

for old container (python 3.11 + vllm 0.8+):

[2025-07-01 22:40:13,890][__main__][ERROR] - File info access with default FS: FAILED - When getting information for key '<my-path>' in bucket '<my-bucket>': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body. Looks like the configured region is '' while the bucket is located in 'us-east-1'.
[2025-07-01 22:40:13,990][__main__][INFO] - File info access with regional FS: SUCCESS

for new container (python 3.12 + vllm 0.9+):

[2025-07-01 16:05:57,186][__main__][ERROR] - File info access with default FS: FAILED - When getting information for key '<my-path>' in bucket '<my-bucket>': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body. Looks like the configured region is '' while the bucket is located in 'us-east-1'.
[2025-07-01 16:05:57,255][__main__][ERROR] - File info access with regional FS: FAILED - When getting information for key '<my-path>' in bucket '<my-bucket>': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

I think that’s the cause, but not sure how to mitigate it…

I also verified for package ray, PyArrow, fsspec, boto3, botocore, s3transfer, we are using same version

Topic		Replies	Views
Ray Train on EKS unable to use Pod Identity to access Storage Ray Train	3	77	March 4, 2025
Unknown error when reading data from S3 Ray Train	0	29	March 18, 2025
Cannot use S3 inside of task? Ray Data	4	999	October 19, 2022
AWS InvalidRequest Message when writing parquet to private S3 bucket Ray Data	0	535	February 14, 2023
Ray dataset cannot read and parse image image dataset from S3	12	951	August 14, 2023

Access aws s3 for vllm v0.9+

Related topics