1. Severity of the issue: (select one)
- None: I'm just curious or want clarification.
- Low: Annoying but doesn't hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: latest
- Python version: 3.12
- OS: Linux
- Cloud/Infrastructure: AWS p4d instances
- Other libs/tools (if relevant): vLLM v0.9.1
3. What happened vs. what you expected:
- Expected: Ray Data can read/write to S3 as long as my assumed role has the required permissions.
- Actual: S3 access is denied when running under an assumed role.
This is the code structure (all paths are S3 paths of the form s3://<bucket>/folder/):

```python
ds = ray.data.read_parquet(cfg.data_params.input_data_path)
ds = ds.map_batches(VLLMPreditor, other_args)
ds.write_parquet(cfg.data_params.output_data_path)
```
This worked perfectly with the previous vLLM versions (v0.7/v0.8) and container (Python 3.11 based). After switching to the latest container "vllm/vllm-openai:v0.9.1" (torch 2.7 and Python 3.12), it fails with an access-denied error:

```
OSError: When getting information for key 'xxx/model/input' in bucket 'my-xxx-bucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
```
The role is exactly the same as in the previous runs and has sufficient permissions. I log it before the job starts, e.g.:

```
[2025-06-16 13:47:50,114][__main__][INFO] - Verifying AWS role...
[2025-06-16 13:47:50,585][__main__][INFO] - Current role ARN: arn:aws:sts::<my account id>:assumed-role/myRole/xxx
```
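For reference, the verification that produces this log is roughly the following (a simplified sketch; the function name and logger setup are placeholders, not my exact code):

```python
import logging

import boto3

log = logging.getLogger(__name__)

def verify_aws_role():
    # Ask STS which identity the job is currently running as and log its ARN.
    log.info("Verifying AWS role...")
    identity = boto3.client("sts").get_caller_identity()
    log.info("Current role ARN: %s", identity["Arn"])
```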
If I manually pass the filesystem, the access issue is gone. The code looks like this:
```python
def create_s3_filesystem(region='us-east-1'):
    try:
        import pyarrow as pa
        import pyarrow.fs as pafs
        import boto3

        # Snapshot the current (assumed-role) credentials from boto3 and pass
        # them to PyArrow explicitly.
        session = boto3.Session()
        credentials = session.get_credentials()
        if not credentials:
            raise Exception("Unable to retrieve AWS credentials")
        s3_fs = pafs.S3FileSystem(
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            session_token=credentials.token,
            region=region,
        )
        return s3_fs
    except Exception as e:
        raise Exception(f"Failed to create S3 filesystem: {str(e)}")


ds = ray.data.read_parquet(cfg.data_params.input_data_path, filesystem=create_s3_filesystem(region=region))
ds.write_parquet(cfg.data_params.output_data_path, filesystem=create_s3_filesystem(region=region))
```
However, the job then fails after about an hour with a token-expiration error (presumably because the credentials captured when the filesystem is created are a static snapshot and are never refreshed):

```
OSError: When initiating multiple part upload for key 'xxx/fff.parquet' in bucket 'my-xxx-bucket': AWS Error UNKNOWN (HTTP status 400) during CreateMultipartUpload operation: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.
```
My questions:
- Why did the ray.data behavior change with Python 3.12 / the new container so that I now have to manually pass a filesystem, and can I avoid that?
- Since I am using an assumed role, how can I always use the latest refreshed token and avoid the token-expired error? My job is a remote multi-node job that needs to run for days. (One approach I'm considering is sketched below.)
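One workaround I'm considering (I'm not sure it is the recommended approach, and the function name below is just my own placeholder) is to wrap an fsspec/s3fs filesystem, which resolves credentials through the normal AWS chain and refreshes temporary credentials automatically, instead of passing a static key/secret/token to PyArrow:

```python
import pyarrow.fs as pafs
import s3fs
import ray

def create_refreshing_s3_filesystem(region="us-east-1"):
    # s3fs follows the standard AWS credential chain (env vars, assumed role,
    # instance profile) and refreshes temporary credentials before they
    # expire, unlike a one-time snapshot of access key/secret/token.
    fs = s3fs.S3FileSystem(client_kwargs={"region_name": region})
    # Wrap the fsspec filesystem so it can be passed to Ray Data / PyArrow.
    return pafs.PyFileSystem(pafs.FSSpecHandler(fs))

ds = ray.data.read_parquet(cfg.data_params.input_data_path,
                           filesystem=create_refreshing_s3_filesystem(region))
ds.write_parquet(cfg.data_params.output_data_path,
                 filesystem=create_refreshing_s3_filesystem(region))
```

Is something like this expected to work with Ray Data, or is there a built-in way to get automatically refreshing credentials?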
Any help is appreciated!