Ray Train on EKS unable to use Pod Identity to access Storage

I am running KubeRay on AWS EKS and using Pod Identity to assign an IAM role to Ray pods.

Ray Train fails to use the IAM role to access an S3 bucket through pyarrow.fs.S3FileSystem. Here is an excerpt of the traceback:

    result = trainer.fit()
  File "/usr/local/lib/python3.10/site-packages/ray/train/base_trainer.py", line 589, in fit
    storage = StorageContext(
  File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 461, in __init__
    self._create_validation_file()
  File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 489, in _create_validation_file
    self.storage_filesystem.create_dir(self.experiment_fs_path)
  File "pyarrow/_fs.pyx", line 603, in pyarrow._fs.FileSystem.create_dir
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When testing for existence of bucket 'earthdaily-epdev-hawkeye': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.

Note that running the following successfully returns the role:

ray job submit -- aws sts get-caller-identity

{
    "UserId": "AROAXILHR6EDJ4ULEWCOH:eks-jakob1-raycluster-5cdabcd0-7d0b-4103-8a7f-ecb44eeec74d",
    "Account": "498970259718",
    "Arn": "arn:aws:sts::498970259718:assumed-role/jakob-test-ec2-role/eks-jakob1-raycluster-5cdabcd0-7d0b-4103-8a7f-ecb44eeec74d"
}

And running this script using boto3 also succeeds on the cluster:

import boto3
c = boto3.client('s3')
print(c.list_objects_v2(Bucket="earthdaily-epdev-hawkeye"))

How can I make pyarrow.fs.S3FileSystem use the IAM role?
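
One idea I have is to fetch the credentials with boto3 myself and pass them to pyarrow explicitly, along the lines of the sketch below, but those are temporary session credentials that will expire during a long run, so it does not feel like a real fix:

import boto3
import pyarrow.fs

# Resolve the Pod Identity credentials through boto3, which clearly works
# (see the list_objects_v2 call above), and hand them to pyarrow directly.
session = boto3.Session()
creds = session.get_credentials().get_frozen_credentials()

fs = pyarrow.fs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,  # temporary credentials: these expire
    region=session.region_name,  # may be None; pyarrow then auto-detects
)

# Sanity check against the same bucket from the traceback.
print(fs.get_file_info("earthdaily-epdev-hawkeye"))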


Any luck on this, @jleben? I am suffering from exactly the same issue after updating Ray to the latest version.

I have filed a GitHub issue whose description goes into much more detail: [Train] Unable to gain long-term access to S3 storage for training state/checkpoints when running on AWS EKS · Issue #50823 · ray-project/ray · GitHub

It also contains a workaround that I am currently using; it is pretty convoluted, and I really wish I didn’t need it.
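
For anyone who does not want to click through, the rough shape of it is to build the S3FileSystem from explicitly fetched credentials and hand it to Ray Train through RunConfig’s storage_filesystem argument, so Ray never tries to resolve credentials itself. A minimal sketch (train_loop, the bucket name, and the worker count are placeholders of mine, not from the issue):

import boto3
import pyarrow.fs
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer  # or whichever trainer you use

# Resolve credentials via boto3 (which does honor Pod Identity) and pass
# them to pyarrow explicitly instead of letting it resolve them itself.
creds = boto3.Session().get_credentials().get_frozen_credentials()
fs = pyarrow.fs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
)

trainer = TorchTrainer(
    train_loop,  # your training function (placeholder)
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(
        # With an explicit storage_filesystem, storage_path is given as
        # bucket/prefix without the s3:// scheme.
        storage_path="my-bucket/ray-experiments",
        storage_filesystem=fs,
    ),
)
result = trainer.fit()

The expiring session credentials are exactly what makes this convoluted for long runs, hence the extra machinery in the issue.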


Thanks for documenting this issue so well! I was really struggling to figure out what happened after I updated my AWS EKS cluster.

I just checked out your workaround and I appreciate it; in fact, I will probably use it, but damn is it ugly.

Thanks again for your help!