Unknown error when reading data from S3

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.43.0 with v2 api
  • Python version: 3.12.6
  • OS: Ubuntu
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant): PyTorch Native - 2.6.0

3. What happened vs. what you expected:
I am reading data stored in S3 with the boto3 S3 client, using s3.get_object(). I have isolated the failure to the s3.get_object() call.

  • Expected: The head node should read the data stored in S3.
  • Actual: The run fails with the error below, and the stack trace does not make the underlying cause clear:
> (RayTrainWorker pid=2887) 2025-03-18 16:56:19,373 - INFO - Test run in train worker function -- 1
> (RayTrainWorker pid=2887) 2025-03-18 16:56:19,438 - INFO - Found credentials from IAM Role: ray-autoscaler-v1
> (RayTrainWorker pid=2887) inside if, process rank: 0
> 2025-03-18 16:56:19,592 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_e2f38_00000
> Traceback (most recent call last):
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
>     result = ray.get(future)
>              ^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
>     return fn(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
>     return func(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 2771, in get
>     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
>                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 919, in get_objects
>     raise value.as_instanceof_cause()
> ray.exceptions.RayTaskError(TypeError): ray::_Inner.train() (pid=1727, ip=172.31.89.146, actor_id=09ad1365ac9174f007b419b003000000, repr=TorchTrainer)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/tune/trainable/trainable.py", line 330, in train
>     raise skipped from exception_cause(skipped)
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/air/_internal/util.py", line 107, in run
>     self._ret = self._target(*self._args, **self._kwargs)
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
>     training_func=lambda: self._trainable_func(self.config),
>                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/train/base_trainer.py", line 881, in _trainable_func
>     super()._trainable_func(self._merged_config)
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 261, in _trainable_func
>     output = fn()
>              ^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/train/base_trainer.py", line 122, in _train_coordinator_fn
>     trainer.training_loop()
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 472, in training_loop
>     self._run_training(training_iterator)
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 371, in _run_training
>     for training_results in training_iterator:
>                             ^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/train/trainer.py", line 146, in __next__
>     e = skip_exceptions(e)
>         ^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/air/_internal/util.py", line 62, in skip_exceptions
>     return skip_exceptions(exc.__cause__)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/ray/air/_internal/util.py", line 65, in skip_exceptions
>     new_exc = copy.copy(exc).with_traceback(exc.__traceback__)
>               ^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/copy.py", line 97, in copy
>     return _reconstruct(x, None, *rv)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/copy.py", line 253, in _reconstruct
>     y = func(*args)
>         ^^^^^^^^^^^
>   File "/opt/pytorch/lib/python3.12/site-packages/botocore/exceptions.py", line 28, in _exception_from_packed_args
>     return exception_cls(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> TypeError: RayTaskError._make_normal_dual_exception_instance.<locals>.cls.__init__() missing 1 required positional argument: 'cause'
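
The last frames above show copy.copy() rebuilding the exception through botocore's _exception_from_packed_args and failing because the class being rebuilt needs a 'cause' argument. The following is a minimal sketch of that mechanism in plain Python, with no Ray or botocore involved. PackedError, WrappedError, _from_packed_args and try_copy are hypothetical stand-ins written for illustration, not the real library classes, and the sketch only assumes the failure works the way the traceback suggests.

```python
import copy


def _from_packed_args(exception_cls, args=None, kwargs=None):
    # Stand-in for the helper named in the traceback: rebuild the
    # exception by calling its class with the packed arguments.
    return exception_cls(*(args or ()), **(kwargs or {}))


class PackedError(Exception):
    """Hypothetical library exception that is copied via __reduce__."""

    def __init__(self, **kwargs):
        super().__init__(f"failed: {kwargs}")
        self.kwargs = kwargs

    def __reduce__(self):
        # copy.copy() uses this recipe: call the helper with the *current*
        # class, so a subclass with a different __init__ gets the same args.
        return _from_packed_args, (self.__class__, None, self.kwargs)


class WrappedError(PackedError):
    """Hypothetical wrapper whose __init__ requires an extra 'cause'."""

    def __init__(self, cause):
        self.cause = cause
        super().__init__(**cause.kwargs)


def try_copy(exc):
    # Report whether copy.copy() can rebuild the given exception.
    try:
        copy.copy(exc)
        return "copied"
    except TypeError as err:
        return f"TypeError: {err}"


print(try_copy(PackedError()))                # the plain exception copies fine
print(try_copy(WrappedError(PackedError())))  # the wrapper cannot be rebuilt
```

If this matches what happens in the real run, the TypeError is a secondary failure and the original S3 error is simply hidden behind it.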

The error can be reproduced with the code below:

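import logging

import ray.train
import torch.distributed

logger = logging.getLogger(__name__)


def read_from_s3(bucket_name, file_key):
    import boto3

    # Create an S3 client; credentials come from the environment (here, the IAM role)
    s3 = boto3.client('s3')

    # Fetch the object and decode its body as UTF-8 text
    response = s3.get_object(Bucket=bucket_name, Key=file_key)
    file_content = response['Body'].read().decode('utf-8')

    return file_content


def train_func_per_worker(config: dict):
    logger.info("Reading data from S3...")
    # Only the worker with local rank 0 reads the file
    if ray.train.get_context().get_local_rank() == 0:
        text = read_from_s3(config["bucket_name"], config["file_key"])
    # Every worker blocks here until all workers have reached this point
    torch.distributed.barrier()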
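
To see the real S3 failure instead of the TypeError, one option is to catch the exception inside the worker, log it, and re-raise it as a plain RuntimeError, which carries a single string argument and can be copied without special handling. This is a diagnostic sketch only, assuming the original exception comes from botocore as the traceback suggests. read_from_s3_safe is a hypothetical helper name, and the client is passed in as a parameter (for example boto3.client('s3')) rather than created inside the function.

```python
import logging

logger = logging.getLogger(__name__)


def read_from_s3_safe(s3_client, bucket_name, file_key):
    """Read an S3 object as text; turn any failure into a plain RuntimeError."""
    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
        return response["Body"].read().decode("utf-8")
    except Exception as exc:
        # botocore's ClientError exposes a `response` dict with the AWS error
        # code; for other exception types fall back to the class name.
        details = getattr(exc, "response", None)
        if not isinstance(details, dict):
            details = {}
        code = details.get("Error", {}).get("Code", type(exc).__name__)
        message = f"S3 read failed for s3://{bucket_name}/{file_key} ({code}): {exc}"
        logger.error(message)
        # `from None` leaves __cause__ unset, so only the plain RuntimeError
        # has to be re-raised across process boundaries.
        raise RuntimeError(message) from None
```

In the reproduction above, the call inside the local rank 0 branch could be swapped for read_from_s3_safe(boto3.client('s3'), config["bucket_name"], config["file_key"]) to surface the underlying error code in the worker logs.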