1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.43.0 with v2 api
- Python version: 3.12.6
- OS: Ubuntu
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant): PyTorch Native - 2.6.0
3. What happened vs. what you expected:
I am reading data from S3 using the boto3 S3 client and its `get_object()` operation. I have isolated the failure to the `s3.get_object()` call.
- Expected: Head node should read data present in S3.
- Actual: The trial fails with the error below. The stack trace does not show what actually went wrong inside the S3 call:
> (RayTrainWorker pid=2887) 2025-03-18 16:56:19,373 - INFO - Test run in train worker function -- 1
> (RayTrainWorker pid=2887) 2025-03-18 16:56:19,438 - INFO - Found credentials from IAM Role: ray-autoscaler-v1
> (RayTrainWorker pid=2887) inside if, process rank: 0
> 2025-03-18 16:56:19,592 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_e2f38_00000
> Traceback (most recent call last):
> File "/opt/pytorch/lib/python3.12/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
> result = ray.get(future)
> ^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
> return fn(*args, **kwargs)
> ^^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
> return func(*args, **kwargs)
> ^^^^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 2771, in get
> values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 919, in get_objects
> raise value.as_instanceof_cause()
> ray.exceptions.RayTaskError(TypeError): ray::_Inner.train() (pid=1727, ip=172.31.89.146, actor_id=09ad1365ac9174f007b419b003000000, repr=TorchTrainer)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/tune/trainable/trainable.py", line 330, in train
> raise skipped from exception_cause(skipped)
> File "/opt/pytorch/lib/python3.12/site-packages/ray/air/_internal/util.py", line 107, in run
> self._ret = self._target(*self._args, **self._kwargs)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
> training_func=lambda: self._trainable_func(self.config),
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/train/base_trainer.py", line 881, in _trainable_func
> super()._trainable_func(self._merged_config)
> File "/opt/pytorch/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 261, in _trainable_func
> output = fn()
> ^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/train/base_trainer.py", line 122, in _train_coordinator_fn
> trainer.training_loop()
> File "/opt/pytorch/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 472, in training_loop
> self._run_training(training_iterator)
> File "/opt/pytorch/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 371, in _run_training
> for training_results in training_iterator:
> ^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/train/trainer.py", line 146, in __next__
> e = skip_exceptions(e)
> ^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/air/_internal/util.py", line 62, in skip_exceptions
> return skip_exceptions(exc.__cause__)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/ray/air/_internal/util.py", line 65, in skip_exceptions
> new_exc = copy.copy(exc).with_traceback(exc.__traceback__)
> ^^^^^^^^^^^^^^
> File "/usr/local/lib/python3.12/copy.py", line 97, in copy
> return _reconstruct(x, None, *rv)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/usr/local/lib/python3.12/copy.py", line 253, in _reconstruct
> y = func(*args)
> ^^^^^^^^^^^
> File "/opt/pytorch/lib/python3.12/site-packages/botocore/exceptions.py", line 28, in _exception_from_packed_args
> return exception_cls(*args, **kwargs)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> TypeError: RayTaskError._make_normal_dual_exception_instance.<locals>.cls.__init__() missing 1 required positional argument: 'cause'
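One possible reading of the traceback, which I have not confirmed against the Ray or botocore sources: `copy.copy` rebuilds the exception through its own reduce hook, which calls the exception class with no `cause`, while the wrapper class requires `cause` in `__init__`. The sketch below reproduces only that mechanism in plain Python. `FakeBotoError`, `FakeWrapper`, `_rebuild` and `try_copy` are hypothetical stand-ins, not real Ray or botocore names.

```python
import copy


def _rebuild(exception_cls, args=None, kwargs=None):
    # Hypothetical helper: rebuild an exception from its class and kwargs.
    args = args or ()
    kwargs = kwargs or {}
    return exception_cls(*args, **kwargs)


class FakeBotoError(Exception):
    """Hypothetical library exception with a custom __reduce__."""

    def __init__(self, **kwargs):
        super().__init__("fake S3 failure")
        self.kwargs = kwargs

    def __reduce__(self):
        # Uses self.__class__, so subclasses are rebuilt the same way.
        return _rebuild, (self.__class__, None, self.kwargs)


class FakeWrapper(FakeBotoError):
    """Hypothetical wrapper subclass whose __init__ requires `cause`."""

    def __init__(self, cause):
        super().__init__()
        self.cause = cause


def try_copy(exc):
    # Return the TypeError message if copying fails, otherwise None.
    try:
        copy.copy(exc)
    except TypeError as err:
        return str(err)
    return None


print(try_copy(FakeBotoError(bucket="b")))
print(try_copy(FakeWrapper(FakeBotoError(bucket="b"))))
```

If this reading is right, the `TypeError` is raised while the original S3 error is being copied, which would explain why the real cause never appears in the output.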
The error can be reproduced with the code below:
import logging

import ray.train
import torch

logger = logging.getLogger(__name__)


def read_from_s3(bucket_name, file_key):
    import boto3
    # Create an S3 client and fetch the object
    s3 = boto3.client('s3')
    response = s3.get_object(Bucket=bucket_name, Key=file_key)
    file_content = response['Body'].read().decode('utf-8')
    return file_content


def train_func_per_worker(config: dict):
    logger.info("Reading data from S3...")
    # Only local rank 0 reads from S3; every worker then waits at the barrier
    if ray.train.get_context().get_local_rank() == 0:
        text = read_from_s3(config["bucket_name"], config["file_key"])
    torch.distributed.barrier()
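To see the underlying S3 error while this is open, one option might be to convert whatever the S3 call raises into a plain `RuntimeError` inside the worker. This is a sketch that I have not verified against Ray. `call_with_plain_errors` is a hypothetical helper, and the assumption is that a built-in exception type does not hit the copy failure shown in the traceback.

```python
def call_with_plain_errors(fn, *args, **kwargs):
    """Call fn and re-raise any failure as a plain RuntimeError.

    Hypothetical helper: the original exception's type name and message
    are kept in the new message, and the exception chain is dropped so
    only a built-in exception type leaves the worker.
    """
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        raise RuntimeError(f"{type(exc).__name__}: {exc}") from None
```

In the reproduction above, the S3 read would then become `text = call_with_plain_errors(read_from_s3, config["bucket_name"], config["file_key"])`.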