Hi,
I’m using Ray for distributed training of Hugging Face Transformers models. Recently I started getting the following error while training my model with trainer.fit:
2023-06-21 06:33:41,317 ERROR trial_runner.py:1062 -- Trial HuggingFaceTrainer_77970_00000: Error processing event.
ray.exceptions.RayTaskError(AttributeError): ray::_Inner.train() (pid=9689, ip=10.1.235.136, repr=HuggingFaceTrainer)
File "/databricks/python/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 368, in train
raise skipped from exception_cause(skipped)
File "/databricks/python/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(AttributeError): ray::RayTrainWorker.RayTrainWorker_execute() (pid=9797, ip=10.1.235.134, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f35696f0d60>)
File "/databricks/python/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
raise skipped from exception_cause(skipped)
File "/databricks/python/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/ray/train/huggingface/huggingface_trainer.py", line 417, in _huggingface_train_loop_per_worker
trainer.train()
File "/databricks/python/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
return inner_training_loop(
File "/databricks/python/lib/python3.9/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/databricks/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/databricks/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
File "/databricks/python/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
data.append(next(self.dataset_iter))
File "/databricks/python/lib/python3.9/site-packages/transformers/trainer_pt_utils.py", line 804, in __iter__
for element in self.dataset:
File "/databricks/python/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1344, in __iter__
if self._formatting and (ex_iterable.iter_arrow or self._formatting.format_type == "arrow"):
AttributeError: 'RayDatasetHFIterable' object has no attribute 'iter_arrow'
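As far as I can tell, the last frame fails because the code in datasets/iterable_dataset.py reads an `iter_arrow` attribute that the `RayDatasetHFIterable` object does not define. Here is a minimal sketch of that failure mode. The class and helper names are hypothetical stand-ins for illustration, not the real Ray or datasets classes:

```python
# Hypothetical stand-ins for illustration only; not the real Ray or datasets classes.
class IterableWithoutArrow:
    """Mimics an examples iterable that only supports plain iteration."""

    def __iter__(self):
        yield {"text": "example"}


def supports_arrow(ex_iterable) -> bool:
    # getattr with a default avoids the AttributeError raised by direct access.
    return getattr(ex_iterable, "iter_arrow", None) is not None


ex_iterable = IterableWithoutArrow()
try:
    # Same kind of attribute access as in the last traceback frame.
    ex_iterable.iter_arrow
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(supports_arrow(ex_iterable))
```

This only reproduces the shape of the error. It does not show why the attribute is expected in my environment.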
I’m loading the dataset with ray.data.from_huggingface. I was able to run the code without any issues before, but since yesterday this error has started appearing. I also tried loading the data from Parquet files. Can anyone help me with this issue?
Here is my environment:
Databricks: 12.2 LTS
Ray: 2.3.1
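Since the traceback also goes through the datasets, transformers and torch packages, their versions are probably relevant too. This is a minimal sketch for printing them on the cluster. The package list is my assumption about what matters here, and `installed_versions` is just a helper name I made up:

```python
from importlib import metadata


def installed_versions(packages):
    """Return a {package name: version string or None} mapping."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


# Assumed list of packages that appear in the traceback.
PACKAGES = ["ray", "datasets", "transformers", "torch"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in a notebook cell on the same cluster should show whether any of these packages changed version between the working and failing runs.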