ray.train.base_trainer.TrainingFailedError: The Ray Train run failed

Osman · July 11, 2023, 7:34pm

Hello!
I am trying to train a machine learning model with the SklearnTrainer from the Ray library. Calling fit on the trainer object raises the error below. The code that runs before this snippet completes without errors.

Any suggestions to solve it?
Thanks

Code snippet:

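import ray.air.config
from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestRegressor

# train_dataset, test_dataset, cv and scoring are defined earlier in my script (not shown).
trainer = SklearnTrainer(
    estimator=RandomForestRegressor(),
    label_column="label",
    scaling_config=ray.air.config.ScalingConfig(
        trainer_resources={"CPU": 4}
    ),
    datasets={"train": train_dataset, "test": test_dataset},
    cv=cv,
    parallelize_cv=True,
    scoring=scoring,
)

result = trainer.fit()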

Error message:
An error was encountered:
The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: trainer = SklearnTrainer.restore("/home/ray_results/SklearnTrainer_2023-07-11_11-13-19").
To start a new run that will retry on training failures, set air.RunConfig(failure_config=air.FailureConfig(max_failures)) in the Trainer’s run_config with max_failures > 0, or max_failures = -1 for unlimited retries.
Traceback (most recent call last):
  File "/home/hadoop/venv/lib64/python3.7/site-packages/ray/train/base_trainer.py", line 618, in fit
    ) from result.error
ray.train.base_trainer.TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: trainer = SklearnTrainer.restore("/home/ray_results/SklearnTrainer_2023-07-11_11-13-19").
To start a new run that will retry on training failures, set air.RunConfig(failure_config=air.FailureConfig(max_failures)) in the Trainer’s run_config with max_failures > 0, or max_failures = -1 for unlimited retries.

mask · July 30, 2023, 1:15am

I am having the same problem on Windows 11 with Python 3.10.11 in a venv.

mentor_ai · October 13, 2023, 4:31am

I am having the same problem on Windows 10 with Python 3.8 in a conda env.
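
The traceback in the first post ends with ") from result.error", which suggests the TrainingFailedError is raised with the original exception attached as its __cause__. A minimal sketch of how that chain could be unwrapped to see the underlying error is below. It runs without Ray: the TrainingFailedError class, the fake_fit function and the ValueError message are hypothetical stand-ins for illustration, not Ray's code and not the actual cause of the failure in this thread.

```python
class TrainingFailedError(RuntimeError):
    """Hypothetical stand-in for ray.train.base_trainer.TrainingFailedError,
    defined here only so the sketch runs without Ray installed."""


def cause_chain(exc):
    """Return the exceptions linked by `raise ... from ...`, outermost first."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against a cyclic chain
        chain.append(exc)
        exc = exc.__cause__
    return chain


def fake_fit():
    """Hypothetical stand-in for trainer.fit(): wraps an underlying error
    the same way the traceback shows (`raise ... from result.error`)."""
    try:
        raise ValueError("label column not found")  # made-up underlying error
    except ValueError as original:
        raise TrainingFailedError("The Ray Train run failed.") from original


try:
    fake_fit()
except TrainingFailedError as err:
    chain = cause_chain(err)
    root = chain[-1]  # innermost exception, i.e. the original failure
    print(type(root).__name__, root)
```

With a real trainer, the same idea would be to wrap trainer.fit() in the try block and catch ray.train.base_trainer.TrainingFailedError, assuming the installed Ray version chains the original error as the traceback above indicates.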
