XGBoostTrainer Warning: Saving into deprecated binary model format

daviddwlee84 · November 23, 2023, 6:36am

A warning message keeps spamming the logs during training when I use XGBoostTrainer with RunConfig and CheckpointConfig.

WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using json or ubj. Model format will default to JSON in XGBoost 2.2 if not specified.

Is there an argument I can pass to specify the model format and suppress this warning?

daviddwlee84 · November 24, 2023, 1:36am

[BUG] mlflow.xgboost.load_model: Warning and Missing Parameters using xgboost 2.0.0 · Issue #9659 · mlflow/mlflow
[Roadmap] Phasing out the support for old binary format. · Issue #7547 · dmlc/xgboost

matthewdeng · November 28, 2023, 1:33am

Hey @daviddwlee84, can you share which version of Ray you are using? I believe that XGBoostTrainer should be saving to JSON format already.

Related: Saving XGBoost model with json extension · Issue #41374 · ray-project/ray · GitHub

daviddwlee84 · December 18, 2023, 3:00am

Sorry for the late reply; I must have missed the notification.
I am using Ray 2.8.1, which should already be using the JSON format.

I have raised an issue on GitHub.
It is kind of weird: I haven’t found the root cause yet, and it seems that only I can reproduce it.

=> No matter what I specify in XGBoostTrainer._save_model(), the legacy version of booster.save_model() still gets called again somewhere, and the checkpoint contents are not copied from the temp directory to where they should be persisted on the NAS. (Ray 2.8+ uses ray.train.report to copy checkpoints to persistent storage.)
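For context on why a "legacy" save can happen at all: as far as I understand XGBoost 2.0.x, booster.save_model() picks the on-disk format from the file extension, and any name that does not end in .json or .ubj falls back to the deprecated binary format and prints the warning from the first post. The helper below is only a sketch of that rule under this assumption; expected_xgboost_format is a hypothetical name, not part of XGBoost or Ray.

```python
from pathlib import Path


def expected_xgboost_format(filename: str) -> str:
    """Guess which format Booster.save_model(filename) would write.

    Assumption (not verified against every release): XGBoost 2.0.x chooses
    the format from the file extension and falls back to the deprecated
    binary format, with a warning, for any other name.
    """
    suffix = Path(filename).suffix
    if suffix == ".json":
        return "json"
    if suffix == ".ubj":
        return "ubj"
    return "deprecated-binary"


# A bare "model" file name has no extension, so it would take the legacy path.
for name in ("model.json", "model.ubj", "model"):
    print(f"{name!r} -> {expected_xgboost_format(name)}")
```

If this assumption holds, any code path that saves the booster under a name without an extension would trigger the warning, regardless of what another save call does.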

[Train] XGBoost continue train (resume_from_checkpoint) and get_model failed · Issue #41608 · ray-project/ray

daviddwlee84 · December 19, 2023, 2:48am

The root cause is that XGBoostTrainer’s default _tune_callback_checkpoint_cls saves under MODEL_KEY, which is "model", and this does not match the model.json that XGBoostCheckpoint expects.
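To check whether a given run is affected by this mismatch, a small diagnostic along these lines may help. It is a minimal sketch that only uses the two file names mentioned above ("model" and "model.json") and the checkpoint_* directory naming visible in the logs; inspect_checkpoint and scan_trial are hypothetical helpers, not Ray APIs, and the names may differ in other Ray versions.

```python
import tempfile
from pathlib import Path

# Names taken from the post above; treat them as assumptions for other Ray versions.
LEGACY_NAME = "model"         # what the default callback reportedly writes (MODEL_KEY)
EXPECTED_NAME = "model.json"  # what XGBoostCheckpoint tries to open


def inspect_checkpoint(checkpoint_dir):
    """Classify one checkpoint directory as 'ok', 'legacy-name', or 'missing'."""
    root = Path(checkpoint_dir)
    if (root / EXPECTED_NAME).is_file():
        return "ok"
    if (root / LEGACY_NAME).is_file():
        return "legacy-name"
    return "missing"


def scan_trial(trial_dir):
    """Map each checkpoint_* subdirectory of a trial directory to its status."""
    return {
        ckpt.name: inspect_checkpoint(ckpt)
        for ckpt in sorted(Path(trial_dir).glob("checkpoint_*"))
        if ckpt.is_dir()
    }


# Demo on a throwaway directory so the sketch runs without Ray or XGBoost.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "checkpoint_000000")
    bad = Path(tmp, "checkpoint_000001")
    good.mkdir()
    bad.mkdir()
    (good / EXPECTED_NAME).write_text("{}")
    (bad / LEGACY_NAME).write_bytes(b"")
    print(scan_trial(tmp))
```

Pointing scan_trial at a real trial directory under storage_path would show at a glance which checkpoints contain only the extension-less file and which contain no model file at all.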

Here is the workaround I found.

github.com/ray-project/ray

[Train] XGBoost continue train (resume_from_checkpoint) and get_model failed

opened 03:40AM - 05 Dec 23 UTC

daviddwlee84

bug P1 train

### What happened + What you expected to happen

When I finish XGBoost training using XGBoostTrainer, I want to continue training from the best checkpoint.

1. Assigning `resume_from_checkpoint` fails to load the checkpoint
2. [`XGBoostTrainer.get_model`](https://docs.ray.io/en/latest/train/api/doc/ray.train.xgboost.XGBoostTrainer.get_model.html) can't get the checkpoint either.

The error message for the first issue appears when creating a new trainer with `resume_from_checkpoint` and is quite similar to https://github.com/ray-project/ray/issues/16375

```txt
2023-12-05 10:52:43,353 WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:52:53,378 INFO tune.py:1047 -- Total run time: 11.02 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:52:53,393 WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_5c19d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_5c19d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_5c19d_00000_0_2023-12-05_10-52-42')
```

The error message looks like the second one when I remove the early-stop config `stop=ExperimentPlateauStopper('train-error', mode='min')` from `RunConfig`

```
xgboost.core.XGBoostError: [11:25:04] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f768b5dc24e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f768b6086f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f768b590731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f768b5909f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe0959829dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe095982067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe09599b1e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7fe09599bc95]
  [bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x55b5d791b13f]
```

---

The second issue might be related to https://github.com/ray-project/ray/issues/41374

Either Ray saves the XGBoost model in the legacy binary format or it cannot load the non-default model name from the checkpoint. The workaround does not seem to work.

There are warning logs like this

```
(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000015)
(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:41] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
```

And using `XGBoostTrainer.get_model(checkpoint)` gives this error

```
XGBoostError: [11:16:54] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000017/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f49ef86824e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f49ef8946f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f49ef81c731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f49ef81c9f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f4b8c2bd9dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f4b8c2bd067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f4b8c2d61e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f4b8c2d6c95]
  [bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x5581bcf8513f]
```

### Versions / Dependencies

Python 3.8.13

Packages

```
ray 2.8.1
xgboost-ray 0.1.19
xgboost 2.0.2
```

OS

```
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
```

### Reproduction script

The reproduction script is based on the official tutorial [Get Started with XGBoost and LightGBM — Ray 2.8.0](https://docs.ray.io/en/latest/train/distributed-xgboost-lightgbm.html)

### Load data and do the first training

```python
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig, RunConfig, CheckpointConfig, FailureConfig
from ray.tune.stopper import ExperimentPlateauStopper

ray.init()

dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

run_config = RunConfig(
    name="XGBoost_ResumeExperiment",
    storage_path="/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    # Removing this gives a different error message later
    stop=ExperimentPlateauStopper('train-error', mode='min'),
)

scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)

trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()
```

During fitting, warnings like this appear

```txt
(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:42] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000018)
```

### Get the Best Checkpoint and Resume

```python
checkpoint = result.get_best_checkpoint('valid-logloss', 'min')

trainer_continue = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    resume_from_checkpoint=checkpoint
)
result_continue = trainer_continue.fit()
```

This gives an error like this when early stopping is enabled

```
2023-12-05 10:25:41,638 WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:25:50,900 INFO tune.py:1047 -- Total run time: 9.96 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:25:50,911 WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_95a7d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_95a7d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_95a7d_00000_0_2023-12-05_10-25-40')
```

And an error like this without early stopping

```
xgboost.core.XGBoostError: [11:25:25] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f2f6f5dc24e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f2f6f6086f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f2f6f590731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f2f6f5909f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f9976bab9dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f9976bab067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f9976bc41e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f9976bc4c95]
  [bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x556d2854d13f]
```

Which is the same as

```python
model = XGBoostTrainer.get_model(checkpoint)
```

```
XGBoostError: [11:36:40] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f105a97824e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f105a9a46f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f105a92c731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f105a92c9f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f11f73d09dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f11f73d0067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f11f73e91e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f11f73e9c95]
  [bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x55ea790af13f]
```

### Issue Severity

High: It blocks me from completing my task.

For the detailed investigation, see the issue itself.
