Unable to create predictor from checkpoint

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am currently tuning a LightGBM trainer and using basic data parallelism to speed up the task. I set up an S3 upload directory to store the best checkpoints from the tuning run. Here is what my code looks like:

    # Imports (as assumed for the snippet below)
    from ray import air, tune
    from ray.air.config import RunConfig, ScalingConfig
    from ray.train.lightgbm import LightGBMTrainer
    from ray.tune import Tuner
    from ray.tune.integration.lightgbm import TuneReportCheckpointCallback
    from ray.tune.schedulers import MedianStoppingRule

    # ------------------ TRAINER ------------------ #
    trainer = LightGBMTrainer(
        scaling_config=ScalingConfig(
            # Number of workers to use for data parallelism.
            num_workers=NUM_WORKERS,
            # Whether to use GPU acceleration.
            use_gpu=False,
            resources_per_worker={"CPU": CPU_PER_WORKER},
        ),
        label_column="mapped_inter_categorical",
        num_boost_round=NUM_BOOST_ROUNDS,
        params={
            "objective": "multiclass",
            "num_class": NUM_CLASS,
            "metric": ["multi_error"],
        },
        datasets={"train": small_train_dataset, "valid": small_eval_dataset},
        callbacks=[
            TuneReportCheckpointCallback(
                metrics={"valid-multi_error": "valid-multi_error"},
                filename="lightgbm.mdl",
            )
        ],
    )

    # ------------------ TUNER ------------------ #

    # Define the hyperparameter search space.
    search_space = {
        "params": {
            "learning_rate": tune.loguniform(0.01, 0.5),
            "max_depth": tune.randint(1, 30),
            "num_leaves": tune.randint(10, 200),
            "feature_fraction": tune.uniform(0.1, 1.0),
            "subsample": tune.uniform(0.1, 1.0),
        },
    }

    m_scheduler = MedianStoppingRule(
        metric="valid-multi_error",
        mode="min",
        min_samples_required=MEDIAN_STOPPING_MIN_SAMPLES,
    )

    tuner = Tuner(
        trainer,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            # metric="valid-multi_error",
            # mode="min",
            scheduler=m_scheduler,
        ),
        run_config=RunConfig(
            sync_config=tune.SyncConfig(
                upload_dir=UPLOAD_DIR,
            ),
            checkpoint_config=air.CheckpointConfig(
                checkpoint_score_attribute="valid-multi_error",
                checkpoint_score_order="min",
                num_to_keep=NUM_CHECKPOINTS_TO_KEEP,
            ),
        ),
    )

    # ------------------ END ------------------ #
    result_grid = tuner.fit()

I then try to access the checkpoint as follows:

    checkpoint = ray.air.Checkpoint.from_uri(check_point_path)
    predictor = LightGBMPredictor.from_checkpoint(checkpoint)

However, I get the following error:

File "classifier.py", line 93, in build
  ret_val = self._train(input_ds)
File "classifier.py", line 190, in _train
  predictor = LightGBMPredictor.from_checkpoint(checkpoint)
File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/ray/train/lightgbm/lightgbm_predictor.py", line 54, in from_checkpoint
  model = checkpoint.get_model()
  File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/ray/train/lightgbm/lightgbm_checkpoint.py", line 74, in get_model
    return lightgbm.Booster(model_file=os.path.join(checkpoint_path, MODEL_KEY))
  File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/lightgbm/basic.py", line 2639, in __init__
    _safe_call(_LIB.LGBM_BoosterCreateFromModelfile(
  File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Could not open /var/folders/p6/v606wjhx79l4lyhsnh7wt08w0000gn/T/checkpoint_tmp_e926ee8fd4814442933d010a3261ddbd/model

Is this a known bug in Ray? What would be the ideal way to load the checkpoint from S3 to test on a dataset?

Note: the checkpoint path is an S3 URI of the format:
s3://bucket-name/LightGBMTrainer_2023-03-22_12-33-09/LightGBMTrainer_60a92_00000_0_feature_fraction=0.1261,learning_rate=0.2965,max_depth=2,num_leaves=123,subsample=0.8623_2023-03-22_12-33-11/checkpoint_000025

The checkpoint directory contains the following files:
.is_checkpoint
.metadata.pkl
.tune_metadata
lightgbm.mdl
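
As a side note, one way to avoid hard-coding the S3 URI at all is to pull the best checkpoint straight out of the `ResultGrid` returned by `tuner.fit()`. A minimal sketch, assuming the tuning code above has already produced `result_grid`; the explicit `metric`/`mode` arguments and the `LightGBMPredictor` import are the only parts not taken from the original snippet:

    from ray.train.lightgbm import LightGBMPredictor

    # Pick the best trial by the reported validation metric
    # (same metric/mode as the scheduler above).
    best_result = result_grid.get_best_result(
        metric="valid-multi_error", mode="min"
    )

    # best_result.checkpoint is already a Checkpoint backed by the synced
    # storage location, so it can be handed to the predictor directly.
    predictor = LightGBMPredictor.from_checkpoint(best_result.checkpoint)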

Hi, is the /var/folders/p6/v606wjhx79l4lyhsnh7wt08w0000gn/T/checkpoint_tmp_e926ee8fd4814442933d010a3261ddbd/ directory writeable and readable?

The code looks fine to me, so a permissions issue would be my best guess.

I don’t think I have read and write permissions for that directory, nor am I able to change it. Is there a way to download the checkpoint to a directory I can edit?

I’m also open to a workaround; I’m not sure why no one else has encountered this issue and posted about it on the forum before.
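
If downloading the checkpoint to a directory you control is enough, one possible approach is to materialize it explicitly with `Checkpoint.to_directory()`. A sketch, assuming the same `check_point_path` S3 URI as above; the local directory is just a placeholder:

    import ray.air

    checkpoint = ray.air.Checkpoint.from_uri(check_point_path)

    # Download the checkpoint contents into a directory you own instead of
    # letting Ray pick a system temp directory.
    local_dir = checkpoint.to_directory("/path/you/can/write/to/checkpoint")
    print(local_dir)  # now contains lightgbm.mdl, .tune_metadata, ...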

We are using tempfile to generate the temporary directories. According to the Python documentation, you should be able to change the directory by setting the TMPDIR, TEMP, or TMP environment variables (you’d set them before running the script). See if that helps?
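
For instance, a minimal sketch (the path is a placeholder); the variable has to be set before anything in the process first touches `tempfile`, e.g. at the very top of the script or in the shell that launches it:

    import os

    # Must run before tempfile.gettempdir() is first called anywhere in the
    # process, otherwise the already-cached default is used.
    os.environ["TMPDIR"] = "/path/you/can/write/to/tmp"

    # Equivalent from the shell:
    #   TMPDIR=/path/you/can/write/to/tmp python classifier.py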

Figured out the bug: the issue is that the model can’t be saved under a filename other than “model”. Setting the filename in the trainer’s callback to “lightgbm.mdl” is what caused the bug in the first place.
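
So the fix boils down to keeping the model file name the predictor expects. A sketch of the corrected callback, assuming everything else in the trainer stays as in the original snippet:

    from ray.tune.integration.lightgbm import TuneReportCheckpointCallback

    # The predictor looks for a file literally named "model" inside the
    # checkpoint directory, so use that name instead of "lightgbm.mdl".
    checkpoint_callback = TuneReportCheckpointCallback(
        metrics={"valid-multi_error": "valid-multi_error"},
        filename="model",
    )

    # ...then pass callbacks=[checkpoint_callback] to LightGBMTrainer as above.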

Got it! Will see if we can make it more explicit.