Unable to create predictor from checkpoint

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am currently tuning a LightGBM trainer and using basic data parallelism to speed up the task. I set up an S3 upload directory to store the best checkpoints from the tuning run. Here is what my code looks like:

    # Imports (as assumed for the snippet below)
    from ray import air, tune
    from ray.air.config import RunConfig, ScalingConfig
    from ray.train.lightgbm import LightGBMTrainer
    from ray.tune import Tuner
    from ray.tune.integration.lightgbm import TuneReportCheckpointCallback
    from ray.tune.schedulers import MedianStoppingRule

    # ------------------ TRAINER ------------------ #
    trainer = LightGBMTrainer(
        scaling_config=ScalingConfig(
            # Number of workers to use for data parallelism.
            num_workers=NUM_WORKERS,
            # Whether to use GPU acceleration.
            use_gpu=False,
            resources_per_worker={"CPU": CPU_PER_WORKER},
        ),
        label_column="mapped_inter_categorical",
        num_boost_round=NUM_BOOST_ROUNDS,
        params={
            "objective": "multiclass",
            "num_class": NUM_CLASS,
            "metric": ["multi_error"],
        },
        datasets={"train": small_train_dataset, "valid": small_eval_dataset},
        callbacks=[
            TuneReportCheckpointCallback(
                metrics={"valid-multi_error": "valid-multi_error"},
                filename="lightgbm.mdl",
            )
        ],
    )

    # ------------------ TUNER ------------------ #

    # Define the hyperparameter search space.
    search_space = {
        "params": {
            "learning_rate": tune.loguniform(0.01, 0.5),
            "max_depth": tune.randint(1, 30),
            "num_leaves": tune.randint(10, 200),
            "feature_fraction": tune.uniform(0.1, 1.0),
            "subsample": tune.uniform(0.1, 1.0),
        },
    }

    m_scheduler = MedianStoppingRule(
        metric="valid-multi_error",
        mode="min",
        min_samples_required=MEDIAN_STOPPING_MIN_SAMPLES,
    )

    tuner = Tuner(
        trainer,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            # metric="valid-multi_error",
            # mode="min",
            scheduler=m_scheduler,
        ),
        run_config=RunConfig(
            sync_config=tune.SyncConfig(
                upload_dir=UPLOAD_DIR,
            ),
            checkpoint_config=air.CheckpointConfig(
                checkpoint_score_attribute="valid-multi_error",
                checkpoint_score_order="min",
                num_to_keep=NUM_CHECKPOINTS_TO_KEEP,
            ),
        ),
    )

    # ------------------ END ------------------ #
    result_grid = tuner.fit()

I then try to access the checkpoint as follows:

    checkpoint = ray.air.Checkpoint.from_uri(check_point_path)
    predictor = LightGBMPredictor.from_checkpoint(checkpoint)

However, I get the following error:

File "classifier.py", line 93, in build
  ret_val = self._train(input_ds)
File "classifier.py", line 190, in _train
  predictor = LightGBMPredictor.from_checkpoint(checkpoint)
File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/ray/train/lightgbm/lightgbm_predictor.py", line 54, in from_checkpoint
  model = checkpoint.get_model()
  File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/ray/train/lightgbm/lightgbm_checkpoint.py", line 74, in get_model
    return lightgbm.Booster(model_file=os.path.join(checkpoint_path, MODEL_KEY))
  File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/lightgbm/basic.py", line 2639, in __init__
    _safe_call(_LIB.LGBM_BoosterCreateFromModelfile(
  File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Could not open /var/folders/p6/v606wjhx79l4lyhsnh7wt08w0000gn/T/checkpoint_tmp_e926ee8fd4814442933d010a3261ddbd/model

Is this a known bug in Ray? What would be the ideal way to load the checkpoint from S3 to test on a dataset?

Note: the checkpoint path is an S3 URI of the format:
s3://bucket-name/LightGBMTrainer_2023-03-22_12-33-09/LightGBMTrainer_60a92_00000_0_feature_fraction=0.1261,learning_rate=0.2965,max_depth=2,num_leaves=123,subsample=0.8623_2023-03-22_12-33-11/checkpoint_000025

The checkpoint directory contains the following files:
.is_checkpoint
.metadata.pkl
.tune_metadata
lightgbm.mdl
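
As a side note, one way to avoid hard-coding the S3 URI at all is to pull the best checkpoint straight out of the `ResultGrid` returned by `tuner.fit()`. A minimal sketch, assuming the tuning code above has already produced `result_grid`; the explicit `metric`/`mode` arguments and the `LightGBMPredictor` import are the only parts not taken from the original snippet:

    from ray.train.lightgbm import LightGBMPredictor

    # Pick the best trial by the reported validation metric
    # (same metric/mode as the scheduler above).
    best_result = result_grid.get_best_result(
        metric="valid-multi_error", mode="min"
    )

    # best_result.checkpoint is already a Checkpoint backed by the synced
    # storage location, so it can be handed to the predictor directly.
    predictor = LightGBMPredictor.from_checkpoint(best_result.checkpoint)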

Hi, is the /var/folders/p6/v606wjhx79l4lyhsnh7wt08w0000gn/T/checkpoint_tmp_e926ee8fd4814442933d010a3261ddbd/ directory writeable and readable?

The code looks fine to me, so a permissions issue would be my best guess.

I don’t think I have read and write permissions for that directory, nor am I able to change it. Is there a way to download the checkpoint to a directory I can edit?

I’m also open to a workaround; I’m not sure why no one else has encountered this issue and posted about it on the forum before.
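
If downloading the checkpoint to a directory you control is enough, one possible approach is to materialize it explicitly with `Checkpoint.to_directory()`. A sketch, assuming the same `check_point_path` S3 URI as above; the local directory is just a placeholder:

    import ray.air

    checkpoint = ray.air.Checkpoint.from_uri(check_point_path)

    # Download the checkpoint contents into a directory you own instead of
    # letting Ray pick a system temp directory.
    local_dir = checkpoint.to_directory("/path/you/can/write/to/checkpoint")
    print(local_dir)  # now contains lightgbm.mdl, .tune_metadata, ...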

We are using tempfile to generate the temporary directories. According to the Python documentation, you should be able to change the directory by setting the TMPDIR, TEMP, or TMP environment variables (you’d set them before running the script). See if that helps?
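
For instance, a minimal sketch (the path is a placeholder); the variable has to be set before anything in the process first touches `tempfile`, e.g. at the very top of the script or in the shell that launches it:

    import os

    # Must run before tempfile.gettempdir() is first called anywhere in the
    # process, otherwise the already-cached default is used.
    os.environ["TMPDIR"] = "/path/you/can/write/to/tmp"

    # Equivalent from the shell:
    #   TMPDIR=/path/you/can/write/to/tmp python classifier.py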

Figured out the bug: the issue is that the model can’t be saved under a filename other than “model”. Setting the filename in the trainer’s callback to “lightgbm.mdl” is what caused the bug in the first place.
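
So the fix boils down to keeping the model file name the predictor expects. A sketch of the corrected callback, assuming everything else in the trainer stays as in the original snippet:

    from ray.tune.integration.lightgbm import TuneReportCheckpointCallback

    # The predictor looks for a file literally named "model" inside the
    # checkpoint directory, so use that name instead of "lightgbm.mdl".
    checkpoint_callback = TuneReportCheckpointCallback(
        metrics={"valid-multi_error": "valid-multi_error"},
        filename="model",
    )

    # ...then pass callbacks=[checkpoint_callback] to LightGBMTrainer as above.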

Got it! Will see if we can make it more explicit.