How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am currently tuning a LightGBM trainer and using basic data parallelism to speed up the task. I create an S3 upload directory to store the best checkpoints of the tuning run. Here is what my code looks like:
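(For completeness, the snippet assumes the imports and placeholder constants below; the constant values shown here are only illustrative stand-ins for my real configuration, and small_train_dataset / small_eval_dataset are Ray Datasets built earlier in the script.)

import ray
from ray import air, tune
from ray.air.config import RunConfig, ScalingConfig
from ray.train.lightgbm import LightGBMTrainer
from ray.tune import Tuner
from ray.tune.integration.lightgbm import TuneReportCheckpointCallback
from ray.tune.schedulers import MedianStoppingRule

# Illustrative values only; the real ones come from my config.
NUM_WORKERS = 4
CPU_PER_WORKER = 2
NUM_BOOST_ROUNDS = 100
NUM_CLASS = 10
MEDIAN_STOPPING_MIN_SAMPLES = 3
NUM_CHECKPOINTS_TO_KEEP = 1
UPLOAD_DIR = "s3://bucket-name"  # same bucket as the checkpoint URI in the note below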
#------------------TRAINER-----------------------#
trainer = LightGBMTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=NUM_WORKERS,
        # Whether to use GPU acceleration.
        use_gpu=False,
        resources_per_worker={"CPU": CPU_PER_WORKER},
    ),
    label_column="mapped_inter_categorical",
    num_boost_round=NUM_BOOST_ROUNDS,
    params={
        "objective": "multiclass",
        "num_class": NUM_CLASS,
        "metric": ["multi_error"],
    },
    datasets={"train": small_train_dataset, "valid": small_eval_dataset},
    callbacks=[
        TuneReportCheckpointCallback(
            metrics={"valid-multi_error": "valid-multi_error"},
            filename="lightgbm.mdl",
        )
    ],
)
#---------------------------TUNER--------------------------#
# Define the hyperparameter search space.
search_space = {
    "params": {
        "learning_rate": tune.loguniform(0.01, 0.5),
        "max_depth": tune.randint(1, 30),
        "num_leaves": tune.randint(10, 200),
        "feature_fraction": tune.uniform(0.1, 1.0),
        "subsample": tune.uniform(0.1, 1.0),
    },
}
m_scheduler = MedianStoppingRule(
    metric="valid-multi_error",
    mode="min",
    min_samples_required=MEDIAN_STOPPING_MIN_SAMPLES,
)
tuner = Tuner(
    trainer,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        # metric="valid-multi_error",
        # mode="min",
        scheduler=m_scheduler,
    ),
    run_config=RunConfig(
        sync_config=tune.SyncConfig(
            upload_dir=UPLOAD_DIR,
        ),
        checkpoint_config=air.CheckpointConfig(
            checkpoint_score_attribute="valid-multi_error",
            checkpoint_score_order="min",
            num_to_keep=NUM_CHECKPOINTS_TO_KEEP,
        ),
    ),
)
#----------------------END------------------------#
result_grid = tuner.fit()
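(For reference, check_point_path used below points at the best checkpoint produced by this run. A rough sketch of how I locate it from the ResultGrid, assuming the same metric and mode as the scheduler above:)

# Identify the best trial so I know which synced checkpoint to load.
best_result = result_grid.get_best_result(metric="valid-multi_error", mode="min")
print(best_result.log_dir)     # local trial directory
print(best_result.checkpoint)  # best checkpoint reported for that trial
# check_point_path is the matching checkpoint directory under UPLOAD_DIR,
# e.g. f"{UPLOAD_DIR}/<experiment_name>/<trial_name>/checkpoint_000025".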
I then try to load the checkpoint and build a predictor from it as follows:
checkpoint = ray.air.Checkpoint.from_uri(check_point_path)
predictor = LightGBMPredictor.from_checkpoint(checkpoint)
However, I get the following error:
File "classifier.py", line 93, in build
ret_val = self._train(input_ds)
File "classifier.py", line 190, in _train
predictor = LightGBMPredictor.from_checkpoint(checkpoint)
File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/ray/train/lightgbm/lightgbm_predictor.py", line 54, in from_checkpoint
model = checkpoint.get_model()
File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/ray/train/lightgbm/lightgbm_checkpoint.py", line 74, in get_model
return lightgbm.Booster(model_file=os.path.join(checkpoint_path, MODEL_KEY))
File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/lightgbm/basic.py", line 2639, in __init__
_safe_call(_LIB.LGBM_BoosterCreateFromModelfile(
File "/.pyenv/versions/3.9.16/envs/test-env/lib/python3.9/site-packages/lightgbm/basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Could not open /var/folders/p6/v606wjhx79l4lyhsnh7wt08w0000gn/T/checkpoint_tmp_e926ee8fd4814442933d010a3261ddbd/model
Is this a known bug in Ray? What would be the ideal way to load the checkpoint from S3 so I can test it on a dataset?
Note: the checkpoint path is an S3 URI of the form:
s3://bucket-name/LightGBMTrainer_2023-03-22_12-33-09/LightGBMTrainer_60a92_00000_0_feature_fraction=0.1261,learning_rate=0.2965,max_depth=2,num_leaves=123,subsample=0.8623_2023-03-22_12-33-11/checkpoint_000025
The checkpoint directory contains the following files:
.is_checkpoint
.metadata.pkl
.tune_metadata
lightgbm.mdl
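From the traceback, the predictor seems to look for a file literally named model inside the checkpoint directory, while the callback above wrote lightgbm.mdl. The only workaround I can think of is to materialize the checkpoint locally and load the booster by hand; a rough, untested sketch:

import os
import lightgbm
from ray.air import Checkpoint

# Workaround sketch (untested): download the checkpoint from S3 and point
# LightGBM at the file the callback actually wrote ("lightgbm.mdl").
checkpoint = Checkpoint.from_uri(check_point_path)
local_dir = checkpoint.to_directory()  # materializes the checkpoint contents locally
booster = lightgbm.Booster(model_file=os.path.join(local_dir, "lightgbm.mdl"))

This bypasses LightGBMPredictor entirely, though, so I would still like to know the intended way to go from the S3 checkpoint to a predictor.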