- High: It blocks me from completing my task.
Hi,
I was trying to use xgboost_ray to train a model, with Optuna for hyper-parameter tuning. Some trials were successful, while others failed with a placement group creation timeout error from Ray. Was I doing anything wrong, either in ray.init() or in setting up the ray_params?
I have two RTX 3090s and one RTX 2070; the RTX 2070 is connected via Thunderbolt through an eGPU enclosure.
I initialize Ray as below and create just one actor that holds all the resources, since the dataset is quite big. It seems that Ray does not always release the resources (GPUs or CPUs) after each trial (see the quick check after the code below).
Code
import optuna
import ray
from sklearn.metrics import log_loss
from xgboost_ray import RayXGBClassifier, RayParams

# Start Ray locally with all 64 CPUs and the 3 GPUs
ray.init(num_cpus=64, num_gpus=3, include_dashboard=True,
         _temp_dir="/media/tigertimwu/LinuxData/Ray_Temp_Folder",
         local_mode=False, ignore_reinit_error=True,
         _enable_object_reconstruction=True)

# A single actor that holds almost all of the machine's resources
ray_params = RayParams(
    num_actors=1,
    gpus_per_actor=3,
    cpus_per_actor=60,
    elastic_training=False
)
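To check whether the GPUs and CPUs really come back between trials, I use a quick diagnostic (a sketch on my side, not part of the failing run; ray.cluster_resources() and ray.available_resources() are standard Ray APIs):

# Diagnostic sketch: compare what Ray registered against what is free right now.
# If a finished trial's 60 CPUs / 3 GPUs never reappear as available, they were not released.
print("Registered resources:", ray.cluster_resources())
print("Available resources :", ray.available_resources())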
# Make model using RayXGBClassifier
def make_model(self):
    xgb_params = {"use_label_encoder": False, "verbosity": 1, "objective": "binary:logistic",
                  "predictor": "gpu_predictor", "tree_method": "gpu_hist",
                  "seed": 101, "random_state": 101, "importance_type": "total_gain",
                  "validate_parameters": True, "booster": "gbtree", "gpu_id": 1,
                  "single_precision_histogram": True, "sampling_method": "gradient_based",
                  "grow_policy": "lossguide", "n_jobs": -1}
    cls = RayXGBClassifier(**xgb_params)
    return cls
# Train the model, passing ray_params to model.fit
def train_model(self, trial):
    model = self.make_model()
    model.fit(x_train, y_train, eval_metric="logloss",
              early_stopping_rounds=30, eval_set=eval_set,
              verbose=True, ray_params=ray_params)
    model_predictions = model.predict(x_test)
    logloss = log_loss(y_test, model_predictions)
    del model
    return logloss
# Hyper-parameter tuning using Optuna
def tune_with_optuna(self):
    sampler = optuna.samplers.TPESampler(seed=10)
    study = optuna.create_study(sampler=sampler)  # pass the sampler so the fixed seed is actually used
    study.optimize(self.train_model, n_trials=5, timeout=None,
                   n_jobs=-1, gc_after_trial=True)
    best_value, best_params = study.best_trial.value, study.best_trial.params
    return best_params, best_value
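For completeness, this is roughly how I call the methods above (a simplified sketch; XGBTrainer is a stand-in name for my actual class, and x_train/x_test/y_train/y_test/eval_set are prepared elsewhere in the script):

# Simplified usage sketch (hypothetical class name, data prepared elsewhere)
trainer = XGBTrainer()
best_params, best_value = trainer.tune_with_optuna()
print("Best params:", best_params)
print("Best logloss:", best_value)
ray.shutdown()  # release the GPUs/CPUs when tuning is done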
Error - Please note that some trials were still successful, while others hit this placement group timeout error, which eventually stopped the process.
[W 2022-05-09 06:37:05,053] Trial 1 failed because of the following error: TimeoutError("Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:G': 1.0, 'node:192.168.50.110': 0.9, 'CPU': 4.0, 'object_store_memory': 17330079887.0, 'CPU_group_0_a71a72c12467878391315c41e28af6c3': 60.0, 'GPU_group_0_a71a72c12467878391315c41e28af6c3': 3.0, 'memory': 72381327975.0, 'bundle_group_a71a72c12467878391315c41e28af6c3': 1000.0, 'bundle_group_0_a71a72c12467878391315c41e28af6c3': 1000.0}, resources requested by the placement group: [{'GPU': 3.0, 'CPU': 60.0}]")
Traceback (most recent call last):
File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
value_or_values = func(trial)
File "<ipython-input-2-0c1f19d437eb>", line 845, in tune_train_xgb_cls_model_single
model.fit(x_train, y_train, eval_metric="logloss",
File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost/core.py", line 506, in inner_f
return f(**kwargs)
File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/sklearn.py", line 700, in fit
self._Booster = train(
File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/main.py", line 1387, in train
pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/main.py", line 839, in _create_placement_group
raise TimeoutError("Placement group creation timed out. Make sure "
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:G': 1.0, 'node:192.168.50.110': 0.9, 'CPU': 4.0, 'object_store_memory': 17330079887.0, 'CPU_group_0_a71a72c12467878391315c41e28af6c3': 60.0, 'GPU_group_0_a71a72c12467878391315c41e28af6c3': 3.0, 'memory': 72381327975.0, 'bundle_group_a71a72c12467878391315c41e28af6c3': 1000.0, 'bundle_group_0_a71a72c12467878391315c41e28af6c3': 1000.0}, resources requested by the placement group: [{'GPU': 3.0, 'CPU': 60.0}]