Placement group timeout error - not enough resources for the cluster

  • High: It blocks me to complete my task.

Hi,

I am using xgboost_ray to train a model and Optuna for hyperparameter tuning. Some trials succeed, while others fail with a placement group timeout error. Am I doing anything wrong, either in ray.init() or in setting up ray_params?

I have two RTX 3090s and one RTX 2070; the RTX 2070 is connected via Thunderbolt through an eGPU enclosure.

I initialize Ray as shown below and create a single actor that holds all the resources, since the dataset is quite large. It seems that Ray does not always release the resources (GPUs or CPUs) after each trial.

Code

import optuna
import ray
from sklearn.metrics import log_loss
from xgboost_ray import RayXGBClassifier, RayParams

ray.init(
    num_cpus=64,
    num_gpus=3,
    include_dashboard=True,
    _temp_dir="/media/tigertimwu/LinuxData/Ray_Temp_Folder",
    local_mode=False,
    ignore_reinit_error=True,
    _enable_object_reconstruction=True,
)

ray_params = RayParams(
    num_actors=1,
    gpus_per_actor=3,
    cpus_per_actor=60,
    elastic_training=False
)

# Build the model with RayXGBClassifier

def make_model(self):
    xgb_params = {
        "use_label_encoder": False, "verbosity": 1, "objective": "binary:logistic",
        "predictor": "gpu_predictor", "tree_method": "gpu_hist",
        "seed": 101, "random_state": 101, "importance_type": "total_gain",
        "validate_parameters": True, "booster": "gbtree", "gpu_id": 1,
        "single_precision_histogram": True, "sampling_method": "gradient_based",
        "grow_policy": "lossguide", "n_jobs": -1,
    }
    cls = RayXGBClassifier(**xgb_params)

    return cls

# Train the model, passing ray_params to model.fit

def train_model(self, trial):
    model = self.make_model()

    model.fit(x_train, y_train, eval_metric="logloss",
              early_stopping_rounds=30, eval_set=eval_set,
              verbose=True, ray_params=ray_params)

    model_predictions = model.predict(x_test)

    logloss = log_loss(y_test, model_predictions)

    del model

    return logloss

# Hyperparameter tuning with Optuna

def tune_with_optuna(self):
    sampler = optuna.samplers.TPESampler(seed=10)

    # Pass the sampler in; it was created but never used before.
    study = optuna.create_study(sampler=sampler)

    study.optimize(self.train_model, n_trials=5, timeout=None, n_jobs=-1, gc_after_trial=True)

    best_trial, best_params = study.best_trial.value, study.best_trial.params

    return best_params, best_trial

Error: note that some trials still succeeded, while others hit this placement group timeout error, which eventually stopped the process.

[W 2022-05-09 06:37:05,053] Trial 1 failed because of the following error: TimeoutError("Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:G': 1.0, 'node:192.168.50.110': 0.9, 'CPU': 4.0, 'object_store_memory': 17330079887.0, 'CPU_group_0_a71a72c12467878391315c41e28af6c3': 60.0, 'GPU_group_0_a71a72c12467878391315c41e28af6c3': 3.0, 'memory': 72381327975.0, 'bundle_group_a71a72c12467878391315c41e28af6c3': 1000.0, 'bundle_group_0_a71a72c12467878391315c41e28af6c3': 1000.0}, resources requested by the placement group: [{'GPU': 3.0, 'CPU': 60.0}]")
Traceback (most recent call last):
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "<ipython-input-2-0c1f19d437eb>", line 845, in tune_train_xgb_cls_model_single
    model.fit(x_train, y_train, eval_metric="logloss",
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost/core.py", line 506, in inner_f
    return f(**kwargs)
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/sklearn.py", line 700, in fit
    self._Booster = train(
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/main.py", line 1387, in train
    pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/main.py", line 839, in _create_placement_group
    raise TimeoutError("Placement group creation timed out. Make sure "
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:G': 1.0, 'node:192.168.50.110': 0.9, 'CPU': 4.0, 'object_store_memory': 17330079887.0, 'CPU_group_0_a71a72c12467878391315c41e28af6c3': 60.0, 'GPU_group_0_a71a72c12467878391315c41e28af6c3': 3.0, 'memory': 72381327975.0, 'bundle_group_a71a72c12467878391315c41e28af6c3': 1000.0, 'bundle_group_0_a71a72c12467878391315c41e28af6c3': 1000.0}, resources requested by the placement group: [{'GPU': 3.0, 'CPU': 60.0}]

Hey, can you provide some details about how you set up your cluster?

Also the output of ray status and maybe what the dashboard looks like?
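
In the meantime, the resource dictionary inside the error message can be compared with the request directly. The snippet below is only a rough sketch: the values are copied from the TimeoutError above, the helper names (shortfall, reserved_by_groups) are made up for illustration, and it assumes that keys of the form NAME_group_INDEX_ID are how Ray lists resources already reserved by an existing placement group. It needs no Ray installation.

```python
import re

# Resource report copied from the TimeoutError message in the post.
available = {
    "accelerator_type:G": 1.0,
    "node:192.168.50.110": 0.9,
    "CPU": 4.0,
    "object_store_memory": 17330079887.0,
    "CPU_group_0_a71a72c12467878391315c41e28af6c3": 60.0,
    "GPU_group_0_a71a72c12467878391315c41e28af6c3": 3.0,
    "memory": 72381327975.0,
    "bundle_group_a71a72c12467878391315c41e28af6c3": 1000.0,
    "bundle_group_0_a71a72c12467878391315c41e28af6c3": 1000.0,
}
requested = {"GPU": 3.0, "CPU": 60.0}

# Assumed key layout for per-bundle placement group resources:
# <resource name>_group_<bundle index>_<placement group id>
GROUP_KEY = re.compile(
    r"^(?P<name>[A-Za-z]+)_group_(?P<index>\d+)_(?P<pg_id>[0-9a-f]+)$"
)


def shortfall(available, requested):
    """Return how much of each requested resource the free pool lacks."""
    return {
        name: amount - available.get(name, 0.0)
        for name, amount in requested.items()
        if available.get(name, 0.0) < amount
    }


def reserved_by_groups(available, names):
    """Sum the per-bundle placement group entries for the given resource names."""
    totals = {}
    for key, amount in available.items():
        match = GROUP_KEY.match(key)
        if match and match.group("name") in names:
            name = match.group("name")
            totals[name] = totals.get(name, 0.0) + amount
    return totals


missing = shortfall(available, requested)
reserved = reserved_by_groups(available, requested)
print("missing from the free pool:", missing)
print("listed under placement group keys:", reserved)
```

With the numbers from the post, the free pool has 4 CPUs and no plain GPU entry, so the request for 60 CPUs and 3 GPUs cannot be met. Exactly 60 CPUs and 3 GPUs appear under the keys of placement group a71a72c1..., which suggests that another placement group was still holding them when this trial asked for its own. One possible explanation, not confirmed here, is that n_jobs=-1 in study.optimize lets Optuna start several trials at once, each requesting the whole machine.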