Placement group timeout error - not enough resources for the cluster

tigertimwu · May 9, 2022, 2:19am

High: It blocks me from completing my task.

Hi,

I was trying to use xgboost_ray to train a model, with Optuna for hyperparameter tuning. Some trials succeeded, while others failed with a placement group timeout error. Was I doing anything wrong, either in ray.init() or in setting up ray_params?

I have two RTX 3090s and one RTX 2070; the RTX 2070 is connected over Thunderbolt through an eGPU enclosure.

I initialize Ray as shown below and create a single actor that holds all the resources, since the data set is quite big. It seems that Ray does not always release the resources (GPUs or CPUs) after each trial.

Code

import optuna
import ray
from sklearn.metrics import log_loss
from xgboost_ray import RayXGBClassifier, RayParams

ray.init(num_cpus=64, num_gpus=3, include_dashboard=True,
         _temp_dir="/media/tigertimwu/LinuxData/Ray_Temp_Folder", local_mode=False,
         ignore_reinit_error=True, _enable_object_reconstruction=True)

ray_params = RayParams(
    num_actors=1,
    gpus_per_actor=3,
    cpus_per_actor=60,
    elastic_training=False
)

# Make model using RayXGBClassifier

def make_model(self):
    xgb_params = {"use_label_encoder": False, "verbosity": 1, "objective": "binary:logistic",
                  "predictor": "gpu_predictor", "tree_method": "gpu_hist",
                  "seed": 101, "random_state": 101, "importance_type": "total_gain",
                  "validate_parameters": True, "booster": "gbtree", "gpu_id": 1,
                  "single_precision_histogram": True, "sampling_method": "gradient_based",
                  "grow_policy": "lossguide", "n_jobs": -1}
    cls = RayXGBClassifier(**xgb_params)

    return cls

# Train model and pass ray_params in model.fit

def train_model(self, trial):
    model = self.make_model()

    model.fit(x_train, y_train, eval_metric="logloss",
              early_stopping_rounds=30, eval_set=eval_set,
              verbose=True, ray_params=ray_params)

    model_predictions = model.predict(x_test)

    logloss = log_loss(y_test, model_predictions)

    del model

    return logloss

# Hyperparameter tuning using Optuna

def tune_with_optuna(self):
    sampler = optuna.samplers.TPESampler(seed=10)

    # Pass the sampler in; otherwise the seeded TPESampler above is never used.
    study = optuna.create_study(sampler=sampler)

    study.optimize(self.train_model, n_trials=5, timeout=None, n_jobs=-1, gc_after_trial=True)

    best_value, best_params = study.best_trial.value, study.best_trial.params

    return best_params, best_value
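
A note on the study.optimize call above: with n_jobs=-1, Optuna runs trials in parallel threads (as many as there are CPU cores), so several trials can reach model.fit at the same time, each asking for 60 CPUs and 3 GPUs out of the 64 and 3 registered with ray.init. The sketch below is a minimal, standard-library-only illustration of that arithmetic and of serializing the fit step with a lock. The names max_concurrent_trials, run_exclusive and fake_fit are hypothetical, and fake_fit merely stands in for model.fit. Whether concurrent trials are the actual cause here is an assumption, not something confirmed in this thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

TOTAL = {"CPU": 64, "GPU": 3}      # from ray.init(num_cpus=64, num_gpus=3)
PER_TRIAL = {"CPU": 60, "GPU": 3}  # from RayParams(cpus_per_actor=60, gpus_per_actor=3)


def max_concurrent_trials(total, per_trial):
    """How many trials fit at once, limited by the scarcest resource."""
    return min(total[name] // need for name, need in per_trial.items() if need > 0)


_fit_lock = threading.Lock()


def run_exclusive(fit):
    """Run fit() while holding a lock, so only one trial requests resources at a time."""
    with _fit_lock:
        return fit()


# Bookkeeping used only to show that the lock serializes the calls.
active = 0
peak = 0
_counter_lock = threading.Lock()


def fake_fit():
    """Hypothetical stand-in for model.fit; records how many fits overlap."""
    global active, peak
    with _counter_lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # pretend to train
    with _counter_lock:
        active -= 1
    return 0.5  # placeholder loss value


# Four worker threads mimic trials running in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(lambda _: run_exclusive(fake_fit), range(4)))

print("trials that fit at once:", max_concurrent_trials(TOTAL, PER_TRIAL))
print("peak overlapping fits:", peak)
```

Since only one trial fits at a time under these figures, setting n_jobs=1 in study.optimize would likely be the simpler route; the lock is only useful if other parts of a trial should still overlap.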

Error - Please note that some trials still succeeded; others hit this placement group timeout error, which eventually stopped the process.

[W 2022-05-09 06:37:05,053] Trial 1 failed because of the following error: TimeoutError("Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:G': 1.0, 'node:192.168.50.110': 0.9, 'CPU': 4.0, 'object_store_memory': 17330079887.0, 'CPU_group_0_a71a72c12467878391315c41e28af6c3': 60.0, 'GPU_group_0_a71a72c12467878391315c41e28af6c3': 3.0, 'memory': 72381327975.0, 'bundle_group_a71a72c12467878391315c41e28af6c3': 1000.0, 'bundle_group_0_a71a72c12467878391315c41e28af6c3': 1000.0}, resources requested by the placement group: [{'GPU': 3.0, 'CPU': 60.0}]")
Traceback (most recent call last):
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "<ipython-input-2-0c1f19d437eb>", line 845, in tune_train_xgb_cls_model_single
    model.fit(x_train, y_train, eval_metric="logloss",
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost/core.py", line 506, in inner_f
    return f(**kwargs)
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/sklearn.py", line 700, in fit
    self._Booster = train(
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/main.py", line 1387, in train
    pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
  File "/home/tigertimwu/anaconda3/lib/python3.9/site-packages/xgboost_ray/main.py", line 839, in _create_placement_group
    raise TimeoutError("Placement group creation timed out. Make sure "
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:G': 1.0, 'node:192.168.50.110': 0.9, 'CPU': 4.0, 'object_store_memory': 17330079887.0, 'CPU_group_0_a71a72c12467878391315c41e28af6c3': 60.0, 'GPU_group_0_a71a72c12467878391315c41e28af6c3': 3.0, 'memory': 72381327975.0, 'bundle_group_a71a72c12467878391315c41e28af6c3': 1000.0, 'bundle_group_0_a71a72c12467878391315c41e28af6c3': 1000.0}, resources requested by the placement group: [{'GPU': 3.0, 'CPU': 60.0}]
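
The resource dictionary in the error can be read mechanically. The sketch below copies the relevant figures from the message above and compares them with the requested bundle; shortfall and reserved_by_groups are hypothetical helper names, not part of Ray or xgboost_ray. Reading the CPU_group_... and GPU_group_... entries as resources held by an already existing placement group is an assumption about how Ray names such resources.

```python
# Figures copied from the TimeoutError message (unrelated keys omitted).
available = {
    "CPU": 4.0,
    "CPU_group_0_a71a72c12467878391315c41e28af6c3": 60.0,
    "GPU_group_0_a71a72c12467878391315c41e28af6c3": 3.0,
}
requested = {"GPU": 3.0, "CPU": 60.0}


def shortfall(available, requested):
    """Return {resource: missing amount} for each resource the request cannot get."""
    missing = {}
    for name, need in requested.items():
        free = available.get(name, 0.0)  # a resource absent from the dict has nothing free
        if free < need:
            missing[name] = need - free
    return missing


def reserved_by_groups(available, kind):
    """Sum the amounts listed under '<kind>_group_...' keys (assumed to be reserved)."""
    prefix = f"{kind}_group_"
    return sum(value for key, value in available.items() if key.startswith(prefix))


print("missing:", shortfall(available, requested))
print("CPUs in existing group:", reserved_by_groups(available, "CPU"))
print("GPUs in existing group:", reserved_by_groups(available, "GPU"))
```

Under that reading, 60 CPUs and all 3 GPUs were already reserved by another placement group when this trial asked for the same amount, leaving 4 free CPUs and no free GPU.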

Alex · June 1, 2022, 10:16pm

Hey, can you provide some details about how you set up your cluster?

Also, could you share the output of ray status, and maybe what the dashboard looks like?
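
For anyone gathering that information from a script or notebook rather than a terminal, a minimal sketch follows. It simply shells out to the ray status command; collect_ray_status is a hypothetical helper name, and the 60-second timeout is an arbitrary choice.

```python
import shutil
import subprocess


def collect_ray_status(timeout_s: float = 60.0) -> str:
    """Return the text printed by `ray status`, or a short note if it cannot be run."""
    if shutil.which("ray") is None:
        return "ray CLI not found on PATH"
    try:
        result = subprocess.run(
            ["ray", "status"], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "ray status did not finish in time"
    # Include stderr too, since errors (e.g. no running cluster) are reported there.
    return result.stdout + result.stderr


print(collect_ray_status())
```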
