No space left on device - tuneSearchCV - disable saving

Hello,

I am running an XGBoost classifier:

xgb_class = xgb.XGBClassifier(objective='multi:softmax',
                              num_class=nb_classes,
                              use_label_encoder=False,
                              seed=123,
                              enable_categorical=False)

model = TuneSearchCV(
    xgb_class,
    param_distributions=xgb_params,
    n_trials=NB_TRIALS,
    max_iters=10,
    search_optimization='bohb',
    early_stopping=True,
    scoring='f1_micro',
    n_jobs=NB_CPUS,
    name='Ray tune',
    verbose=0,
    local_dir='./ray_results',
    use_gpu=USE_GPU,
)

and training stops at some point with a "No space left on device" error.

The directory './ray_results' consumes 45GB and wants more!

  1. How can I disable the saving to directory?

  2. Is there a way to save only the best trial with the setup above (that is, while still using TuneSearchCV)?

I tried to use:

os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

but the result is still the same.

Hi @George, I had a similar discussion here about disabling as much as possible when running a Tune job, but I am not sure whether one can disable the creation of the .json state files and the other directories that get created. @kai, is that possible?


Hey @George and @max_ronda,

I don’t think it’s possible to disable the creation of the directories.

tune.fit returns an ExperimentAnalysis, and ExperimentAnalysis requires a checkpoint. In other words, if you don’t have a checkpoint, tune.fit won’t work.

cc @kai did I miss anything?

Ok, but what about the size? GBs? It’s huge!

Hi @George can you look into the directories and see what is taking up so much space?

du -hs ~/ray_results/*

and then continue this in the subdirectories to see if this is due to many small files or some very large files (and what those files are).
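If drilling down with du by hand gets tedious, the same check can be sketched in Python. This is only a minimal sketch: the helper names largest_files and human are made up for illustration, and ~/ray_results is assumed to be where the results live (the original post uses ./ray_results).

```python
import os


def largest_files(root, top_n=10):
    """Return the top_n largest files under root as (size_bytes, path) tuples."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file vanished or is unreadable (e.g. a trial is still writing it).
                continue
    return sorted(sizes, reverse=True)[:top_n]


def human(num_bytes):
    """Format a byte count with a binary unit suffix."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes:.1f}{unit}"
        num_bytes /= 1024
    return f"{num_bytes:.1f}TB"


if __name__ == "__main__":
    # Assumed location; adjust to the local_dir passed to TuneSearchCV.
    for size, path in largest_files(os.path.expanduser("~/ray_results")):
        print(f"{human(size):>10}  {path}")
```

A handful of very large files points at something big being serialized, while thousands of small files would point at the number of trials or checkpoints.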

How many trials are you running (what is NB_TRIALS)?

Hi @kai!

I will inform you on Monday because I don’t have access now.

A single trial takes around 15-20GB, if I remember correctly!

Thanks George! It’s unexpected so I’m really curious to see what is causing this disk usage.


Hi @kai !

So, for one trial:

  1. A folder Trainable_: 4.1GB. The big files are:

     params.json: 593.6MB
     params.pkl: 144.2MB
     result.json: 3.4GB

  2. A file experiment_state-2022-10-31_08-54-45.json: 865.3MB

  3. A file search_gen_state-2022-10-31_08-54-45.json: 144.2MB

I have a file params.py in my code that contains the configuration parameters.
It uses:

import logging
from ray import tune

Hi @kai ! Any ideas about this?

Thanks!

It looks like there is a large object contained in the parameter space. Can you show us what your param_distributions looks like?

The TuneSearchCV param_distributions parameter gets passed to Ray Tune. It is saved both as params.json and params.pkl. The size of these files shows that something large is stored in it (usually these files are a few kilobytes at most). The size difference also suggests that it could be an array of some sort, as that would be stored more compactly in the pickle format.

Ray Tune also stores trial configurations in the experiment checkpoints, which is why the experiment state and search gen state files are also large. Lastly, we log the config parameters every time we get a result, which is why result.json keeps growing over time.
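To make the growth concrete, here is a rough simulation of a result log that repeats the full config with every reported result. It is a simplified stand-in, not Ray Tune's actual logger; the function name, the record fields, and the bloated_config contents are all invented for illustration.

```python
import json


def simulated_result_log_size(config, num_results):
    """Total bytes written if every reported result repeats the full config.

    Simplified stand-in for a JSON-lines result log, not Ray Tune's logger.
    """
    total = 0
    for i in range(num_results):
        line = json.dumps({"training_iteration": i + 1, "score": 0.5, "config": config})
        total += len(line.encode("utf-8")) + 1  # +1 for the trailing newline
    return total


small_config = {"max_depth": 3, "learning_rate": 0.1}
# Hypothetical config that accidentally embeds a 100,000-element data column.
bloated_config = dict(small_config, data=[0.0] * 100_000)

small = simulated_result_log_size(small_config, num_results=15)
bloated = simulated_result_log_size(bloated_config, num_results=15)
print(f"small config:   {small} bytes for 15 results")
print(f"bloated config: {bloated} bytes for 15 results")
```

The log size scales with both the config size and the number of reported results, so a config that carries data turns a kilobyte-scale file into one that grows with every iteration.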

Generally, the parameter space should only contain primitives (numbers, strings) or simple structures (dicts, lists), but not data.

If you need to access external data in your xgb_class (feel free to show this as well), you should usually pass a reference to it (e.g. the location of a file containing the data in the parameter space), not the data itself. With vanilla Ray Tune it is also possible to upload the data to the Ray object store, a central location, before training. I'm happy to help with this, but we'll need to see a bit more code.
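As a minimal sketch of the "pass a reference, not the data" idea, using only the standard library: the dataset is written to disk once and the config carries just its path. The names save_dataset, train_fn, and data_path are hypothetical, and the "training" is a placeholder computation rather than a real model fit.

```python
import os
import pickle
import tempfile


def save_dataset(data, directory):
    """Write the dataset to disk once and return its path."""
    path = os.path.join(directory, "train_data.pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return path


def train_fn(config):
    """Hypothetical trainable: loads the data from the path stored in config."""
    with open(config["data_path"], "rb") as f:
        data = pickle.load(f)
    # Placeholder "training": a real trainable would fit a model here.
    return sum(data) / len(data) * config["learning_rate"]


with tempfile.TemporaryDirectory() as tmp:
    dataset = list(range(100_000))
    # Anti-pattern: the data itself lives in the config.
    config_by_value = {"learning_rate": 0.1, "data": dataset}
    # Preferred: the config only references where the data lives.
    config_by_reference = {"learning_rate": 0.1, "data_path": save_dataset(dataset, tmp)}
    size_by_value = len(pickle.dumps(config_by_value))
    size_by_reference = len(pickle.dumps(config_by_reference))
    result = train_fn(config_by_reference)

print("config with data:", size_by_value, "bytes")
print("config with path:", size_by_reference, "bytes")
```

Anything that serializes the config (params.json, params.pkl, experiment state, result logs) then only ever stores the short path string.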

Hi @kai , thanks for the help!

My params file is:

import logging
from ray import tune
       


FILENAME = 'df_train_initial'
RAY_TUNER = True
NB_TRIALS = 3
MLFLOW = False

TEST = 0.1 
VAL = 0.2  
NFOLDS = 3 

# create a logger
logger = logging.getLogger('Classification_logger')

DEBUG = False

if DEBUG:
    SEED = 123
else:
    SEED = None


EARLY_STOP = 10 
SEARCH_OPTIMIZATION = 'bohb' 
EVAL_METRIC = ["merror", "mlogloss"] 
USE_GPU = True

# params for xgb model
xgb_params = {
    'n_estimators': tune.randint(10, 80),
    'reg_alpha': tune.loguniform(0.1, 100),
    'booster': tune.choice(['gbtree', 'gblinear']),
    'colsample_bylevel': tune.uniform(0.05, 0.5), 
    'colsample_bytree': tune.uniform(0.05, 0.5), 
    'learning_rate': tune.uniform(0.001, 0.4),  
    'reg_lambda': tune.loguniform(0.1, 100),  
    'subsample': tune.uniform(0.2, 0.7), 
    "max_depth": tune.randint(1, 10),
    "min_child_weight": tune.choice([1, 2, 3]),
    "eta": tune.loguniform(1e-4, 1e-1),
    }

In the code where I have the xgb model I have something like this:

if RAY_TUNER:
    import ray
    if not ray.is_initialized():
        ray.init(num_cpus=4)

    
                
def xgb_model(x_train, 
              y_train,
              x_val,
              y_val,
              features,
              classes):
    

    kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
    fold = 1
    fig, ax = plt.subplots(NFOLDS, len(EVAL_METRIC), sharex=True)
    fig.tight_layout()

    for i, (train_idx, val_idx) in enumerate(kfold.split(x_train, y_train)):
        x_train_, y_train_ = x_train[train_idx, :], y_train[train_idx]
        x_val_, y_val_ = x_train[val_idx, :], y_train[val_idx]
        
        # I am applying SMOTE here
        ...       
        ...
       
        oversample = BorderlineSMOTE(sampling_strategy=over)
        undersample = RandomUnderSampler(sampling_strategy=under)
        steps = [('o', oversample), ('u', undersample)]
        pipeline = Pipeline(steps=steps)
        
        X_balanced, y_balanced = pipeline.fit_resample(x_train_, y_train_)
    
        if RAY_TUNER:

            xgb_class = xgb.XGBClassifier(objective ='multi:softprob',
                                          num_class=nb_classes,
                                          use_label_encoder=False,
                                          seed=SEED,
                                          enable_categorical=False)

            model = TuneSearchCV(
                xgb_class,
                param_distributions=xgb_params,
                n_trials=NB_TRIALS,
                max_iters=15,
                search_optimization=SEARCH_OPTIMIZATION,
                early_stopping=True,
                scoring='f1_micro',
                n_jobs=NB_CPUS,
                name='Ray tune',
                verbose=0,
                local_dir='./ray_results',
                use_gpu=USE_GPU,
                )
        ...
        # Train the model
        history = model.fit(X_balanced,
                            y_balanced,
                            eval_metric=EVAL_METRIC,
                            early_stopping_rounds=EARLY_STOP,
                            eval_set=[(X_balanced, y_balanced), (x_val_, y_val_)])
   
        if RAY_TUNER:
            best_model = history.best_estimator_
            
        ...

I guess what we need to see is

                param_distributions=xgb_params,

how is the xgb_params variable initialized?

Also, where are you using the xgb_model?

Hi @Kai and thanks for helping.

The xgb_params is defined in the params file, as I noted.

And in the fold loop as you can see I call :

model = TuneSearchCV(
                xgb_class,
                param_distributions=xgb_params,
                n_trials=NB_TRIALS,

The xgb_model function is called in another file:




def train(filename):
    ...
    # run model
    model = xgb_model(x_train.values,
                      y_train,
                      x_val.values,
                      y_val,
                      x_train.columns.tolist(),
                      classes)
    ...

Sorry, my bad, I didn’t see the scroll bars in the posted code.

Thank you very much for all the information.

My current guess is that the CV variable captures the dataset. In tune-sklearn we have this line:

 cv = check_cv(cv=self.cv, y=y, classifier=classifier)

which returns a generator for the train/test dataset. It could be that this holds the full dataset in memory and thus blows up the config.

This is then a bug in tune-sklearn, and we can circumvent it by using tune.with_parameters in tune-sklearn. cc @Yard1 who has been working on tune-sklearn.

For a final confirmation, would you mind loading the pickled params.pkl file:

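import sys
import ray.cloudpickle as cloudpickle

# Load the parameters that Ray Tune pickled for this trial.
with open("path/to/params.pkl", "rb") as f:
    params = cloudpickle.load(f)

# Note: sys.getsizeof reports only each object's own (shallow) size.
for k, v in params.items():
    print("param", k, sys.getsizeof(v))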

and we can investigate from our side.
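One caveat with this check: sys.getsizeof only reports the size of the container object itself and does not follow references, so a small dict that points at a huge array still looks tiny. A rough alternative is to measure how many bytes each value takes when pickled. The sketch below uses the standard-library pickle and a made-up example dict; with a real params.pkl, pickle.dumps could probably be swapped for cloudpickle.dumps, which is an assumption on my part.

```python
import pickle
import sys


def pickled_sizes(params):
    """Map each key to (shallow_size, pickled_size) in bytes."""
    report = {}
    for key, value in params.items():
        try:
            deep = len(pickle.dumps(value))
        except Exception:
            deep = None  # the value could not be pickled with the stdlib pickler
        report[key] = (sys.getsizeof(value), deep)
    return report


# Hypothetical stand-in for a loaded params.pkl: a small dict referencing a big list.
example = {
    "max_depth": 3,
    "fit_params": {"eval_set": [list(range(200_000))]},
}
for key, (shallow, deep) in pickled_sizes(example).items():
    print(f"param {key}: getsizeof={shallow} pickled={deep}")
```

A key whose pickled size is far larger than its getsizeof value is the one dragging data into the config.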

Thanks!


Hi @kai .

I just saved the xgb_params dictionary , right?

The result is:

param n_estimators 48
param reg_alpha 48
param booster 48
param colsample_bylevel 48
param colsample_bytree 48
param learning_rate 48
param reg_lambda 48
param subsample 48
param max_depth 48
param min_child_weight 48
param eta 48

Hey @George, could you also load the params.pkl file inside the ~/ray_results folder? It is the same file we talked about earlier in this thread (post #9 by George).

Hi @Yard1 .

I am using a different dataset now (I don't remember which dataset I was testing back then), but I still have the issue.

OK, so for one params.pkl file we have:

param early_stopping 28
param early_stop_type 48
param groups 16
param cv 48
param fit_params 232
param scoring 232
param max_iters 28
param return_train_score 24
param n_jobs 28
param metric_name 67
param n_estimators 28
param reg_alpha 24
param booster 55
param colsample_bylevel 24
param colsample_bytree 24
param learning_rate 24
param reg_lambda 24
param subsample 24
param max_depth 28
param min_child_weight 28

Also, since we are having this discussion, I receive a great many warnings during training:

The `callbacks.on_trial_result` operation took 15.449 s, which may be a performance bottleneck.
	WARNING util.py:244 -- The `process_trial_result` operation took 15.451 s, which may be a performance bottleneck.
	WARNING util.py:244 -- Processing trial results took 15.452 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
	WARNING util.py:244 -- The `process_trial_result` operation took 15.455 s, which may be a performance bottleneck.
(_Trainable pid=18515) [23:17:58] WARNING: ../src/learner.cc:767: 
(_Trainable pid=18515) Parameters: { "colsample_bylevel", "colsample_bytree", "max_depth", "min_child_weight", "subsample" } are not used.
(_Trainable pid=18515) 
(_Trainable pid=18515) [23:18:32] WARNING: ../src/learner.cc:767: 
(_Trainable pid=18515) Parameters: { "colsample_bylevel", "colsample_bytree", "max_depth", "min_child_weight", "subsample" } are not used.

Thanks!

Thanks! Just to be clear, the disk space issue is still a problem, correct? Would it be possible for you to upload one of the oversized Trainable_ folders so we could take a look?

The warnings seem to come from xgboost itself, perhaps the parameter names are wrong?

Hey @George, I think I narrowed down the issue to the eval_set argument in fit. Could you see if the issue still happens without it present? I’ll be working on a proper fix in the meantime!
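For anyone who wants to try this quickly, a small helper can strip the argument from the fit keyword arguments. This is only a sketch: without_eval_set and fit_kwargs are hypothetical names, and early_stopping_rounds is dropped too on the assumption that XGBoost's early stopping needs at least one evaluation set.

```python
def without_eval_set(fit_kwargs):
    """Return a copy of fit_kwargs without eval_set and the option that depends on it."""
    # Assumption: early_stopping_rounds cannot work without an evaluation set.
    dependent = {"eval_set", "early_stopping_rounds"}
    return {k: v for k, v in fit_kwargs.items() if k not in dependent}


# Placeholder values standing in for the real arrays from the post above.
fit_kwargs = {
    "eval_metric": ["merror", "mlogloss"],
    "early_stopping_rounds": 10,
    "eval_set": [("X_balanced", "y_balanced"), ("x_val_", "y_val_")],
}
trimmed = without_eval_set(fit_kwargs)
print(sorted(trimmed))

# Intended use with the tuner from the post (not run here):
# history = model.fit(X_balanced, y_balanced, **trimmed)
```

If the Trainable_ folder stays small with the trimmed arguments, that would support the eval_set explanation.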
