Handling large datasets results in error

I have a large dataset, and I’m using tune.with_parameters to pass it to the trainable function. Here is the code for tuning.
trainable function

def xgboost_hyper_param(config, data=None):
    # max_depth = int(max_depth)
    # params = {'max_depth': max_depth, 'learning_rate': learning_rate, 'gamma': gamma}
    trainX, trainY, validX, validY, tr_groups, val_groups = data

    train_dmatrix = xgb.DMatrix(trainX[feature_names], trainY, feature_names=feature_names)
    valid_dmatrix = xgb.DMatrix(validX[feature_names], validY, feature_names=feature_names)
    train_dmatrix.set_group(tr_groups)
    valid_dmatrix.set_group(val_groups)

    params = config
    params['max_depth'] = int(params['max_depth'])
    params['tree_method'] = 'gpu_hist'
    params['objective'] = OBJECTIVE
    params['learning_rate'] = LEARNING_RATE

    model = xgb.train(params, train_dmatrix, num_boost_round=N_ESTIMATORS, maximize=True,
                      evals=[(valid_dmatrix, 'eval')], feval=sharpe_metric,
                      verbose_eval=False, early_stopping_rounds=30,
                      callbacks=[TuneReportCallback({"mean_sharpe": "eval-sharpe"})])

tuning function

algo = TuneBOHB(metric="mean_sharpe", mode="max", seed=101)
bohb = HyperBandForBOHB(
    time_attr='training_iteration',
    metric='mean_sharpe',
    mode='max',
    max_t=500,
    reduction_factor=3,
)

analysis = tune.run(
    tune.with_parameters(
        xgboost_hyper_param,
        data=(trainX, trainY, validX, validY, tr_groups, val_groups),
    ),
    # metric="mean_sharpe",
    # mode="max",
    name=f"run_{round_number}",
    resources_per_trial={"cpu": 4, "gpu": 0.1},
    config=var_space,
    num_samples=500,
    local_dir=f'{root_dir}/logs/pairwise',
    search_alg=algo,
    scheduler=bohb,
    # resume=resume
)

The above setup worked fine until now. The dataset has changed and is much bigger, and the same setup now fails with the error "ConnectionError: Error 104 while writing to socket. Connection reset by peer."

I tried using ray.put to place the data in the Ray object store by adding the following lines to the code:


    ray.put(trainX[feature_names])
    ray.put(validX[feature_names])
    ray.put(trainY)
    ray.put(validY)
    ray.put(tr_groups)
    ray.put(val_groups)

Now I’m getting the error “ValueError: The actor ImplicitFunc is too large (795 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB)”.

Can someone please help me resolve this problem?

Hmm, you shouldn’t need to directly call ray.put as tune.with_parameters should implicitly handle this for you.

I believe the original error you were seeing is a result of the same issue (the serialized function is too large). From the provided script, I can’t see anything obvious that would be large, but I might be overlooking something.

As a quick test, can you share what the output of this is?

from ray import cloudpickle as pickle

pickled = pickle.dumps(xgboost_hyper_param)
length_mib = len(pickled) // (1024 * 1024)
print(length_mib)

It is 405 MiB. The custom metric function passed to XGBoost referenced a global variable (a pd.DataFrame), and that is what made the pickled function so large. Thanks for the help.
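If it is not obvious which global is responsible, a rough way to narrow it down is to list the module-level names a function refers to and measure how large each one pickles. The sketch below uses only the standard library; `referenced_global_sizes`, `BIG_TABLE`, and `metric_using_global` are hypothetical names made up for illustration. It checks direct references only, so a global used by a helper that the function calls needs its own check.

```python
import pickle

def referenced_global_sizes(fn):
    """Map each module-level name used directly by fn to its pickled size in bytes."""
    sizes = {}
    for name in fn.__code__.co_names:
        if name not in fn.__globals__:
            continue  # builtins and attribute names are not module-level globals
        try:
            sizes[name] = len(pickle.dumps(fn.__globals__[name]))
        except Exception:
            sizes[name] = None  # e.g. modules cannot be pickled by stdlib pickle
    return sizes

# Hypothetical stand-in for a large DataFrame held in a global.
BIG_TABLE = list(range(100_000))

def metric_using_global(preds):
    # Reads BIG_TABLE from module scope, so pickling this function by value
    # drags the whole table along with it.
    return len(BIG_TABLE) + len(preds)

sizes = referenced_global_sizes(metric_using_global)
for name, size in sizes.items():
    print(name, size)
```

Running the same helper on the trainable and on every function it calls (the custom metric included) should point at the oversized object.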

Nice! Were you able to move/unlink the global variable and resolve the original issue?

Yes, I had to use partials for the custom metric.
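For anyone who lands here with the same problem, here is a minimal sketch of what the partial approach can look like. The original sharpe_metric was not posted, so the metric body, the `returns` argument, and the `make_metric` helper are all hypothetical; the only point is that the data reaches the metric as an argument instead of being read from a module-level global.

```python
from functools import partial
from statistics import mean, pstdev

def sharpe_metric(preds, dtrain, returns):
    """Custom metric in the (preds, dtrain) -> (name, value) shape.

    `returns` is supplied through functools.partial rather than a global.
    The scoring rule is illustrative only: go long when the prediction is
    positive, short otherwise, then take mean over standard deviation.
    """
    pnl = [r if p > 0 else -r for p, r in zip(preds, returns)]
    spread = pstdev(pnl)
    score = mean(pnl) / spread if spread > 0 else 0.0
    return "sharpe", score

def make_metric(returns):
    # Bind the data explicitly; the result takes (preds, dtrain) like any
    # other custom metric.
    return partial(sharpe_metric, returns=returns)

# Small demonstration with plain lists; dtrain is unused in this sketch.
metric = make_metric([0.01, -0.02, 0.03])
name, value = metric([1.0, -1.0, 1.0], None)
print(name, value)
```

Building the partial inside the trainable, from the data that tune.with_parameters hands over, keeps the pickled trainable small because the function no longer references the DataFrame at module scope.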