Ray Tune 'RESOURCE_EXHAUSTED' error?

Please let me know if this is not an appropriate style of question. I also asked it on Stack Overflow, but I saw there are only 36 questions tagged with Ray there, so I thought here would be better.

I’m using Ray Tune for the first time, for hyperparameter optimisation of an LSTM model in TensorFlow. I wrote this code:

from ray import tune
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Masking, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau


def network(config, lstm_dim=12, dense_out=1):
    model = Sequential()
    model.add(Input(shape=(30,1280)))
    model.add(Masking(mask_value=0.))
    model.add(Bidirectional(LSTM(lstm_dim)))
    model.add(Dropout(config['dropout1']))
    model.add(Dense(config['dense1'], activation=config['activation1']))
    model.add(Dropout(config['dropout1']))
    model.add(Dense(config['dense2'], activation=config['activation1']))
    model.add(Dropout(config['dropout1']))
    model.add(Dense(dense_out, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                      optimizer=config['optimizer1'],
                      metrics=['accuracy'])

    return model


def cv(config):
    checkpointers = []
    scores = []
    histories = []
    fold_number = 0
    kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42) 

    for train_index, test_index in kfold.split(X_train,y_train):
        model = network(config)
        trainX, trainY, testX, testY = X_train[train_index], y_train[train_index], X_train[test_index], y_train[test_index]
        
        earlyStopping = EarlyStopping(monitor='val_loss', patience=100, verbose=1, mode='min')
        mcp_save = ModelCheckpoint('am_cnn_' + str(fold_number) + '.mdl_wts.hdf5', save_best_only=True, monitor='val_loss', mode='min')
        reduce_lr_loss = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=100, verbose=1, min_delta=config['min_delta1'], mode='min') 

        history = model.fit(trainX, trainY, 
                            epochs=500, 
                            batch_size=32, 
                            callbacks=[earlyStopping, mcp_save, reduce_lr_loss],
                            verbose=1,
                            validation_data=(testX, testY))
        
        _, acc = model.evaluate(testX, testY, verbose=1)
        print('> %.2f' % (acc * 100.0))
        tune.report(mean_loss=acc)

        scores.append(acc)
        histories.append(history)
        checkpointers.append(mcp_save)
        fold_number += 1

    return scores, histories

analysis = tune.run(
    cv,
    config = {
        "batch_size1": tune.grid_search([8, 16, 32, 64]), 
        #"dropout1": tune.grid_search([0.1, 0.2, 0.3, 0.4, 0.5]),
        "dense1": tune.grid_search([32, 64, 128, 256, 512]),
        #"dense2": tune.grid_search([32, 64, 128, 256, 512]),
        #"activation1":tune.choice(["relu","tanh"]),
        #"min_delta1":tune.grid_search([1e-5,1e-4,1e-3,1e-2,1e-1]),
        #"optimizer1":tune.choice(["adam","SGD"])
        }
)

print('best config', analysis.get_best_config(metric='mean_loss', mode='min'))

And I get this error:

    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.RESOURCE_EXHAUSTED
        details = "Sent message larger than max (558201755 vs. 536870912)"
        debug_error_string = "{"created":"@1642072157.225091248","description":"Sent message larger than max (558201755 vs. 536870912)","file":"src/core/ext/filters/message_size/message_size_filter.cc","file_line":265,"grpc_status":8}"

This is my first attempt at Ray Tune and I’m really struggling to understand how to implement it in my model. Could someone either explain the error to me or show me how I should be implementing it?

P.S. I know how to use the Keras Tuner; I specifically have to use Ray Tune and TensorFlow here.

Thanks.

Just to note, I get the same error if I use the tune.with_parameters() function described here as a potential solution for ‘big’ datasets, even though my dataset is <2,000 rows. It runs and finishes without error if I only run the first <200 rows of the dataset.
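For reference, this is roughly how I wrapped the trainable when I tried it (a minimal sketch, not my full script; the keyword arguments are just the X_train/y_train arrays defined above):

# Sketch of my tune.with_parameters() attempt: as I understand the docs, the
# arrays get stored once in the Ray object store and are handed to the
# trainable as keyword arguments, instead of being captured as globals.
def cv(config, X_train=None, y_train=None):
    # ... same StratifiedKFold / model.fit loop as above, using the
    # passed-in arrays instead of module-level globals ...
    ...

analysis = tune.run(
    tune.with_parameters(cv, X_train=X_train, y_train=y_train),
    config={
        "dense1": tune.grid_search([32, 64, 128, 256, 512]),
    }
)

(My understanding is that this should avoid serialising the data together with the function, but I still hit the same gRPC size limit.)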

Hey @SlowatKela! Can you say a little about your cluster setup? Are you using Ray on Kubernetes?