Config tune variables aren't matching the training types

nwahba · November 28, 2022, 7:23am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I have this issue as the config tune variables are not passed through the training function, as they are printed as ray.tune.sample. Categorical object

ValueError: batch_size should be a positive integer value, but got batch_size=<ray.tune.search.sample.Categorical object at 0x0000023E4A0A2580>

justinvyu · November 28, 2022, 9:06pm

Hi @nwahba,

Could you provide the param_space you are passing into the Tuner?

nwahba · November 28, 2022, 11:40pm

def main(num_samples=10, max_num_epochs=10):
    data_dir = os.path.abspath("./state-10")
    config = {
        'encoder': [501,1024,512,64,32,8,8],
        'decoder': [8,16,32,64,128,501],
        'loss1': tune.grid_search([0.001, 0.01, 0.1, 1.0]),
        'loss2': tune.grid_search([0.001, 0.01, 0.1, 1.0]),
        'loss3': tune.grid_search([0.001, 0.01, 0.1, 1.0]),
        'lr': tune.loguniform(1e-4, 1e-1),
        'batch_size': tune.choice([2, 4, 8, 16]),
        'ts':  tune.choice([2, 4, 5, 10]),
        'IsOld':'N'
    }
    trainer = train_PMV_01(config,checkpoint_dir=None, data_dir=data_dir)
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        metric_columns=["loss", "training_iteration"])
    results = tune.run(
        train_PMV_01, 
        config=config,
        fail_fast="raise",
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)


    best_trial = results.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    best_trained_model = model(best_trial.config['encoder'], best_trial.config['decoder'])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint_dir = best_trial.checkpoint.value
    model_state, optimizer_state = torch.load(os.path.join(
        best_checkpoint_dir, "checkpoint"))
    best_trained_model.load_state_dict(model_state)

Thanks for responding. The config parameters are passed through the train_PMV_01 as the following:

def train_PMV_01(config,checkpoint_dir=None, data_dir=None):
    model = K_autoencoder_01(config['encoder'],config['decoder'])
    model.train()
    device = "cpu"
    model.to(device)
    loss_function = loss(config['loss1'], config['loss2'], config['loss3'], config['ts'])
    train_loader, val_loader, test_loader = get_dataloaders(data_dir, config['batch_size'])
    optimizer = torch.optim.Adam (model.parameters(), lr=config['lr'])
    
    if checkpoint_dir:
        model_state, optimizer_state = torch.load(os.path.join(checkpoint_dir, "checkpoint"))
        model.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)
    
    
    
    for epoch in range(10):
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(train_loader):
            X=data
            X = X.to(device)
            model.zero_grad()

            Loss = loss_function(model,X)
            Loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 10.)
            optimizer.step()

            running_loss += Loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                            running_loss / epoch_steps))
                running_loss = 0.0
        val_loss = 0.0
        val_steps = 0 
        for i, data in enumerate(val_loader):
            with torch.no_grad():
                X=data
                X = X.to(device)
                Loss = loss_function(model,X)
                Loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 10.)
                optimizer.step()
                val_loss += Loss.item()
                val_steps += 1
                
        with tune.checkpoint_dir(epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save((model.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps))
    print("Finished Training")

I would receive this error when I try to run the main function.
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
RayTaskError(TypeError): ray::ImplicitFunc.train() (pid=14256, ip=127.0.0.1, repr=train_PMV_01)

nwahba · November 29, 2022, 12:04am

In addition, if I run the training function before running the tune.run function as the following:
trainer = train_PMV_01(config,checkpoint_dir=None, data_dir=data_dir)
I would get this error for the batch size:
ValueError: batch_size should be a positive integer value, but got batch_size=<ray.tune.search.sample.Categorical object at 0x0000023E4A1CCD90>get_dataloadersget_dataloaders

justinvyu · November 29, 2022, 12:12am

Hi @nwahba,

It seems like the problem is that checkpoint_dir and data_dir are not specified. You can set to them to be passed as keyword arguments by using functools.partial or tune.with_parameters. You can also just pass these strings through the config.

from functools import partial

analysis = tune.run(
    partial(
        train_PMV_01,
        checkpoint_dir="<your-checkpoint-dir>",
        data_dir="<your-data-dir>"
    ),
    # ...
)

Also, the added line trainer = train_PMV_01(config,checkpoint_dir=None, data_dir=data_dir) will not work, since the config is a search space that needs to be resolved by Tune to generate trials in the grid search. Trying to run the function with this search space will pass the search space objects such as the ray.tune.search.sample.Categorical that you are seeing.

nwahba · November 29, 2022, 1:55am

Thank you, Justin, very much for mentioning this. I have applied the call for the data_dir path for the created training dataset, but I received this error:

_pickle.PicklingError: Can't pickle <class '__main__.CFD_Dataset'>: attribute lookup CFD_Dataset on __main__ failed
(func pid=25628) Traceback (most recent call last):
(func pid=25628)   File "<string>", line 1, in <module>
(func pid=25628)   File "C:\Users\WAHBAN\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
(func pid=25628)     exitcode = _main(fd, parent_sentinel)
(func pid=25628)   File "C:\Users\WAHBAN\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
(func pid=25628)     self = reduction.pickle.load(from_parent)
(func pid=25628) EOFError: Ran out of input

Is this error relevant to the data_dir used for this class function?

class CFD_Dataset(torch.utils.data.Dataset):
    def __init__(self,X):
        self.X = X
    def __len__(self):
        return self.X.shape[0]
    def __getitem__(self,index):
        xx = self.X[index]
        idx = index
        print(type(xx))
        print('index is',index)
        return xx

def get_loaders(X, batch_size):
    data_npy = CFD_Dataset(X)
    loader =torch.utils.data.DataLoader(data_npy, batch_size, shuffle=False, num_workers=2,drop_last=False)
    return loader

def get_dataloaders(path_01,batch_size):
    X_train, X_val, X_test = data_torch(path_01)
    train_loader = get_loaders(X_train, batch_size)
    val_loader = get_loaders(X_val, batch_size)
    test_loader = get_loaders(X_test, batch_size)
    return train_loader, val_loader, test_loader

Yard1 · November 29, 2022, 5:56pm

This error looks to be caused by the CFD_Dataset class definition. Where are you defining that class?

nwahba · November 29, 2022, 11:02pm

Thanks for your response. I defined the CFD_Dataset in two different ways:

In the same notebook where I ran the ray.tune and I received the error I shared.
Defined the class into a separate file and imported it into the script. Then I received this error:

RuntimeError: The actor with name ImplicitFunc failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

Traceback (most recent call last):
  File "C:\Users\WAHBAN\Anaconda3\lib\site-packages\ray\_private\function_manager.py", line 625, in _load_actor_class_from_gcs
    actor_class = pickle.loads(pickled_class)
ModuleNotFoundError: No module named 'Data_Loader_05'

Yard1 · November 29, 2022, 11:13pm

The issue with the second way is expected, as Ray cannot access modules that are not in PYTHONPATH (see TorchTrain fails if train_func imports functions from a different file for more details).

Would it be possible for you to provide a simple reproducible example that we can run on our side without external dependencies?

Yard1 · November 30, 2022, 10:21am

Thanks, will take a look and get back to you!

Yard1 · November 30, 2022, 6:07pm

@nwahba I have tried running your notebook and recieved this error:

ray.exceptions.RayTaskError(RuntimeError): ray::ImplicitFunc.train() (pid=618901, ip=172.31.43.110, repr=train_PMV_01)
  File "/home/ubuntu/ray/python/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/ubuntu/ray/python/ray/tune/trainable/function_trainable.py", line 335, in entrypoint
    return self._trainable_func(
  File "/home/ubuntu/ray/python/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
    output = fn()
  File "/home/ubuntu/ray/python/ray/tune/trainable/util.py", line 386, in inner
    return trainable(config, **fn_kwargs)
  File "Untitled-1.py", line 297, in train_PMV_01
    Loss = loss_function(model,X)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "Untitled-1.py", line 256, in forward
    de_en_x =de(en_x)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "Untitled-1.py", line 212, in forward
    x = self.nonlin(x)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (10x509 and 8x8)

If it’s helpful, we may also schedule a quick call so you can walk me through your setup!

nwahba · November 30, 2022, 10:15pm

It would be great yes for scheduling a quick call.

nwahba · December 5, 2022, 5:35am

loader =torch.utils.data.DataLoader(data_npy, batch_size, shuffle=False, num_workers=0,drop_last=False)
changing the num_workers=0 to 0 has changed the error relevant to the ray. tune error. Thank you for the help.

Topic		Replies	Views
TypeErrors when passing search spaces to trainable function Ray Tune	2	525	January 6, 2022
Error When Trying to Tune a Trainable Function	8	2584	August 29, 2023
Getting serialization error when using Ray Tune Ray Tune	9	2362	May 21, 2023
ValueError in tuner.fit() Ray Tune	2	554	June 1, 2023
What is the correct technique for incorporating access to large machine learning data sets in a Ray Tune Tuner() object?	0	70	December 2, 2023

Config tune variables aren't matching the training types

Related topics