Tune config variables don't match the types expected by the training function

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

I'm running into an issue where the Tune config variables are not resolved before being passed to the training function; they show up as ray.tune.search.sample.Categorical objects:

ValueError: batch_size should be a positive integer value, but got batch_size=<ray.tune.search.sample.Categorical object at 0x0000023E4A0A2580>

Hi @nwahba,

Could you provide the param_space you are passing into the Tuner?

def main(num_samples=10, max_num_epochs=10, gpus_per_trial=0):
    data_dir = os.path.abspath("./state-10")
    config = {
        'encoder': [501,1024,512,64,32,8,8],
        'decoder': [8,16,32,64,128,501],
        'loss1': tune.grid_search([0.001, 0.01, 0.1, 1.0]),
        'loss2': tune.grid_search([0.001, 0.01, 0.1, 1.0]),
        'loss3': tune.grid_search([0.001, 0.01, 0.1, 1.0]),
        'lr': tune.loguniform(1e-4, 1e-1),
        'batch_size': tune.choice([2, 4, 8, 16]),
        'ts': tune.choice([2, 4, 5, 10]),
        'IsOld': 'N'
    }
    trainer = train_PMV_01(config, checkpoint_dir=None, data_dir=data_dir)
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        metric_columns=["loss", "training_iteration"])
    results = tune.run(
        train_PMV_01, 
        config=config,
        fail_fast="raise",
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)


    best_trial = results.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    best_trained_model = K_autoencoder_01(best_trial.config['encoder'], best_trial.config['decoder'])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint_dir = best_trial.checkpoint.value
    model_state, optimizer_state = torch.load(os.path.join(
        best_checkpoint_dir, "checkpoint"))
    best_trained_model.load_state_dict(model_state)

Thanks for responding. The config parameters are passed into train_PMV_01 as follows:

def train_PMV_01(config, checkpoint_dir=None, data_dir=None):
    model = K_autoencoder_01(config['encoder'], config['decoder'])
    model.train()
    device = "cpu"
    model.to(device)
    loss_function = loss(config['loss1'], config['loss2'], config['loss3'], config['ts'])
    train_loader, val_loader, test_loader = get_dataloaders(data_dir, config['batch_size'])
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    
    if checkpoint_dir:
        model_state, optimizer_state = torch.load(os.path.join(checkpoint_dir, "checkpoint"))
        model.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    for epoch in range(10):
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(train_loader):
            X = data
            X = X.to(device)
            model.zero_grad()

            Loss = loss_function(model, X)
            Loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 10.)
            optimizer.step()

            running_loss += Loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0
        val_loss = 0.0
        val_steps = 0
        for i, data in enumerate(val_loader):
            with torch.no_grad():
                # Validation should only evaluate the loss; no backward pass
                # or optimizer step inside torch.no_grad().
                X = data
                X = X.to(device)
                Loss = loss_function(model, X)
                val_loss += Loss.item()
                val_steps += 1
                
        with tune.checkpoint_dir(epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save((model.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps))
    print("Finished Training")

I receive this error when I try to run the main function:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
RayTaskError(TypeError): ray::ImplicitFunc.train() (pid=14256, ip=127.0.0.1, repr=train_PMV_01)

In addition, if I run the training function directly before calling tune.run, as follows:
trainer = train_PMV_01(config, checkpoint_dir=None, data_dir=data_dir)
I get this error for the batch size:
ValueError: batch_size should be a positive integer value, but got batch_size=<ray.tune.search.sample.Categorical object at 0x0000023E4A1CCD90>

Hi @nwahba,

It seems like the problem is that checkpoint_dir and data_dir are not specified. You can have them passed in as keyword arguments by using functools.partial or tune.with_parameters. You can also just pass these strings through the config.

from functools import partial

analysis = tune.run(
    partial(
        train_PMV_01,
        checkpoint_dir="<your-checkpoint-dir>",
        data_dir="<your-data-dir>"
    ),
    # ...
)
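
tune.with_parameters works the same way; as a bonus, any large objects passed this way are put into the Ray object store once rather than serialized together with the function. A sketch with the same placeholder paths as above:

analysis = tune.run(
    tune.with_parameters(
        train_PMV_01,
        checkpoint_dir="<your-checkpoint-dir>",
        data_dir="<your-data-dir>"
    ),
    # ...
)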

Also, the added line trainer = train_PMV_01(config, checkpoint_dir=None, data_dir=data_dir) will not work, since the config is a search space that needs to be resolved by Tune to generate the trials of the grid search. Calling the function with this search space directly passes the raw search-space objects, such as the ray.tune.search.sample.Categorical you are seeing, into the function.
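
If you want to smoke-test train_PMV_01 outside of Tune, pass a concrete config with plain Python values instead. A minimal sketch (the specific values are arbitrary picks from your search space, and data_dir is assumed to already be defined):

debug_config = {
    'encoder': [501, 1024, 512, 64, 32, 8, 8],
    'decoder': [8, 16, 32, 64, 128, 501],
    'loss1': 0.01,       # one value picked from the grid
    'loss2': 0.01,
    'loss3': 0.01,
    'lr': 1e-3,
    'batch_size': 4,
    'ts': 5,
    'IsOld': 'N',
}
train_PMV_01(debug_config, checkpoint_dir=None, data_dir=data_dir)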

Thank you very much, Justin, for mentioning this. I applied this and passed the data_dir path of the created training dataset, but I received this error:

_pickle.PicklingError: Can't pickle <class '__main__.CFD_Dataset'>: attribute lookup CFD_Dataset on __main__ failed
(func pid=25628) Traceback (most recent call last):
(func pid=25628)   File "<string>", line 1, in <module>
(func pid=25628)   File "C:\Users\WAHBAN\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
(func pid=25628)     exitcode = _main(fd, parent_sentinel)
(func pid=25628)   File "C:\Users\WAHBAN\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
(func pid=25628)     self = reduction.pickle.load(from_parent)
(func pid=25628) EOFError: Ran out of input

Is this error relevant to the data_dir used for this class function?

class CFD_Dataset(torch.utils.data.Dataset):
    def __init__(self, X):
        self.X = X
    def __len__(self):
        return self.X.shape[0]
    def __getitem__(self, index):
        xx = self.X[index]
        print(type(xx))
        print('index is', index)
        return xx

def get_loaders(X, batch_size):
    data_npy = CFD_Dataset(X)
    loader = torch.utils.data.DataLoader(data_npy, batch_size, shuffle=False, num_workers=2, drop_last=False)
    return loader

def get_dataloaders(path_01, batch_size):
    X_train, X_val, X_test = data_torch(path_01)
    train_loader = get_loaders(X_train, batch_size)
    val_loader = get_loaders(X_val, batch_size)
    test_loader = get_loaders(X_test, batch_size)
    return train_loader, val_loader, test_loader

This error looks to be caused by the CFD_Dataset class definition. Where are you defining that class?

Thanks for your response. I defined the CFD_Dataset in two different ways:

  • In the same notebook where I ran Ray Tune; that produced the error I shared above.

  • In a separate file that I then imported into the script; that produced this error:

RuntimeError: The actor with name ImplicitFunc failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

Traceback (most recent call last):
  File "C:\Users\WAHBAN\Anaconda3\lib\site-packages\ray\_private\function_manager.py", line 625, in _load_actor_class_from_gcs
    actor_class = pickle.loads(pickled_class)
ModuleNotFoundError: No module named 'Data_Loader_05'

The issue with the second approach is expected, as Ray workers cannot import modules that are not on their PYTHONPATH (see the thread "TorchTrain fails if train_func imports functions from a different file" for more details).
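
One way around this is to ship your working directory to the workers via a runtime environment when initializing Ray. A sketch, assuming Data_Loader_05.py lives in the directory you launch from:

import ray

# Upload the current directory (including Data_Loader_05.py) to every worker
# so that `import Data_Loader_05` succeeds on the worker side.
ray.init(runtime_env={"working_dir": "."})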

Would it be possible for you to provide a simple reproducible example that we can run on our side without external dependencies?

Thanks, will take a look and get back to you!

@nwahba I have tried running your notebook and received this error:

ray.exceptions.RayTaskError(RuntimeError): ray::ImplicitFunc.train() (pid=618901, ip=172.31.43.110, repr=train_PMV_01)
  File "/home/ubuntu/ray/python/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/ubuntu/ray/python/ray/tune/trainable/function_trainable.py", line 335, in entrypoint
    return self._trainable_func(
  File "/home/ubuntu/ray/python/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
    output = fn()
  File "/home/ubuntu/ray/python/ray/tune/trainable/util.py", line 386, in inner
    return trainable(config, **fn_kwargs)
  File "Untitled-1.py", line 297, in train_PMV_01
    Loss = loss_function(model,X)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "Untitled-1.py", line 256, in forward
    de_en_x =de(en_x)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "Untitled-1.py", line 212, in forward
    x = self.nonlin(x)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/ray/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (10x509 and 8x8)
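
The error means a batch of 10 samples with 509 features reached a linear layer whose weight is 8x8, i.e. one that expects 8 input features, so it's worth checking that the encoder/decoder layer widths line up with the tensors flowing through the model. A quick hedged check (assuming train_loader comes from your get_dataloaders and yields 2-D float tensors):

X = next(iter(train_loader))
# A batch should be (batch_size, n_features); n_features must match the
# in_features of the first layer it is fed into.
print(X.shape)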

If it's helpful, we could also schedule a quick call so you can walk me through your setup!

Yes, it would be great to schedule a quick call.

loader = torch.utils.data.DataLoader(data_npy, batch_size, shuffle=False, num_workers=0, drop_last=False)

Changing num_workers from 2 to 0 resolved the Ray Tune error: with num_workers > 0 on Windows, the DataLoader spawns worker processes that must pickle the dataset, which fails for a class defined in the notebook's __main__ (matching the PicklingError above). Thank you for the help.