Ray Tune changes the behaviour of my train function

Hi,

When I run my training function directly, everything works perfectly and I get the desired behaviour. When I pass the same training function through Ray Tune, I get the following error:

(pid=55480)   File "/Users/paulvalsecchi/PycharmProjects/pythonProject/NCDE GAN code/Solver.py", line 160, in I
(pid=55480)     du_x = x.grad[:, ::step, :]
(pid=55480) TypeError: 'NoneType' object is not subscriptable

This is strange, because when I just run my train function, x.grad is a tensor that I can subscript exactly the way I have done here.

I am using a variety of packages, including signatory, which I suspect might interfere with Ray Tune, but I don’t understand why I get the desired result when I run train(config) instead of

analysis = tune.run(
    train,
    num_samples=200,
    scheduler=ASHAScheduler(metric="Loss", mode="min", grace_period=10, max_t=200, reduction_factor=4),
    config=config,
    verbose=2)

which gives me the error shown above.
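For what it’s worth, even a single trial with the exact config that works manually, something like the sketch below, should isolate whether the scheduler or the sampling plays any role here.

from ray import tune

# Sanity check: one trial, no scheduler, fixed config. If this already fails,
# the difference comes from how Tune executes the function (in a separate
# worker process), not from the sampled hyperparameter values.
analysis = tune.run(train, config=config, num_samples=1, verbose=2)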

Any help would be greatly appreciated.

I faced a similar issue in the past; for me it was caused by the order of the arguments in the loss method. Could you share a simplified body of your loss function?

Indeed, the error occurs in the loss function. The loss function is fairly long, but the part that I believe is causing the error is:

def I(y_output_u, y_output_v, xv, yv, tv, x, y, t):
    y_output_u.retain_grad()
    # Backpropagate a tensor of ones through the network output so that the
    # gradients with respect to the inputs are populated.
    y_output_u.backward(torch.ones_like(y_output_u), retain_graph=True)
    du_x = x.grad[:, ::step, :]  # under Ray Tune, x.grad is None here

y_output_u is the output of the net, which takes in a transformed version of x. The transformation of x happens outside the train function, but the gradients should still work, no?
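(For context, my understanding is that .grad is only populated for leaf tensors with requires_grad=True, and only if the tensor is actually part of the graph that produced y_output_u, so a quick diagnostic right before the backward call should show whether x loses one of those properties when Tune runs the function. A minimal sketch:)

# Diagnostic only: why might x.grad come back as None?
print(x.is_leaf, x.requires_grad, x.grad_fn)
if not x.is_leaf:
    x.retain_grad()  # non-leaf tensors keep .grad only after retain_grad()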

I believe so. I would suggest checking that all the parameters you receive in the loss function are exactly what you expect; in my case, I had the hyperparameters and the configuration mixed up in the order of the kwargs.

I am fairly certain that that is not the case. I have gone through my code once more to check, but as I mentioned above, I get the correct behaviour when I run train(config).

It’s hard to debug this without proper context. Can you share the part of your code where you use the config argument?

@kai I use the config argument as follows:

    n1 = config['n1']
    n2 = config['n2']

    u_net = NeuralRDE(
        3, logsig_dim, config['u_hidden_dim'], 1,
        hidden_hidden_dim=config['u_hidden_hidden_dim'],
        num_layers=config['u_layers'],
        return_sequences=True,
    ).to(device)
    v_net = discriminator(config).to(device)

    optimizer_u = torch.optim.Adam(u_net.parameters(), lr=config['u_rate'])
    optimizer_v = torch.optim.Adam(v_net.parameters(), lr=config['v_rate'])

I have partially resolved the issue by moving some of the functions that train calls inside it. This fixes the error, but I do not understand why that should make a difference.

Yes, that’s odd; it shouldn’t interfere with the function. One thing you might want to check is whether the printed config argument differs when Tune runs it. Subtle differences could be NumPy arrays instead of lists, different shapes, or iterable dicts.
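Something like this at the top of the trainable makes the comparison easy (just a sketch, keep your own keys):

def train(config):
    # Compare this output between a manual train(config) call and a Tune run.
    # Watch for type differences, e.g. numpy.float64 vs float, arrays vs lists.
    for key, value in config.items():
        print(key, type(value), value)
    # ... rest of the training code unchanged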

I’m still lacking context to evaluate this. Can you share your Tune search space (the config you pass to tune.run()) and a manual config that works when you pass it to your training function?

Or, if possible, your complete training code? A stripped-down version would also work.
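For example, something along these lines (illustrative values only, using the keys from your snippet) is usually enough to spot a mismatch:

from ray import tune

# Hypothetical search space -- replace with your real one.
config = {
    'n1': tune.choice([64, 128]),
    'n2': tune.choice([64, 128]),
    'u_hidden_dim': tune.choice([16, 32]),
    'u_hidden_hidden_dim': tune.choice([16, 32]),
    'u_layers': tune.choice([2, 3]),
    'u_rate': tune.loguniform(1e-4, 1e-2),
    'v_rate': tune.loguniform(1e-4, 1e-2),
}

# Hypothetical manual config that works with a direct train(config) call.
manual_config = {'n1': 128, 'n2': 128, 'u_hidden_dim': 32,
                 'u_hidden_hidden_dim': 32, 'u_layers': 3,
                 'u_rate': 1e-3, 'v_rate': 1e-3}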