Error Encountered While Training Generative AI Model in Aviary

Hi everyone,

I’ve been diving into the world of generative AI within the Aviary framework, and I’ve hit a bit of a roadblock.
I have also looked at this guide: GPT-J-6B Fine-Tuning with Ray Train and DeepSpeed — Ray 2.22.0.
While training a generative AI model during my generative AI training course sessions, I keep encountering a persistent and puzzling error. Whenever the model reaches a certain stage of training, it abruptly stops and throws an error message that reads, “Error: Unable to converge due to gradient vanishing problem.”

I’ve tried adjusting various hyperparameters, tinkering with the architecture, and even modifying the dataset preprocessing steps, but nothing seems to resolve this issue. I’ve also checked for any anomalies in the dataset, but everything appears to be in order.

Has anyone else encountered a similar error while working with generative AI models in Aviary?

Thanks in advance.


Hard to debug without knowing more. Can you share your Ray setup and any relevant repro script? Information on which model you’re attempting to train and what data format you’re training on would also help.
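
In case it helps with putting together a repro, here is a rough, framework-free sketch of what a vanishing gradient looks like and how one might flag it. It is plain Python, not Ray or Aviary code, and it assumes a toy scalar chain of sigmoid units rather than your actual model; the names `chain_gradients` and `vanished_layers` and the `1e-6` threshold are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradients(x: float, weights: list[float]) -> list[float]:
    """Gradient of the final output w.r.t. the input of each layer.

    Toy model (an assumption for illustration): a scalar chain where
    a[i + 1] = sigmoid(weights[i] * a[i]).
    """
    # Forward pass: record every activation.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: multiply local derivatives from the output back to the input.
    grads = [0.0] * len(weights)
    upstream = 1.0
    for i in reversed(range(len(weights))):
        out = activations[i + 1]
        local = out * (1.0 - out) * weights[i]  # d a[i+1] / d a[i]
        upstream *= local
        grads[i] = upstream
    return grads


def vanished_layers(grads: list[float], threshold: float = 1e-6) -> list[int]:
    """Indices of layers whose gradient magnitude falls below the threshold."""
    return [i for i, g in enumerate(grads) if abs(g) < threshold]


if __name__ == "__main__":
    grads = chain_gradients(x=0.5, weights=[1.0] * 20)
    for i, g in enumerate(grads):
        print(f"layer {i:2d}: |grad| = {abs(g):.3e}")
    print("layers below threshold:", vanished_layers(grads))
```

Each sigmoid contributes a local derivative of at most 0.25, so the gradient reaching the earliest layers shrinks geometrically with depth. If your training loop is PyTorch-based, the analogous check would be logging the gradient norm of each parameter after the backward pass; including that output in the repro would make it much easier to see where things go wrong.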

Thanks for helping @Sam_Chan