Hi.
I’ve been trying to follow the sample code for fine-tuning GPT-J-6B.
I’m running on a single node for now with 32 CPUs and two GPUs, and have therefore reduced various parameters to prevent OOM.
However, training fails because a ValueError is raised inside train_func:
File "/tmp/ipykernel_49334/1469602992.py", line 122, in train_func
File "/home/novelty/miniconda3/envs/ray210/lib/python3.10/site-packages/transformers/trainer.py", line 514, in __init__
raise ValueError("train_dataset does not implement __len__, max_steps has to be specified")
ValueError: train_dataset does not implement __len__, max_steps has to be specified
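From the error message I gather that the HF Trainer refuses an iterable-only dataset (one without __len__) unless max_steps is given explicitly. A minimal reproduction of that guard in plain Python, to check my understanding (names are my own, not the transformers internals):

```python
class StreamingDataset:
    """Iterable-only dataset, like the streamed one Ray hands to the Trainer."""
    def __iter__(self):
        yield from range(10)

def check_dataset(train_dataset, max_steps=-1):
    # Paraphrase of the guard in transformers.Trainer.__init__:
    # a dataset without __len__ requires an explicit max_steps.
    if not hasattr(train_dataset, "__len__") and max_steps <= 0:
        raise ValueError(
            "train_dataset does not implement __len__, max_steps has to be specified"
        )

check_dataset(StreamingDataset(), max_steps=100)  # passes
# check_dataset(StreamingDataset())  # raises the ValueError I'm seeing
```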
My notebook, with the full error trace, can be found here.
When I run it, both system RAM and GPU VRAM fill up and the CPUs/GPUs are clearly busy, but then it fails.
Is there an incompatibility between TorchTrainer and HF datasets?
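If my reading of the error is right, the workaround would be to pass an explicit max_steps in TrainingArguments instead of relying on an epoch count. A sketch of what I’m about to try (the values are placeholders, not tuned):

```python
from transformers import TrainingArguments

# Sketch only: since the streamed dataset has no __len__, give the Trainer
# an explicit stopping point via max_steps. Step count is a placeholder.
training_args = TrainingArguments(
    output_dir="output",
    max_steps=1000,                  # required when len(train_dataset) is unavailable
    per_device_train_batch_size=1,   # kept small to avoid OOM on my two GPUs
)
```

Is that the intended way to use a Ray dataset with the HF Trainer, or am I missing something?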
Any solutions?
BR
Jorgen