GPT-J-6B Sample Code

Hi.

I’ve been trying to follow the sample code for fine-tuning GPT-J-6B.

I’m running on a single node for now, with 32 CPUs and two GPUs, and have therefore reduced various parameters to prevent OOM.
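For reference, the scaling side of my setup looks roughly like this (a sketch with illustrative names and values, not my exact notebook code):

```python
from ray.air import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config=None):
    # placeholder for the actual training loop from the sample code,
    # which builds the tokenizer, model and HF Trainer per worker
    pass


# Single node, 32 CPUs, 2 GPUs: one training worker per GPU
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"CPU": 15, "GPU": 1},  # leave a couple of CPUs for the driver
)

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,
)
result = trainer.fit()  # needs a Ray runtime with two GPUs available
```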

However, it seems to fail because of a ValueError being raised in the train_func:

 File "/tmp/ipykernel_49334/1469602992.py", line 122, in train_func
  File "/home/novelty/miniconda3/envs/ray210/lib/python3.10/site-packages/transformers/trainer.py", line 514, in __init__
    raise ValueError("train_dataset does not implement __len__, max_steps has to be specified")
ValueError: train_dataset does not implement __len__, max_steps has to be specified
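If I read the error correctly, the HF Trainer raises this in `__init__` whenever the training dataset has no `__len__` (e.g. a streamed/iterable dataset, which I assume is what the Ray integration hands to the Trainer) and `max_steps` is left at its default of -1. A minimal sketch that reproduces the check with plain transformers (model name and values are only illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments


class StreamedDataset(torch.utils.data.IterableDataset):
    """Toy iterable dataset with no __len__, like a streamed dataset."""
    def __iter__(self):
        for _ in range(16):
            yield {"input_ids": torch.zeros(8, dtype=torch.long),
                   "labels": torch.zeros(8, dtype=torch.long)}


model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in for GPT-J-6B

try:
    # max_steps defaults to -1, so this reproduces the ValueError from the trace
    Trainer(model=model,
            args=TrainingArguments(output_dir="out", per_device_train_batch_size=1),
            train_dataset=StreamedDataset())
except ValueError as e:
    print(e)  # train_dataset does not implement __len__, max_steps has to be specified

# With max_steps set explicitly, the same construction goes through
Trainer(model=model,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=1, max_steps=16),
        train_dataset=StreamedDataset())
```

Presumably setting `max_steps` in the TrainingArguments inside train_func would get past this check, but I’m not sure that’s the intended fix for the sample code.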

My notebook can be found here with the full error trace.

When I try to run it, both the system RAM and the GPUs’ VRAM load up and the CPUs/GPUs are clearly working, but then it fails with the error above.

Is there an issue between the TorchTrainer and HF datasets?

Any solutions?

BR

Jorgen