Hi.
I’ve been trying to follow the sample code for fine-tuning GPT-J-6B.
I’m running on a single node for now with 32 CPUs and two GPUs, and have therefore reduced various parameters to prevent OOM.
However, training fails because a ValueError is raised inside train_func:
File "/tmp/ipykernel_49334/1469602992.py", line 122, in train_func
File "/home/novelty/miniconda3/envs/ray210/lib/python3.10/site-packages/transformers/trainer.py", line 514, in __init__
raise ValueError("train_dataset does not implement __len__, max_steps has to be specified")
ValueError: train_dataset does not implement __len__, max_steps has to be specified
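From the error message I gather that the HF Trainer refuses an iterable-only dataset (one without __len__) unless max_steps is given explicitly. A minimal reproduction of that guard in plain Python, to check my understanding (names are my own, not the transformers internals):

```python
class StreamingDataset:
    """Iterable-only dataset, like the streamed one Ray hands to the Trainer."""
    def __iter__(self):
        yield from range(10)

def check_dataset(train_dataset, max_steps=-1):
    # Paraphrase of the guard in transformers.Trainer.__init__:
    # a dataset without __len__ requires an explicit max_steps.
    if not hasattr(train_dataset, "__len__") and max_steps <= 0:
        raise ValueError(
            "train_dataset does not implement __len__, max_steps has to be specified"
        )

check_dataset(StreamingDataset(), max_steps=100)  # passes
# check_dataset(StreamingDataset())  # raises the ValueError I'm seeing
```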
My notebook, with the full error trace, can be found here.
When I run it, both system RAM and GPU VRAM fill up and the CPUs/GPUs are clearly busy, but then it fails.
Is there an incompatibility between TorchTrainer and HF datasets?
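If my reading of the error is right, the workaround would be to pass an explicit max_steps in TrainingArguments instead of relying on an epoch count. A sketch of what I’m about to try (the values are placeholders, not tuned):

```python
from transformers import TrainingArguments

# Sketch only: since the streamed dataset has no __len__, give the Trainer
# an explicit stopping point via max_steps. Step count is a placeholder.
training_args = TrainingArguments(
    output_dir="output",
    max_steps=1000,                  # required when len(train_dataset) is unavailable
    per_device_train_batch_size=1,   # kept small to avoid OOM on my two GPUs
)
```

Is that the intended way to use a Ray dataset with the HF Trainer, or am I missing something?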
Any solutions?
BR
Jorgen