Issue with the 04 fine-tuning LLMs template

I am trying to run the full-parameter training template 04_finetuning_llms_with_deepspeed
on a single machine with 4x A100 (80 GB) GPUs. I have already made the necessary changes to load the model from local storage and save outputs locally; however, I ran into the issue shown below.
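For reference, a rough sketch of the kind of local-loading change I made (the path is a placeholder; this is not the template's exact code):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder path to a local copy of the checkpoint, used instead of a Hugging Face Hub model ID.
    LOCAL_MODEL_PATH = "/mnt/local_storage/llama-2-7b-hf"

    tokenizer = AutoTokenizer.from_pretrained(LOCAL_MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        LOCAL_MODEL_PATH,
        use_cache=False,  # caching is usually disabled during training
    )

With that change in place, running the template fails: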

./run_llama_ft.sh --size=7b [--as-test]
Failure # 1 (occurred at 2023-08-26_07-55-34)
ray::_Inner.train() (pid=133909, ip=10.14.0.6, actor_id=f1fc079c10f51864ae2f3ac601000000, repr=TorchTrainer)
  File "/home/sadra/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 394, in train
    raise skipped from exception_cause(skipped)
  File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(AssertionError): ray::_RayTrainWorker__execute.get_next() (pid=136753, ip=10.14.0.6, actor_id=1435c3fad5191d9a19669de301000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f06a4575030>)
  File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/sadra/ray/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py", line 307, in training_function
    outputs = model(**batch)
  File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1768, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 827, in forward
    logits = self.lm_head(hidden_states)
  File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}

This is what I see right before the error happens:

(pid=137497) Running: 0.0/96.0 CPU, 0.0/4.0 GPU, 3.56 MiB/1.86 GiB object_store_memory:   0%| | 20/36864 [00:00<04:52, ...]
2023-08-26 07:55:34,707 ERROR tune_controller.py:1507 -- Trial task failed for trial TorchTrainer_c16fe_00000

I tested various scenarios with DeepSpeed 0.8, 0.9.3, and 0.10, and with Python 3.8 as well.
torch is 2.0.1 and transformers is 4.32.
I tested with both Ray 3.0.0.dev0 and 2.6; the errors were similar.

The CUDA versions of PyTorch and nvcc match. The machine has 900 GB of RAM. I also tried disabling zero_load.
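For reference, these are the kinds of ZeRO knobs I was toggling while testing (a generic ZeRO-3 config sketch with illustrative values, not the template's exact config):

    # Generic DeepSpeed ZeRO-3 config of the kind I experimented with
    # (illustrative values, not the template's exact config).
    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "none"},      # set to "cpu" to offload parameters to host RAM
            "offload_optimizer": {"device": "none"},  # set to "cpu" to offload optimizer state
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "bf16": {"enabled": True},
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,
    }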

Can someone give me ideas on how to fix this?

@kourosh can you take a look?

Same error on our side with transformers>=4.32.0 and Ray Torch Trainer.

    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 494, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)

It looks like it is related to this issue.

Hey @Chongxiao_Cao

Here are the dependencies I used that worked:

torch==2.0.0
torchvision==0.15.1
torchaudio==2.0.1
git+https://github.com/huggingface/transformers.git@d0c1aeb
deepspeed==0.10.0
fairscale==0.4.13
peft==0.5.0
datasets==2.14.4
accelerate==0.21.0
evaluate==0.4.0
bitsandbytes==0.41.1
wandb==0.15.8
pytorch-lightning==2.0.6
protobuf<3.21.0
torchmetrics==1.0.3
lm_eval==0.3.0
tiktoken==0.1.2
sentencepiece==0.1.99
urllib3<1.27
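If it helps, I just save this list as a requirements.txt and install it into a fresh environment with pip install -r requirements.txt. Note that transformers is pinned to a specific commit rather than a 4.32.x release.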