I am trying to run the full-parameter training template 04_finetuning_llms_with_deepspeed on a single machine with 4x A100 (80 GB) GPUs. I already made the necessary changes to load the model from a local path and store the outputs locally (roughly as sketched below); however, I ran into the following issue.
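For reference, the local-loading change is roughly this; the path and kwargs below are placeholders from my setup, not the template's defaults:

# Hypothetical sketch: load the Llama checkpoint and tokenizer from a local directory
# instead of pulling from the Hugging Face Hub. "/data/llama-2-7b-hf" is a path on my machine.
from transformers import AutoModelForCausalLM, AutoTokenizer

local_path = "/data/llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForCausalLM.from_pretrained(
    local_path,
    torch_dtype="auto",  # keep the checkpoint's native dtype
    use_cache=False,     # no KV cache needed during training
)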
./run_llama_ft.sh --size=7b [--as-test]
Failure # 1 (occurred at 2023-08-26_07-55-34)
ray::_Inner.train() (pid=133909, ip=10.14.0.6, actor_id=f1fc079c10f51864ae2f3ac601000000, repr=TorchTrainer)
File "/home/sadra/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 394, in train
raise skipped from exception_cause(skipped)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(AssertionError): ray::_RayTrainWorker__execute.get_next() (pid=136753, ip=10.14.0.6, actor_id=1435c3fad5191d9a19669de301000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f06a4575030>)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "/home/sadra/ray/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py", line 307, in training_function
outputs = model(**batch)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1768, in forward
loss = self.module(*inputs, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 827, in forward
logits = self.lm_head(hidden_states)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = hook(self, args)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
This is what I see right before the error happens:
(pid=137497) Running: 0.0/96.0 CPU, 0.0/4.0 GPU, 3.56 MiB/1.86 GiB object_store_memory: 0%| | 20/36864 [00:00<04:52, ...]
2023-08-26 07:55:34,707 ERROR tune_controller.py:1507 -- Trial task failed for trial TorchTrainer_c16fe_00000
I tested various scenarios: DeepSpeed 0.8, 0.9.3, and 0.10, and Python 3.8 as well.
Torch is 2.0.1 and Transformers is 4.32.
I tested with both Ray 3.0.0.dev0 and 2.6; the errors were similar.
The CUDA versions of PyTorch and nvcc match. The machine has 900 GB of RAM, and I also tried disabling zero_load; the ZeRO-3 settings I have been toggling are sketched below.
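For context, this is roughly the shape of the ZeRO-3/offload config I have been adjusting, written as a Python dict; it is adapted from memory of the template's zero_3 config, so the exact keys and values in your copy may differ:

# Hypothetical sketch of the DeepSpeed ZeRO-3 settings I toggled between runs.
# Changing the offload "device" entries from "cpu" to "none" is how I disabled offload.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": 1,
}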
Can someone give me ideas on how to fix this?