I am trying to tune hyperparameters for my LLM with Hugging Face `Trainer.hyperparameter_search()` together with the Ray Tune backend. The model has 13B parameters, so it is impossible to run trials without sharding the optimizer state across multiple GPUs (8 GPUs per trial in my case, max concurrent trials = 1). My setup looks roughly like the sketch below.
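This is a minimal sketch, not my full script: the model name, dataset loading (`load_my_datasets()`), output dir, and resource numbers are placeholders, and my understanding is that the extra kwargs (`resources_per_trial`, `max_concurrent_trials`) are forwarded to `ray.tune.run`:

```python
from ray import tune
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def model_init():
    # fresh model per trial; "my-13b-model" is a placeholder for my checkpoint
    return AutoModelForCausalLM.from_pretrained("my-13b-model")

def hp_space(trial):
    # search space expressed with Ray Tune sampling primitives
    return {
        "learning_rate": tune.loguniform(1e-6, 1e-4),
        "per_device_train_batch_size": tune.choice([1, 2]),
    }

# placeholder for my tokenized train/eval datasets
train_dataset, eval_dataset = load_my_datasets()

training_args = TrainingArguments(
    output_dir="hp_search_out",
    per_device_train_batch_size=1,
)

trainer = Trainer(
    model_init=model_init,   # required for hyperparameter_search
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="ray",
    n_trials=8,
    direction="minimize",
    # one trial at a time, all 8 GPUs assigned to that trial
    resources_per_trial={"cpu": 16, "gpu": 8},
    max_concurrent_trials=1,
)
```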
I tried adding a DeepSpeed ZeRO-2 config to the `Trainer`, but without success: every trial crashes with the error `module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu`. I guess this error originates from how Ray starts trials. I also tried launching the tuning run with `deepspeed/accelerate [args] tune_script.py` instead of `python tune_script.py`.
Is it possible to tune hyperparameters with HF Trainer + Ray + DeepSpeed? Or is there another way, e.g. using FSDP?