I am trying to tune hyperparameters for my LLM with Hugging Face `Trainer.hyperparameter_search()` together with the Ray Tune backend. The model has 13B parameters, so it is impossible to run trials without sharding the optimizer state across multiple GPUs (8 GPUs per trial in my case, max concurrent trials = 1). My setup looks roughly like the sketch below.
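This is a minimal sketch, not my full script: the model name, dataset loading (`load_my_datasets()`), output dir, and resource numbers are placeholders, and my understanding is that the extra kwargs (`resources_per_trial`, `max_concurrent_trials`) are forwarded to `ray.tune.run`:

```python
from ray import tune
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def model_init():
    # fresh model per trial; "my-13b-model" is a placeholder for my checkpoint
    return AutoModelForCausalLM.from_pretrained("my-13b-model")

def hp_space(trial):
    # search space expressed with Ray Tune sampling primitives
    return {
        "learning_rate": tune.loguniform(1e-6, 1e-4),
        "per_device_train_batch_size": tune.choice([1, 2]),
    }

# placeholder for my tokenized train/eval datasets
train_dataset, eval_dataset = load_my_datasets()

training_args = TrainingArguments(
    output_dir="hp_search_out",
    per_device_train_batch_size=1,
)

trainer = Trainer(
    model_init=model_init,   # required for hyperparameter_search
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="ray",
    n_trials=8,
    direction="minimize",
    # one trial at a time, all 8 GPUs assigned to that trial
    resources_per_trial={"cpu": 16, "gpu": 8},
    max_concurrent_trials=1,
)
```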
I tried adding a DeepSpeed ZeRO-2 config to the `Trainer`, but without success: every trial crashes with the error `module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu`. I guess this error originates from how Ray starts trials. I also tried launching the tuning run with `deepspeed/accelerate [args] tune_script.py` instead of `python tune_script.py`.
Is it possible to tune hyperparameters with HF Trainer + Ray + DeepSpeed? Or is there another way, e.g. using FSDP?