Custom GRPO pipeline

Hello 👋

I am currently trying to write a custom GRPO RL pipeline from scratch using Ray. Why start from zero? To fine-tune a large vision-language model (like Qwen2.5-VL-72B-Instruct-AWQ) I need FULL control over the hyperparameters in the DeepSpeed/vLLM configs, which is not the case for most RLHF libraries… Additionally, I would like to use LoRA.
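To make it concrete, this is the kind of explicit, hand-written configuration I mean (a minimal sketch with placeholder values, nothing tuned for a 72B model): a plain DeepSpeed config dict plus a PEFT LoraConfig that I control directly instead of going through a trainer abstraction.

```python
# Placeholder values, just to illustrate the knobs I want to own myself;
# nothing here is tuned for a 72B model.
from peft import LoraConfig

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

lora_config = LoraConfig(
    r=16,                      # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",     # placeholder, to be checked for the VL model
)
```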

Long story short, my dream training loop looks something like this:
ray init deepspeed engine (2 GPUs) - one actor, tensor parallelism
ray init vllm engine (2 GPUs) - one actor, tensor parallelism
for epoch in epochs:
  for step, batch in ray.dataloader():
    1. model_infer generates output
    2. calculate custom reward based on decoded tokens and logit probabilities
    3. calculate loss
    4. deepspeed_model forward pass and backprop
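In code, the structure I have in mind is roughly the sketch below. It is only a sketch: it assumes a single training process (a real 72B run needs a proper multi-process launch and sharding), it glosses over how vLLM's distributed executor behaves inside a Ray actor, and `compute_reward`, `make_grpo_batch` and `grpo_loss` are hypothetical helpers I still have to write.

```python
import ray
import torch
import deepspeed
from transformers import Qwen2_5_VLForConditionalGeneration  # needs a recent transformers release
from vllm import LLM, SamplingParams

MODEL_ID = "Qwen/Qwen2.5-VL-72B-Instruct-AWQ"


def compute_reward(outputs):
    """Hypothetical: score each decoded completion with the custom reward."""
    raise NotImplementedError


def make_grpo_batch(outputs, rewards):
    """Hypothetical: build token ids / advantages / old log-probs for the loss."""
    raise NotImplementedError


def grpo_loss(engine, batch):
    """Hypothetical: policy forward pass + GRPO objective."""
    raise NotImplementedError


@ray.remote(num_gpus=2)
class InferenceActor:
    """vLLM engine, tensor parallel over its two GPUs."""

    def __init__(self):
        # depending on the vLLM version, the distributed executor backend may
        # need extra care when running inside a Ray actor
        self.llm = LLM(model=MODEL_ID, tensor_parallel_size=2, enable_lora=True)

    def generate(self, prompts, lora_request=None):
        params = SamplingParams(n=4, temperature=1.0, max_tokens=256, logprobs=1)
        return self.llm.generate(prompts, params, lora_request=lora_request)


@ray.remote(num_gpus=2)
class TrainActor:
    """DeepSpeed engine holding the trainable (LoRA) policy."""

    def __init__(self, ds_config):
        import os
        # single-process DeepSpeed defaults; a real 72B run needs a proper
        # multi-process launch instead of this
        os.environ.setdefault("RANK", "0")
        os.environ.setdefault("LOCAL_RANK", "0")
        os.environ.setdefault("WORLD_SIZE", "1")
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            MODEL_ID, torch_dtype=torch.bfloat16
        )
        # ds_config needs an "optimizer" section (or pass model_parameters
        # plus a client optimizer here instead)
        self.engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

    def train_step(self, batch):
        loss = grpo_loss(self.engine, batch)
        self.engine.backward(loss)
        self.engine.step()
        return loss.item()


def train(dataloader, ds_config, num_epochs):
    ray.init()
    infer = InferenceActor.remote()
    trainer = TrainActor.remote(ds_config)
    for epoch in range(num_epochs):
        for step, batch in enumerate(dataloader):  # batch["prompts"] is an assumption
            outputs = ray.get(infer.generate.remote(batch["prompts"]))  # 1. generate
            rewards = compute_reward(outputs)                           # 2. custom reward
            grpo_batch = make_grpo_batch(outputs, rewards)              # 3. loss inputs
            loss = ray.get(trainer.train_step.remote(grpo_batch))       # 4. forward + backprop
```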

With that in mind, I have a few questions.

  1. The problem is weight synchronization. After the first training step I update the weights, which by definition makes the two model instances diverge. Is there a better way to sync them than saving the adapters every time and loading them again (and thus re-initializing the vLLM model)? See the first sketch after this list for the workaround I have now.
  2. Is there a way to set up the Qwen-VL processor with its config, or am I missing something? See the second sketch after this list.
  3. Does the dataset always have to be in the OpenAI (messages) format shown in your code examples? My point is that in my case it would be better to follow a custom one…
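For question 1, the closest workaround I have found so far (not sure it is the intended way) is to keep one vLLM engine alive with `enable_lora=True` and hot-swap the adapter through a `LoRARequest` with a fresh `lora_int_id` after every optimizer step. The adapter still gets written to disk, but at least the 72B base model is never re-initialized. The adapter path, the step-based naming and the `peft_model.save_pretrained` call on the training side are my own assumptions:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    tensor_parallel_size=2,
    enable_lora=True,  # assumes my vLLM version supports LoRA for this model
)

def generate_with_current_adapter(prompts, adapter_dir, step):
    # A new lora_int_id marks this as a different adapter version, so vLLM
    # loads the fresh weights from adapter_dir without re-initializing the
    # base model.
    lora_req = LoRARequest(f"grpo-step-{step}", step + 1, adapter_dir)
    params = SamplingParams(temperature=1.0, max_tokens=256, logprobs=1)
    return llm.generate(prompts, params, lora_request=lora_req)

# training side, after each optimizer step (PEFT-wrapped model assumed):
#   peft_model.save_pretrained(f"/tmp/adapters/step_{step}")
#   generate_with_current_adapter(prompts, f"/tmp/adapters/step_{step}", step)
```

Is there a cleaner way to push the updated weights into the running engine directly, without going through the filesystem at all?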
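For questions 2 and 3, this is what I have so far (a sketch with placeholder values and a made-up dataset row): the processor configured explicitly through `AutoProcessor`, and my own dataset schema converted to the messages form only right before tokenization. Is that a supported way to do it?

```python
from transformers import AutoProcessor

# explicit processor config (min/max pixel budget for the vision tower);
# the numbers are placeholders
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
)

# a row from my custom dataset schema (made-up example)
row = {"image_path": "imgs/0001.png", "question": "What is shown in the image?"}

# converted to the chat/messages form only here, right before tokenization
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": row["image_path"]},
            {"type": "text", "text": row["question"]},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# the image tensors would still be prepared separately and passed to the
# processor together with this prompt
```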

Best regards,
BK