Custom GRPO pipeline

Hello 👋

I am currently trying to write a custom GRPO RL pipeline from scratch using Ray. Why start from zero? To fine-tune a large vision-language model (like Qwen2.5-VL-72B-Instruct-AWQ) I need FULL control over the hyperparameters in the DeepSpeed/vLLM configs, which is not the case for most RLHF libraries… Additionally, I would like to use LoRA.
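To make it concrete, this is the kind of explicit, hand-written configuration I mean (a minimal sketch with placeholder values, nothing tuned for a 72B model): a plain DeepSpeed config dict plus a PEFT LoraConfig that I control directly instead of going through a trainer abstraction.

```python
# Placeholder values, just to illustrate the knobs I want to own myself;
# nothing here is tuned for a 72B model.
from peft import LoraConfig

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

lora_config = LoraConfig(
    r=16,                      # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",     # placeholder, to be checked for the VL model
)
```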

Long story short, my dream training loop looks something like this:
ray init deepspeed engine (2 GPUs) - one actor, tensor parallelism
ray init vllm engine (2 GPUs) - one actor, tensor parallelism
for epoch in epochs:
  for step, batch in ray.dataloader():
    1. model_infer generates output
    2. calculate custom reward based on decoded tokens and logit probabilities
    3. calculate loss
    4. deepspeed_model forward pass and backprop
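In code, the structure I have in mind is roughly the sketch below. It is only a sketch: it assumes a single training process (a real 72B run needs a proper multi-process launch and sharding), it glosses over how vLLM's distributed executor behaves inside a Ray actor, and `compute_reward`, `make_grpo_batch` and `grpo_loss` are hypothetical helpers I still have to write.

```python
import ray
import torch
import deepspeed
from transformers import Qwen2_5_VLForConditionalGeneration  # needs a recent transformers release
from vllm import LLM, SamplingParams

MODEL_ID = "Qwen/Qwen2.5-VL-72B-Instruct-AWQ"


def compute_reward(outputs):
    """Hypothetical: score each decoded completion with the custom reward."""
    raise NotImplementedError


def make_grpo_batch(outputs, rewards):
    """Hypothetical: build token ids / advantages / old log-probs for the loss."""
    raise NotImplementedError


def grpo_loss(engine, batch):
    """Hypothetical: policy forward pass + GRPO objective."""
    raise NotImplementedError


@ray.remote(num_gpus=2)
class InferenceActor:
    """vLLM engine, tensor parallel over its two GPUs."""

    def __init__(self):
        # depending on the vLLM version, the distributed executor backend may
        # need extra care when running inside a Ray actor
        self.llm = LLM(model=MODEL_ID, tensor_parallel_size=2, enable_lora=True)

    def generate(self, prompts, lora_request=None):
        params = SamplingParams(n=4, temperature=1.0, max_tokens=256, logprobs=1)
        return self.llm.generate(prompts, params, lora_request=lora_request)


@ray.remote(num_gpus=2)
class TrainActor:
    """DeepSpeed engine holding the trainable (LoRA) policy."""

    def __init__(self, ds_config):
        import os
        # single-process DeepSpeed defaults; a real 72B run needs a proper
        # multi-process launch instead of this
        os.environ.setdefault("RANK", "0")
        os.environ.setdefault("LOCAL_RANK", "0")
        os.environ.setdefault("WORLD_SIZE", "1")
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            MODEL_ID, torch_dtype=torch.bfloat16
        )
        # ds_config needs an "optimizer" section (or pass model_parameters
        # plus a client optimizer here instead)
        self.engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

    def train_step(self, batch):
        loss = grpo_loss(self.engine, batch)
        self.engine.backward(loss)
        self.engine.step()
        return loss.item()


def train(dataloader, ds_config, num_epochs):
    ray.init()
    infer = InferenceActor.remote()
    trainer = TrainActor.remote(ds_config)
    for epoch in range(num_epochs):
        for step, batch in enumerate(dataloader):  # batch["prompts"] is an assumption
            outputs = ray.get(infer.generate.remote(batch["prompts"]))  # 1. generate
            rewards = compute_reward(outputs)                           # 2. custom reward
            grpo_batch = make_grpo_batch(outputs, rewards)              # 3. loss inputs
            loss = ray.get(trainer.train_step.remote(grpo_batch))       # 4. forward + backprop
```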

With that in mind, I have a few questions.

  1. The problem is weight synchronization. After the first training step I update the weights, which by definition makes the two model instances diverge. Is there a better way to sync them than saving the adapters every time and loading them again (and thus re-initializing the vLLM model)? See the first sketch after this list for the workaround I have now.
  2. Is there a way to set up the Qwen-VL processor with its config, or am I missing something? See the second sketch after this list.
  3. Does the dataset always have to be in the OpenAI (messages) format shown in your code examples? My point is that in my case it would be better to follow a custom one…
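For question 1, the closest workaround I have found so far (not sure it is the intended way) is to keep one vLLM engine alive with `enable_lora=True` and hot-swap the adapter through a `LoRARequest` with a fresh `lora_int_id` after every optimizer step. The adapter still gets written to disk, but at least the 72B base model is never re-initialized. The adapter path, the step-based naming and the `peft_model.save_pretrained` call on the training side are my own assumptions:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    tensor_parallel_size=2,
    enable_lora=True,  # assumes my vLLM version supports LoRA for this model
)

def generate_with_current_adapter(prompts, adapter_dir, step):
    # A new lora_int_id marks this as a different adapter version, so vLLM
    # loads the fresh weights from adapter_dir without re-initializing the
    # base model.
    lora_req = LoRARequest(f"grpo-step-{step}", step + 1, adapter_dir)
    params = SamplingParams(temperature=1.0, max_tokens=256, logprobs=1)
    return llm.generate(prompts, params, lora_request=lora_req)

# training side, after each optimizer step (PEFT-wrapped model assumed):
#   peft_model.save_pretrained(f"/tmp/adapters/step_{step}")
#   generate_with_current_adapter(prompts, f"/tmp/adapters/step_{step}", step)
```

Is there a cleaner way to push the updated weights into the running engine directly, without going through the filesystem at all?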
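For questions 2 and 3, this is what I have so far (a sketch with placeholder values and a made-up dataset row): the processor configured explicitly through `AutoProcessor`, and my own dataset schema converted to the messages form only right before tokenization. Is that a supported way to do it?

```python
from transformers import AutoProcessor

# explicit processor config (min/max pixel budget for the vision tower);
# the numbers are placeholders
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
)

# a row from my custom dataset schema (made-up example)
row = {"image_path": "imgs/0001.png", "question": "What is shown in the image?"}

# converted to the chat/messages form only here, right before tokenization
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": row["image_path"]},
            {"type": "text", "text": row["question"]},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# the image tensors would still be prepared separately and passed to the
# processor together with this prompt
```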

Best regards,
BK