Are there any examples/demos of how to do this for inference? I've got a big model that needs sequence parallelism, and I'm looking to split the workload 8x on a single node.
Thanks
vLLM on Ray Serve gives you tensor parallelism baked in and is probably your best bet. Guide coming soon!
Here’s an example of setting up vLLM with Ray Serve: Serve a Large Language Model with vLLM — Ray 3.0.0.dev0
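For reference, here's a minimal sketch of the tensor-parallel setup itself using vLLM's offline API (the model name is a placeholder; adjust `tensor_parallel_size` to your GPU count). This requires 8 visible GPUs on the node, so it's illustrative rather than something you can run anywhere:

```python
from vllm import LLM, SamplingParams

# Placeholder model — swap in your own checkpoint.
# tensor_parallel_size=8 shards the model's weights across
# 8 GPUs on the node (vLLM uses Ray under the hood to manage
# the worker processes).
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=8,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```

Wrapping the same `LLM` object in a Ray Serve deployment (as in the linked guide) is what gets you an HTTP endpoint on top of this.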