Sequence/Tensor Parallelism with Ray Serve

Are there any examples/demos of how to do this for inference? I have a big model that needs sequence parallelism, and I'm looking to split the workload 8x across a node.


vLLM on Ray Serve gives you tensor parallelism out of the box and is probably your best bet. A guide is coming soon!

Here's an example of setting up vLLM with Ray Serve: Serve a Large Language Model with vLLM — Ray 3.0.0.dev0
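To make the pattern concrete, here's a minimal sketch of what a Ray Serve deployment wrapping vLLM's async engine might look like, with `tensor_parallel_size=8` to shard the model across 8 GPUs on one node. The model name, route, and deployment class are placeholders, not from the linked guide — check the guide for the exact, up-to-date setup. This is a deployment sketch, not runnable without a multi-GPU node:

```python
# Sketch: Ray Serve deployment wrapping vLLM with tensor parallelism.
# Model name and route are hypothetical placeholders.
from fastapi import FastAPI
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 8})
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="meta-llama/Llama-2-70b-hf",  # placeholder model
            tensor_parallel_size=8,  # split weights across 8 GPUs
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    @app.post("/generate")
    async def generate(self, prompt: str, max_tokens: int = 256):
        request_id = random_uuid()
        results = self.engine.generate(
            prompt, SamplingParams(max_tokens=max_tokens), request_id
        )
        # Stream until the final output for this request is ready.
        final_output = None
        async for output in results:
            final_output = output
        return {"text": [o.text for o in final_output.outputs]}


deployment = VLLMDeployment.bind()
# Launch with: serve run your_module:deployment
```

Note that with `tensor_parallel_size=8`, vLLM uses Ray internally to spawn the worker actors, so the weights are sharded automatically; you don't need to split the model yourself.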