Are there any examples/demos of how to do this for inference? I've got a big model that needs sequence parallelism, and I'm looking to split the workload 8x on a single node.
Thanks
vLLM on Ray Serve gives you tensor parallelism baked in and is probably your best bet. Guide coming soon!
Here’s an example of setting up vLLM with Ray Serve: Serve a Large Language Model with vLLM — Ray 3.0.0.dev0
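For reference, here's a minimal sketch of the tensor-parallel setup itself using vLLM's offline API (the model name is a placeholder; adjust `tensor_parallel_size` to your GPU count). This requires 8 visible GPUs on the node, so it's illustrative rather than something you can run anywhere:

```python
from vllm import LLM, SamplingParams

# Placeholder model — swap in your own checkpoint.
# tensor_parallel_size=8 shards the model's weights across
# 8 GPUs on the node (vLLM uses Ray under the hood to manage
# the worker processes).
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=8,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```

Wrapping the same `LLM` object in a Ray Serve deployment (as in the linked guide) is what gets you an HTTP endpoint on top of this.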