I’m wondering whether NVIDIA MPS (Multi-Process Service) can be used to improve GPU utilization for inference.
Say we have a Ray Serve deployment. If the request load is low at runtime, is it possible to co-schedule another deployment on the same GPU and leverage MPS for higher utilization?
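For context, my understanding is that MPS has to be enabled on the node before the GPU processes start. A rough sketch of the standard NVIDIA MPS workflow (not Ray-specific; device 0 is just an example):

```shell
# Expose the GPU(s) to be shared to the MPS daemon (device 0 as an example).
export CUDA_VISIBLE_DEVICES=0

# Start the MPS control daemon; CUDA processes launched afterwards on this
# node will transparently share the GPU through MPS.
nvidia-cuda-mps-control -d

# ... start Ray / the Serve deployments here ...

# Shut the daemon down when done.
echo quit | nvidia-cuda-mps-control
```

On the Ray side, I assume the co-scheduling itself would be expressed with fractional GPUs, e.g. `ray_actor_options={"num_gpus": 0.5}` on each deployment, so that two replicas land on the same device.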
Hi @valiantljk, I am also interested in using MPS. Curious whether you made any progress on it? Thanks!