Suppose a trained model is used for online inference and I have a GPU with 12 GB of memory.
While inference is running on the first image, which uses, say, 6 GB, a new image arrives, and we cannot know its exact size ahead of time.
If it needs, say, 3 GB, that is fine: the GPU can fit both images and run inference on them simultaneously.
But if the new image needs 8 GB, that is clearly more than the GPU's free memory.
My question is: in the second case, will the inference be scheduled to wait until enough memory is available, or will it trigger an OOM and still manage to run via a retry mechanism?
@Li_Bin I believe the OOM monitor may kill the first job if the inference job exceeds the memory threshold and needs more memory, and then reschedule it via the retry mechanism once memory has been freed up.
cc: @ClarenceNg how does the OOM handling work in case 2 when dealing with GPU memory?
We currently defer GPU memory management completely to the framework being used. So @Li_Bin it will cause a CUDA OOM unless the framework you're using does something about it.
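For example, with PyTorch the failure just surfaces as an exception in your own code, so you can build a retry around it yourself. A rough sketch (the helper name and retry policy here are only illustrative, not something the framework provides for you):

```python
import time
import torch

def infer_with_retry(model, batch, max_retries=3, wait_s=5.0):
    # Illustrative helper: `model` and `batch` stand in for whatever
    # your serving code actually passes around.
    for attempt in range(max_retries):
        try:
            with torch.no_grad():
                return model(batch.cuda())
        except torch.cuda.OutOfMemoryError:  # plain RuntimeError on older PyTorch
            # Drop cached allocator blocks and give the other request
            # time to finish before trying again.
            torch.cuda.empty_cache()
            time.sleep(wait_s)
    raise RuntimeError("still out of GPU memory after retries")
```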
Note that there are advanced techniques to get around these limitations! (Or something as simple as getting a beefier GPU for inference.)
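One simple version of the "wait until enough memory is available" behavior you asked about is to poll free GPU memory before admitting a request. A minimal sketch, assuming PyTorch and that you can estimate the request's footprint yourself:

```python
import time
import torch

def wait_for_gpu_memory(required_bytes, poll_s=1.0, timeout_s=60.0):
    # Block until the current CUDA device reports at least
    # `required_bytes` free, or the timeout expires. The estimate is
    # yours to supply; fragmentation can still cause an OOM.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        if free_bytes >= required_bytes:
            return True
        time.sleep(poll_s)
    return False
```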
Can you share more about your workload? Exactly what model? There are frameworks to do model-parallel inference but they come with significant R&D requirements and usually take a throughput hit.