I’m a newbie at distributed computing, so I’m not sure whether this approach makes sense. Let’s say I have problems that create anywhere from a dozen to several thousand jobs. Each job takes a few seconds to a few minutes, uses 1 or 2 cores, and sometimes a GPU.
I don’t want to schedule 1,000 jobs and spin up 1,000 cores, because if the jobs only take a few seconds each, I’m perfectly fine having a queue of 100 jobs on just a couple of CPUs.
So my problem is essentially that even if a CPU is “busy” at the moment I submit another job, that doesn’t mean it won’t be free within a few seconds, in which case scaling out to another node would be a massive waste of resources.
Ideally I’d keep track of the number of jobs I’ve submitted, the resources they use, and a rough runtime estimate on my end, and use that to judge whether the backlog is acceptable or requires more resources (and hence more nodes).
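To make that concrete, here’s a rough sketch of the bookkeeping I’m imagining. All the names, the core-seconds heuristic, and the 30-second threshold are just my own placeholders, not from any real framework:

```python
from dataclasses import dataclass, field

@dataclass
class PendingJob:
    cores: int            # 1 or 2 cores per job
    gpus: int             # 0 or 1 GPUs
    est_seconds: float    # my rough runtime estimate

@dataclass
class ClusterState:
    total_cores: int
    pending: list = field(default_factory=list)

    def backlog_core_seconds(self) -> float:
        """Total estimated work still queued, in core-seconds."""
        return sum(j.cores * j.est_seconds for j in self.pending)

    def should_scale_up(self, max_wait_seconds: float = 30.0) -> bool:
        """Only add a node if the queue would keep the current cores
        busy for longer than I'm willing to wait."""
        if self.total_cores == 0:
            return True
        expected_drain = self.backlog_core_seconds() / self.total_cores
        return expected_drain > max_wait_seconds

# Usage: 100 ten-second single-core jobs queued on a 4-core node.
state = ClusterState(total_cores=4)
state.pending = [PendingJob(cores=1, gpus=0, est_seconds=10.0) for _ in range(100)]
print(state.should_scale_up())  # True: ~250 s to drain > 30 s threshold
```

So a short queue of quick jobs would drain below the threshold and never trigger a new node, while a pile of multi-minute jobs would. Is this kind of backlog-based heuristic something existing schedulers already do for me, or do I have to build it myself?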