Low CPU utilization when compared to multiprocessing

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m currently trying to use Ray to parallelize pyscipopt MINLP solves with different configurations.
Doing so with traditional multiprocessing (i.e. a multiprocessing pool) works and completely saturates all cores.
With Ray (a local cluster via ray.init()), however, I only ever get two cores saturated, both running at low priority.

Since these are multiple large jobs, I shouldn’t have any problem with overhead (as can be seen from the fact that traditional multiprocessing works).
I am sending a neural network to the remote workers (to be run on CPU), but that shouldn’t matter (at least it doesn’t for multiprocessing).

This problem persists regardless of whether I use @ray.remote decorators or ray.util.multiprocessing.Pool to simply replace the multiprocessing pool with a Ray pool.
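
For context, the @ray.remote variant looks roughly like the following sketch (solve_one and its dummy workload are illustrative stand-ins for the actual pyscipopt solve):

import ray

ray.init()

@ray.remote
def solve_one(config):
    # Stand-in for one pyscipopt MINLP solve with a given configuration;
    # it just burns CPU so per-core utilization is easy to observe.
    total = 0
    for i in range(10_000_000):
        total += i * config
    return total

configs = list(range(16))
futures = [solve_one.remote(cfg) for cfg in configs]
print(len(ray.get(futures)), "tasks finished")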

I’m using 32 GiB of RAM and an AMD Ryzen 7 5800X3D, on Ray 2.3.1.
As is, I’m unable to use ray to complete my task.

@alexander_mattick Are you running this on a single machine with multiple cores?

I have explored different strategies for MP/MT/Ray, and I wonder if you are doing the same thing. Without the code, we can’t tell what could be going awry.

cc: @cade @Chen_Shen

I managed to locate my problem:
The reason the multiprocessing variant is able to utilize more cores is that I use mp.set_start_method("spawn"); without that, the plain multiprocessing version is also constrained to only two cores.
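
For reference, the start-method change that makes the plain multiprocessing version saturate all cores is just the following (a minimal sketch; square is a placeholder workload):

import multiprocessing as mp

def square(x):
    # Placeholder workload standing in for one solve.
    return x * x

if __name__ == "__main__":
    # Without this line the default start method ("fork" on Linux) is used,
    # and in my setup the workers then stay pinned to two cores.
    mp.set_start_method("spawn")
    with mp.Pool(processes=mp.cpu_count()) as pool:
        print(pool.map(square, range(16)))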

Is there an equivalent (or similar) option I can try for ray?

@alexander_mattick If you create the Pool with a number of actors equal to the number of cores, it should use them all.

# Let's try that with a Ray multiprocessing pool.
import multiprocessing as mp
from ray.util.multiprocessing import Pool

cpu_count = mp.cpu_count()
...
ray_pool = Pool(cpu_count)      # one actor per CPU core
lst = list(range(NUM))          # NUM = number of work items (elided above)
results = []
for result in ray_pool.map(<your_function>, lst):   # map takes the function and the iterable
    results.append(result)
ray_pool.terminate()

That’s how I would expect it to work, but it doesn’t.
Here is a minimal version:

import torch
from pyscipopt import Model
import multiprocessing as mp
from ray.util.multiprocessing import Pool as rayPool
import pyscipopt as scip
from scipy.spatial.distance import cdist

def make_tsp():
    """
    USE MTZ formulation
    """
    #g_cpu = torch.Generator()
    #if seed is not None:
    #    g_cpu.manual_seed(seed)
    # Define a distance matrix for the cities
    size = 75
    d = torch.randn(size,2,).numpy()*2
    dist_matrix = cdist(d,d)
    #print("TSP size",size)
    # Create a SCIP model
    model = Model("TSP")

    # Define variables
    num_cities = dist_matrix.shape[0]
    x = {}

    for i in range(num_cities):
        for j in range(num_cities):
            if i != j:
                x[i,j] = model.addVar(vtype="B", name=f"x_{i}_{j}")
    u={}
    for i in range(1,num_cities):
        u[i] = model.addVar(vtype="I", name=f"u_{i}")
        model.addCons(1<=(u[i]<= num_cities-1), name=f"u_{i}_constraint")

    # Define constraints
    # Each city must be visited exactly once
    for i in range(num_cities):
        model.addCons(scip.quicksum(x[i,j] for j in range(num_cities) if j != i) == 1, name=f"city_{i}_visited_origin")
    for j in range(num_cities):
        model.addCons(scip.quicksum(x[i,j] for i in range(num_cities) if j != i) == 1, name=f"city_{j}_visited_dest")
    # There should be no subtours
    for i in range(1,num_cities):
        for j in range(1,num_cities):
            if i != j:
                model.addCons(u[i] - u[j] + (num_cities - 1)*x[i,j]<= num_cities-2, name=f"no_subtour_{i}_{j}")
    

    # Define objective
    model.setObjective(scip.quicksum(dist_matrix[i,j] * x[i,j] for i in range(num_cities) for j in range(num_cities) if j != i), sense="minimize")

    return model


def launch_models(pool, num_proc: int):
    g = torch.Generator()
    arg_list = []
    for _ in range(num_proc):
        seed = g.seed()
        f = make_tsp
        arg_list.append((seed,  f))
    result = pool.starmap_async(__make_and_optimize, arg_list)
    g, bg = [], []
    for res in result.get():
        # result.get() already yields the tuples returned by __make_and_optimize
        (gap, baseline_gap) = res
        g.append(gap)
        bg.append(baseline_gap)
        print("retrieved")
    return g, bg


def __make_and_optimize(seed,  f):
    print("started")
    torch.manual_seed(seed)
    model = f()
    model.setRealParam("limits/time", 60)
    model.hideOutput()
    model.optimize()
    baseline_gap=model.getGap()
    baseline_nodes = model.getNNodes()

    model.freeTransform()
    #model.freeProb()
    model.setRealParam("limits/time", 60)
    model.hideOutput()
    model.optimize()
    gap = model.getGap()
    print("done converting, starting to send to main process")
    return (gap, baseline_gap)




def main():
    mp.set_start_method("spawn")
    pool = mp.Pool(processes=16)
    g,bg = launch_models(pool, 16)
    print(g)
    print(bg)

if __name__ == "__main__":
    main()

It works with mp.Pool plus mp.set_start_method("spawn").
It doesn’t work with just mp.Pool, or with rayPool.
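
For clarity, the rayPool variant only swaps the pool construction in main(); a sketch of that substitution, reusing launch_models from the script above:

from ray.util.multiprocessing import Pool as rayPool

def main():
    # Same launch_models as above; only the pool changes.
    # No set_start_method call here: Ray actors are already separate worker processes.
    pool = rayPool(processes=16)
    g, bg = launch_models(pool, 16)
    print(g)
    print(bg)

if __name__ == "__main__":
    main()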

@alexander_mattick

So effectively, you’re saying that with a Ray Pool of 16 you only get 2 actors running, even though you have 16 cores. For the Ray-equivalent Pool you don’t need “spawn”; it should create a pool of 16 actors.

cc: @sangcho Any insight here?

@alexander_mattick I ran two of my examples on my laptop and they both seem to launch all the actors requested.

  1. misc-code/mp_all.py at master · dmatrix/misc-code · GitHub
  2. misc-code/mp_pool.py at master · dmatrix/misc-code · GitHub

@alexander_mattick

Can you try using map(func) instead of starmap_async here: result = pool.starmap_async(__make_and_optimize, arg_list), for your Ray Pool?

It does create 16 actors, but they are all locked to the same two cores, and using map instead of starmap doesn’t change that.

For comparison, with multiprocessing and the spawn start method all cores are saturated.


In short: the problem isn’t that Ray fundamentally produces too few workers, it’s that the workers it does produce are locked to the same two cores and therefore compete for CPU time while the rest of the system idles.
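
A quick way to confirm the pinning from inside the workers (a sketch; Linux-only, since it relies on os.sched_getaffinity):

import os
import ray

ray.init()

@ray.remote
def report_affinity():
    # Return the set of CPU cores this worker process is allowed to run on.
    return os.getpid(), os.sched_getaffinity(0)

for pid, cpus in ray.get([report_affinity.remote() for _ in range(8)]):
    print(f"worker {pid}: allowed CPUs = {sorted(cpus)}")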

@alexander_mattick Intriguing. It seems like some part of the code, or something yet to be determined, pins or forces the affinity of the actors to those two cores.

This is the second issue I’ve seen related to something pinning processes to a subset of the available cores. The first one turned out to be an issue with the conda release of MKL: “Conda Pytorch set processor affinity to the first physical core after fork” (Issue #99625 · pytorch/pytorch · GitHub). MKL seems to call sched_setaffinity after fork.

Since you’re using torch I suspect it’s the same issue. Can you try the workarounds described in the Torch issue?

We may have a flag to use spawn instead of fork… let me ask around

Another alternative is to just call sched_setaffinity with all CPUs in your mapper code, but that’s not exactly ideal :joy:
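
As a sketch of that workaround (Linux-only, since it uses os.sched_setaffinity; reset_affinity is a hypothetical helper you would call at the top of __make_and_optimize):

import os

def reset_affinity():
    # Allow the current process to run on every core again,
    # undoing any narrowed affinity mask inherited from the parent.
    os.sched_setaffinity(0, range(os.cpu_count()))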

Thank you, the linked solution worked, specifically calling ray.init() before importing torch.
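
For anyone hitting the same thing, the fix is just an import-order change at the top of the driver script (a sketch of what worked here, per the linked PyTorch issue):

import ray

# Start Ray (and the raylet it spawns) before torch is imported anywhere,
# so the Ray processes don't inherit the narrowed CPU affinity that the
# MKL-based torch import reportedly sets after fork.
ray.init()

import torch  # noqa: E402  (deliberately imported after ray.init())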


Hmm, we should probably not have the raylet inherit the CPU affinity setting. There seems to be no benefit from this behavior.

I will create an issue.