Low CPU utilization when compared to multiprocessing

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m currently trying to use Ray to parallelize pyscipopt MINLP solves with different configurations.
Doing so with traditional multiprocessing (i.e. a multiprocessing pool) works and completely saturates all cores.
With Ray (a local cluster via ray.init()), however, I only ever get two cores saturated, both running at low priority.

Since these are multiple large jobs, I shouldn’t have any problem with overhead (as can be seen from the fact that traditional multiprocessing works).
I am sending a neural network to the remote workers (to be run on CPU), but that shouldn’t matter (at least it doesn’t for multiprocessing).

This problem persists regardless of whether I use @ray.remote decorators or ray.util.multiprocessing.Pool to simply replace the multiprocessing pool with a Ray pool.
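
For context, the @ray.remote variant looks roughly like the following sketch (solve_one and its dummy workload are illustrative stand-ins for the actual pyscipopt solve):

import ray

ray.init()

@ray.remote
def solve_one(config):
    # Stand-in for one pyscipopt MINLP solve with a given configuration;
    # it just burns CPU so per-core utilization is easy to observe.
    total = 0
    for i in range(10_000_000):
        total += i * config
    return total

configs = list(range(16))
futures = [solve_one.remote(cfg) for cfg in configs]
print(len(ray.get(futures)), "tasks finished")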

I’m using 32 GiB of RAM and an AMD Ryzen 7 5800X3D, on Ray 2.3.1.
As is, I’m unable to use ray to complete my task.

@alexander_mattick Are you running this on a single machine with multiple cores?

I have explored different strategies for MP/MT/Ray, and I wonder if you are doing the same thing. Without the code, we can’t tell what could be going awry.

cc: @cade @Chen_Shen

I managed to locate my problem:
The reason the multiprocessing variant is able to utilize more cores is that I use mp.set_start_method("spawn"); without that, the plain multiprocessing version is also constrained to only two cores.
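
For reference, the start-method change that makes the plain multiprocessing version saturate all cores is just the following (a minimal sketch; square is a placeholder workload):

import multiprocessing as mp

def square(x):
    # Placeholder workload standing in for one solve.
    return x * x

if __name__ == "__main__":
    # Without this line the default start method ("fork" on Linux) is used,
    # and in my setup the workers then stay pinned to two cores.
    mp.set_start_method("spawn")
    with mp.Pool(processes=mp.cpu_count()) as pool:
        print(pool.map(square, range(16)))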

Is there an equivalent (or similar) option I can try for ray?

@alexander_mattick If you create the Pool with a number of actors equal to the number of cores, it should use them all.

# Let's try that with a Ray multiprocessing pool.
import multiprocessing as mp
from ray.util.multiprocessing import Pool

cpu_count = mp.cpu_count()
...
ray_pool = Pool(cpu_count)      # one actor per CPU core
lst = list(range(NUM))          # NUM = number of work items (elided above)
results = []
for result in ray_pool.map(<your_function>, lst):   # map takes the function and the iterable
    results.append(result)
ray_pool.terminate()

That’s how I would expect it to work, but it doesn’t.
Here is a minimal version:

import torch
from pyscipopt import Model
import multiprocessing as mp
from ray.util.multiprocessing import Pool as rayPool
import pyscipopt as scip
from scipy.spatial.distance import cdist

def make_tsp():
    """
    USE MTZ formulation
    """
    #g_cpu = torch.Generator()
    #if seed is not None:
    #    g_cpu.manual_seed(seed)
    # Define a distance matrix for the cities
    size = 75
    d = torch.randn(size,2,).numpy()*2
    dist_matrix = cdist(d,d)
    #print("TSP size",size)
    # Create a SCIP model
    model = Model("TSP")

    # Define variables
    num_cities = dist_matrix.shape[0]
    x = {}

    for i in range(num_cities):
        for j in range(num_cities):
            if i != j:
                x[i,j] = model.addVar(vtype="B", name=f"x_{i}_{j}")
    u={}
    for i in range(1,num_cities):
        u[i] = model.addVar(vtype="I", name=f"u_{i}")
        model.addCons(1<=(u[i]<= num_cities-1), name=f"u_{i}_constraint")

    # Define constraints
    # Each city must be visited exactly once
    for i in range(num_cities):
        model.addCons(scip.quicksum(x[i,j] for j in range(num_cities) if j != i) == 1, name=f"city_{i}_visited_origin")
    for j in range(num_cities):
        model.addCons(scip.quicksum(x[i,j] for i in range(num_cities) if j != i) == 1, name=f"city_{j}_visited_dest")
    # There should be no subtours
    for i in range(1,num_cities):
        for j in range(1,num_cities):
            if i != j:
                model.addCons(u[i] - u[j] + (num_cities - 1)*x[i,j]<= num_cities-2, name=f"no_subtour_{i}_{j}")
    

    # Define objective
    model.setObjective(scip.quicksum(dist_matrix[i,j] * x[i,j] for i in range(num_cities) for j in range(num_cities) if j != i), sense="minimize")

    return model


def launch_models(pool, num_proc: int):
    g = torch.Generator()
    arg_list = []
    for _ in range(num_proc):
        seed = g.seed()
        f = make_tsp
        arg_list.append((seed,  f))
    result = pool.starmap_async(__make_and_optimize, arg_list)
    g, bg = [], []
    for res in result.get():
        # result.get() already yields the tuples returned by __make_and_optimize
        (gap, baseline_gap) = res
        g.append(gap)
        bg.append(baseline_gap)
        print("retrieved")
    return g, bg


def __make_and_optimize(seed,  f):
    print("started")
    torch.manual_seed(seed)
    model = f()
    model.setRealParam("limits/time", 60)
    model.hideOutput()
    model.optimize()
    baseline_gap=model.getGap()
    baseline_nodes = model.getNNodes()

    model.freeTransform()
    #model.freeProb()
    model.setRealParam("limits/time", 60)
    model.hideOutput()
    model.optimize()
    gap = model.getGap()
    print("done converting, starting to send to main process")
    return (gap, baseline_gap)




def main():
    mp.set_start_method("spawn")
    pool = mp.Pool(processes=16)
    g,bg = launch_models(pool, 16)
    print(g)
    print(bg)

if __name__ == "__main__":
    main()

It works with mp.Pool plus mp.set_start_method("spawn").
It doesn’t work with just mp.Pool, or with rayPool.
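
For clarity, the rayPool variant only swaps the pool construction in main(); a sketch of that substitution, reusing launch_models from the script above:

from ray.util.multiprocessing import Pool as rayPool

def main():
    # Same launch_models as above; only the pool changes.
    # No set_start_method call here: Ray actors are already separate worker processes.
    pool = rayPool(processes=16)
    g, bg = launch_models(pool, 16)
    print(g)
    print(bg)

if __name__ == "__main__":
    main()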

@alexander_mattick

So effectively, you’re saying that with a Ray Pool of 16 you only get 2 actors running, even though you have 16 cores. For the Ray-equivalent Pool you don’t need “spawn”; it should create a pool of 16 actors.

cc: @sangcho Any insight here?

@alexander_mattick I ran two of my examples on my laptop and they both seem to launch all the actors requested.

  1. misc-code/mp_all.py at master · dmatrix/misc-code · GitHub
  2. misc-code/mp_pool.py at master · dmatrix/misc-code · GitHub

@alexander_mattick

Can you try using map(func) instead of starmap_async here: result = pool.starmap_async(__make_and_optimize, arg_list), for your Ray Pool?

It does create 16 actors, but they are all locked to the same two cores, and using map instead of starmap doesn’t change that.

For comparison, with multiprocessing and the spawn start method all cores are saturated.


In short: the problem isn’t that Ray fundamentally produces too few workers, it’s that the workers it does produce are locked to the same two cores and therefore compete for CPU time while the rest of the system idles.
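
A quick way to confirm the pinning from inside the workers (a sketch; Linux-only, since it relies on os.sched_getaffinity):

import os
import ray

ray.init()

@ray.remote
def report_affinity():
    # Return the set of CPU cores this worker process is allowed to run on.
    return os.getpid(), os.sched_getaffinity(0)

for pid, cpus in ray.get([report_affinity.remote() for _ in range(8)]):
    print(f"worker {pid}: allowed CPUs = {sorted(cpus)}")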

@alexander_mattick Intriguing. It seems like some part of the code, or something yet to be determined, pins or forces the affinity of the actors to those two cores.

This is the second issue I’ve seen related to something pinning processes to a subset of the available cores. The first one turned out to be an issue with the conda release of MKL: “Conda Pytorch set processor affinity to the first physical core after fork” (Issue #99625 · pytorch/pytorch · GitHub). MKL seems to call sched_setaffinity after fork.

Since you’re using torch I suspect it’s the same issue. Can you try the workarounds described in the Torch issue?

We may have a flag to use spawn instead of fork… let me ask around

Another alternative is to just call sched_setaffinity with all CPUs in your mapper code, but that’s not exactly ideal :joy:
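
As a sketch of that workaround (Linux-only, since it uses os.sched_setaffinity; reset_affinity is a hypothetical helper you would call at the top of __make_and_optimize):

import os

def reset_affinity():
    # Allow the current process to run on every core again,
    # undoing any narrowed affinity mask inherited from the parent.
    os.sched_setaffinity(0, range(os.cpu_count()))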

Thank you, the linked solution worked, specifically calling ray.init() before importing torch.
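
For anyone hitting the same thing, the fix is just an import-order change at the top of the driver script (a sketch of what worked here, per the linked PyTorch issue):

import ray

# Start Ray (and the raylet it spawns) before torch is imported anywhere,
# so the Ray processes don't inherit the narrowed CPU affinity that the
# MKL-based torch import reportedly sets after fork.
ray.init()

import torch  # noqa: E402  (deliberately imported after ray.init())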


Hmm, we should probably not have the raylet inherit the CPU affinity setting. There seems to be no benefit from this behavior.

I will create an issue.