How severe does this issue affect your experience of using Ray?
High: It blocks me to complete my task.
I’m currently trying to use ray to parallelize pyscipopt MINLP solves with different configurations.
Doing so with traditional multiprocessing (i.e. multiprocessing pool) works and completely saturates all cores.
With ray (local cluster via ray.init()) however, I only ever get two cores saturated, both with low priority
Since these are multiple large jobs, I shouldn’t have any problem with overhead (as can be seen by the fact that traditional multiprocessing works).
I am sending a NN to the remotes (to be run on cpu), but that shouldn’t matter (at least it doesn’t for multiprocessing).
This problem persists regardless of whether I use @ray.remote decorators of ray.util.multiprocessing.Pool to just replace the multiprocessing pool with a ray pool.
I’m using 32GiB RAM, and an AMD Ryzen 7 5800X3D on ray 2.3.1.
As is, I’m unable to use ray to complete my task.
I managed to locate my problem:
The reason the other variant with multiprocessing is able to utilize more cores is due to me using mp.set_start_method("spawn") , without that the plain multiprocessing version is also constrained to only two cores.
Is there an equivalent (or similar) option I can try for ray?
@alexander_mattick If you create the Pool of actors equal to all the cores, it should use them all.
# Let's try that with Ray multiprocessing pool.
import multiprocessing as mp
from ray.util.multiprocessing import Pool
cpu_count = mp.cpu_count()
...
ray_pool = Pool(cpu_count)
lst = list(range(NUM))
results = []
for result in ray_pool.map(<your_function>):
results.append(result)
ray_pool.terminate()
That’s how I would expect it to work, but it doesn’t:
Here is a minimal version
import torch
from pyscipopt import Model
import multiprocessing as mp
from ray.util.multiprocessing import Pool as rayPool
import pyscipopt as scip
from scipy.spatial.distance import cdist
def make_tsp():
"""
USE MTZ formulation
"""
#g_cpu = torch.Generator()
#if seed is not None:
# g_cpu.manual_seed(seed)
# Define a distance matrix for the cities
size = 75
d = torch.randn(size,2,).numpy()*2
dist_matrix = cdist(d,d)
#print("TSP size",size)
# Create a SCIP model
model = Model("TSP")
# Define variables
num_cities = dist_matrix.shape[0]
x = {}
for i in range(num_cities):
for j in range(num_cities):
if i != j:
x[i,j] = model.addVar(vtype="B", name=f"x_{i}_{j}")
u={}
for i in range(1,num_cities):
u[i] = model.addVar(vtype="I", name=f"u_{i}")
model.addCons(1<=(u[i]<= num_cities-1), name=f"u_{i}_constraint")
# Define constraints
# Each city must be visited exactly once
for i in range(num_cities):
model.addCons(scip.quicksum(x[i,j] for j in range(num_cities) if j != i) == 1, name=f"city_{i}_visited_origin")
for j in range(num_cities):
model.addCons(scip.quicksum(x[i,j] for i in range(num_cities) if j != i) == 1, name=f"city_{j}_visited_dest")
# There should be no subtours
for i in range(1,num_cities):
for j in range(1,num_cities):
if i != j:
model.addCons(u[i] - u[j] + (num_cities - 1)*x[i,j]<= num_cities-2, name=f"no_subtour_{i}_{j}")
# Define objective
model.setObjective(scip.quicksum(dist_matrix[i,j] * x[i,j] for i in range(num_cities) for j in range(num_cities) if j != i), sense="minimize")
return model
def launch_models(pool, num_proc: int):
g = torch.Generator()
arg_list = []
for _ in range(num_proc):
seed = g.seed()
f = make_tsp
arg_list.append((seed, f))
result = pool.starmap_async(__make_and_optimize, arg_list)
g, bg = [], []
for res in result.get():
(gap, baseline_gap) = res.get()
g.append(gap)
bg.append(baseline_gap)
print("retrieved")
return g, bg
def __make_and_optimize(seed, f):
print("started")
torch.manual_seed(seed)
model = f()
model.setRealParam("limits/time", 60)
model.hideOutput()
model.optimize()
baseline_gap=model.getGap()
baseline_nodes = model.getNNodes()
model.freeTransform()
#model.freeProb()
model.setRealParam("limits/time", 60)
model.hideOutput()
model.optimize()
gap = model.getGap()
print("done converting, starting to send to main process")
return (gap, baseline_gap)
def main():
mp.set_start_method("spawn")
pool = mp.Pool(processes=16)
g,bg = launch_models(pool, 16)
print(g)
print(bg)
if __name__ == "__main__":
main()
It works with: mp.Pool and mp.set_start_method("spawn")
It doesn’t work with just mp.Pool or rayPool
So effectively, you saying with RayPool of 16 you only get 2 Actors running, even though you have 16 cores. For Ray equivalent Pool, you don’t need the “spawn”. It should create a pool of 16 actors.
In short: the problem isn’t that ray fundamentally produces to few workers, it’s that the workers it does produce are locked to the same core and therefore compete for cpu time while the rest of the system is idling.