[Core] I found that numpy.dot runs slower inside Ray than when run without Ray

Hi, I'm testing performance with Ray.
I used the script below:

import time
import ray
import numpy as np

def big_dot():
    s1 = time.time()
    a1 = np.random.random((10000, 10000))
    a2 = np.random.random((10000, 10000))
    a3 = np.dot(a1, a2)
    s2 = time.time()
    print(a3.shape)
    print(f'total time: {s2 - s1}')
    return s2 - s1

@ray.remote
def big_dot_remote():
    return big_dot()

if __name__ == '__main__':
    ray.init()
    ray.get(big_dot_remote.remote())  # runs in a Ray worker
    big_dot()                         # runs in the driver process

Then I got the result below:

(pid=72495, ip= (10000, 10000)
(pid=72495, ip= total time: 41.802658796310425
(10000, 10000)
total time: 12.533077955245972

I was surprised by this result: big_dot_remote was more than 3 times slower than big_dot.
Can anyone tell me why Ray runs this much slower?

My computer is a MacBook Pro with a 2.9 GHz 6-Core Intel Core i9.

I don't know why it would be slower, but as I understand it, numpy will only use one CPU for this calculation. You would need to leverage a distributed computing library like Modin to transform the numpy array into a distributed compute object.
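As a side note, whether np.dot uses one core depends on the BLAS backend numpy was built against (OpenBLAS and MKL read OMP_NUM_THREADS when the library loads). Below is a minimal sketch for checking this on your own machine; the matrix size and thread counts are arbitrary choices, and each timing runs in a fresh process so the environment variable takes effect:

```python
import os
import subprocess
import sys

# Small script that times a matrix product. It must run in a fresh process
# because the BLAS library reads OMP_NUM_THREADS once, at load time.
SCRIPT = """
import time
import numpy as np
a = np.random.random((2000, 2000))
t0 = time.time()
np.dot(a, a)
print(time.time() - t0)
"""

def timed_dot(n_threads):
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    out = subprocess.run([sys.executable, "-c", SCRIPT],
                         env=env, capture_output=True, text=True, check=True)
    return float(out.stdout)

if __name__ == "__main__":
    # If the two timings are similar, the BLAS backend is effectively
    # single-threaded here; if the second is much faster, it multithreads.
    print("1 thread :", timed_dot(1))
    print("4 threads:", timed_dot(4))
```
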

I wonder if it is because ray.get() is a blocking call, which takes time to initialize and to release memory after the calculation. As grechaw said, you could compare the time taken by Python's built-in multiprocessing against the Ray actor framework.

Check this piece of code:

from multiprocessing import Pool
import numpy as np
import ray
import time

def multiply(input):
    return np.dot(input, input)

@ray.remote
class Actor:
    def __init__(self):
        pass

    def run(self, input):
        return np.dot(input, input)

def make_generator(obj_ids):
    while obj_ids:
        done, obj_ids = ray.wait(obj_ids)
        yield ray.get(done[0])

if __name__ == '__main__':
    ray.init()
    worker = 4
    testing_inputs = [np.random.random((5000, 5000)) for _ in range(16)]
    start1 = time.time()
    with Pool(worker) as p:
        output1 = p.map(multiply, testing_inputs)
    end1 = time.time()
    print("multiprocessing takes {}".format(end1 - start1))

    start2 = time.time()
    workers = [Actor.remote() for _ in range(worker)]
    obj_ids = [workers[i % worker].run.remote(input) for i, input in enumerate(testing_inputs)]
    output2 = [i for i in make_generator(obj_ids)]
    end2 = time.time()
    print("ray takes {}".format(end2 - start2))

Ray saved about 25 percent of the time in my runs with 4 workers.

I think it's actually because Ray automatically sets OMP_NUM_THREADS when it initializes its worker processes. Could you try setting it to something higher and then importing numpy?
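For reference, a minimal sketch of that suggestion. The value 6 is an assumption matching the 6-core machine mentioned above, and the variable must be set before numpy (and its BLAS library) is first imported in the process; inside a Ray task the same applies to the worker process:

```python
import os

# Must happen before numpy is imported: OpenBLAS/MKL read this variable
# when the library loads. If numpy was already imported earlier in this
# process, setting it here has no effect.
os.environ["OMP_NUM_THREADS"] = "6"  # assumption: 6 physical cores

import time
import numpy as np

a = np.random.random((2000, 2000))
t0 = time.time()
c = np.dot(a, a)
print(f"shape={c.shape}, time={time.time() - t0:.3f}s")
```
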


Thx, rliaw. It worked!

@rliaw shouldn’t we automatically set OMP_NUM_THREADS to num_cpus? Or am I missing something here?

Nope, you’re right! Assign me as a reviewer for your PR, thanks.