Remote function with multithreading does not get maximum cpu usages

I have a remote worker that invokes a parallelized numba function, like:

import numba
import time 
import ray

@numba.njit(parallel=True)   # prange only runs in parallel inside an njit-compiled function
def nba_sum(x):
    r = 0
    for i in numba.prange(x.shape[0]):
        r += x[i]
    return r

@ray.remote
def sum(shared_token):
    x = <get the array view from shared token>
    nba_sum(x)    # warm-up call, so the timing below excludes JIT compilation
    t = time.time()
    nba_sum(x)
    return time.time() - t

x = <some very large array view to shared memory>
shared_token = <shared token to the memory of x>

nba_sum(x)    # warm-up, same as in the remote function
t = time.time()
nba_sum(x)
print('local time =', time.time() - t)

print('remote time =', ray.get(sum.remote(shared_token)))

I tested the script on a machine with 256 logical CPUs (two AMD EPYC 7763 64-core processors) and found that the remote time is always slower than the local one, by more than 20%.
Can I avoid this cost?

hi @xjyu

Could the 20% overhead come from the data serialization cost when you pass x into sum? You can verify this by measuring the time inside the sum call itself.
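One way to separate serialization/transfer overhead from compute time is to time only the kernel inside the remote function and compare that to the local run. A minimal stdlib-only sketch (the `timed` helper is my own name, not a Ray API):

```python
import time

def timed(fn, *args):
    """Call fn once to warm up (e.g. JIT compilation), then time a second call."""
    fn(*args)                     # warm-up: excludes one-time costs from the measurement
    t0 = time.perf_counter()      # perf_counter is monotonic, preferable to time.time for intervals
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    return result, elapsed
```

Inside the remote function you would wrap only the `nba_sum(x)` call. If the in-function time matches the local run, the extra ~20% is being spent outside the kernel (argument transfer, scheduling), not in the computation itself.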

Another possibility is that you need to set the environment variable OMP_NUM_THREADS=num_cpus before starting your script, since Ray sets it to 1 in its workers by default.
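To check this second possibility, the variable has to be in the environment before Numba is imported in the worker process. A sketch, assuming you want one thread per logical CPU (exporting the variable in the shell before launching the driver works too):

```python
import os

# Ray sets OMP_NUM_THREADS=1 in its workers by default, which would make
# Numba's prange loop run on a single thread. Set it before Numba is
# imported anywhere in the worker process:
os.environ["OMP_NUM_THREADS"] = str(os.cpu_count())
```

Depending on your Ray version, you may also be able to propagate environment variables to workers via `ray.init(runtime_env={"env_vars": {...}})` instead of setting them process-wide.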

Thanks for reporting the issue @xjyu !

I’m going to mark this as resolved since @Chen_Shen provided directions to troubleshoot. Once you have a chance to try these, we can reopen the discussion if the problem persists.

(Indeed, I agree it could be serialization cost, depending on how large x is.)