Remote function with multithreading does not get maximum cpu usages

I have a remote worker that invokes a parallelized numba function, like:

import numba
import time 
import ray

@numba.njit(parallel=True)   # prange only runs in parallel inside an njit-compiled function
def nba_sum(x):
    r = 0
    for i in numba.prange(x.shape[0]):
        r += x[i]
    return r

@ray.remote
def sum(shared_token):
    x = <get the array view from shared token>
    nba_sum(x)    # warm-up call, so the timing below excludes JIT compilation
    t = time.time()
    nba_sum(x)
    return time.time() - t

x = <some very large array view to shared memory>
shared_token = <shared token to the memory of x>

nba_sum(x)    # warm-up, same as in the remote function
t = time.time()
nba_sum(x)
print('local time =', time.time() - t)

print('remote time =', ray.get(sum.remote(shared_token)))

I tested the script on a machine with 256 logical CPUs (two AMD EPYC 7763 64-core processors) and found that the remote time is always slower than the local one, by more than 20%.
Can I avoid this cost?

hi @xjyu

Could the 20% overhead come from the data serialization cost when you pass x into sum? You can verify this by measuring the time inside the sum call itself.
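One way to separate serialization/transfer overhead from compute time is to time only the kernel inside the remote function and compare that to the local run. A minimal stdlib-only sketch (the `timed` helper is my own name, not a Ray API):

```python
import time

def timed(fn, *args):
    """Call fn once to warm up (e.g. JIT compilation), then time a second call."""
    fn(*args)                     # warm-up: excludes one-time costs from the measurement
    t0 = time.perf_counter()      # perf_counter is monotonic, preferable to time.time for intervals
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    return result, elapsed
```

Inside the remote function you would wrap only the `nba_sum(x)` call. If the in-function time matches the local run, the extra ~20% is being spent outside the kernel (argument transfer, scheduling), not in the computation itself.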

Another possibility is that you need to set the environment variable OMP_NUM_THREADS=num_cpus before starting your script, since Ray sets it to 1 in its workers by default.
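To check this second possibility, the variable has to be in the environment before Numba is imported in the worker process. A sketch, assuming you want one thread per logical CPU (exporting the variable in the shell before launching the driver works too):

```python
import os

# Ray sets OMP_NUM_THREADS=1 in its workers by default, which would make
# Numba's prange loop run on a single thread. Set it before Numba is
# imported anywhere in the worker process:
os.environ["OMP_NUM_THREADS"] = str(os.cpu_count())
```

Depending on your Ray version, you may also be able to propagate environment variables to workers via `ray.init(runtime_env={"env_vars": {...}})` instead of setting them process-wide.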

Thanks for reporting the issue @xjyu !

I’m going to mark this as resolved since @Chen_Shen provided directions to troubleshoot. Once you have a chance to try these, we can reopen the discussion if the problem persists.

(Indeed, I agree it could be serialization cost, depending on how large x is.)