Understanding runtimes and task placement on hardware


I also asked this question on StackOverflow but will try it here as well:

I am completely new to Ray and am trying to understand its runtimes. I want to run the YOLO object detector from OpenCV. I wrote the following code to benchmark performance:

@ray.remote
def f(x):
    a, b, c = cv.detect_common_objects(x)
    return a, b, c

# reading in frame etc. omitted
# num_runs is the number of benchmarking runs 
# num_parallel is the number of parallel OpenCV function calls in Ray

ray.init(num_cpus=min(16, num_parallel))

# run with Ray in parallel
for j in range(num_runs):

    start = time.time()

    result_ids = []
    for i in range(num_parallel):
        result_ids.append(f.remote(frame))
    results = ray.get(result_ids)

    end = time.time()

# benchmark without Ray: Sequentially call OpenCV function
for j in range(num_runs):

    start_noray = time.time()
    for i in range(num_parallel):
        a, b, c = cv.detect_common_objects(frame)

    end_noray = time.time()

I’m running on a 16-core CPU. After warm-up, the runtimes look as follows:

num_parallel   Ray       No Ray, sequential
1              1900 ms   350 ms
8              2000 ms   2900 ms
16             4000 ms   6000 ms
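To make sense of the table, I also computed the implied speedup of Ray over the sequential baseline (these are just the values from the table above):

```python
# Runtimes from the table above, in milliseconds
ray_ms = {1: 1900, 8: 2000, 16: 4000}
seq_ms = {1: 350, 8: 2900, 16: 6000}

for n in ray_ms:
    # Speedup of the Ray version over the sequential baseline
    speedup = seq_ms[n] / ray_ms[n]
    print(f"num_parallel={n:2d}: speedup = {speedup:.2f}x")
# num_parallel= 1: speedup = 0.18x
# num_parallel= 8: speedup = 1.45x
# num_parallel=16: speedup = 1.50x
```

So even in the best case I only get about 1.5x, nowhere near the 16x I would hope for on 16 cores.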

If I run top in another shell, it shows Ray using 100% of the CPU in all three cases (even when num_parallel = 1). I am now trying to understand these runtimes:

The OpenCV function itself supports parallel execution. The only explanation I can find for these runtimes is that each Ray worker is always placed on 2 CPU cores. But shouldn’t the CPU utilization reported by top be lower then? Also, is there a way to place a worker on more CPU cores, as happens when the function is called directly without Ray?
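For comparison, this is the shape of the same benchmark done with only the standard library and a thread pool. Note that `detect` here is a pure-Python stand-in I made up, and since it holds the GIL it will not actually scale across cores the way a GIL-releasing OpenCV call would; it only illustrates the measurement harness:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def detect(frame):
    # Stand-in for cv.detect_common_objects. A real OpenCV call releases
    # the GIL, so threads can then run it on multiple cores in parallel;
    # this pure-Python loop cannot.
    return sum(i * i for i in range(50_000))

def bench(num_parallel, num_runs=3):
    # Best wall-clock time over num_runs for num_parallel pooled calls,
    # and for the same work done sequentially.
    frame = None  # placeholder; the real code would pass an image here
    best_pool = best_seq = float("inf")
    with ThreadPoolExecutor(max_workers=num_parallel) as pool:
        for _ in range(num_runs):
            start = time.time()
            list(pool.map(detect, [frame] * num_parallel))
            best_pool = min(best_pool, time.time() - start)

            start = time.time()
            for _ in range(num_parallel):
                detect(frame)
            best_seq = min(best_seq, time.time() - start)
    return best_pool, best_seq

pool_t, seq_t = bench(8)
print(f"pool: {pool_t * 1000:.0f} ms, sequential: {seq_t * 1000:.0f} ms")
```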

PS: I also tried omitting num_cpus=min(16, num_parallel) in ray.init(), and this didn’t change the runtimes. I keep it now to make sure idle Ray processes don’t push up the CPU utilization.