Understanding runtimes and task placement on hardware

Hi,

I also asked this question on StackOverflow but will try it here as well:

I am completely new to Ray and am trying to understand runtimes. I want to run the YOLO object detector from OpenCV. I wrote the following code to benchmark performance:

import time
import ray
import cvlib as cv  # detect_common_objects comes from the cvlib package (built on OpenCV)

@ray.remote
def f(x):
    a, b, c = cv.detect_common_objects(x)

# reading in frame etc. omitted
# num_runs is the number of benchmarking runs 
# num_parallel is the number of parallel OpenCV function calls in Ray


ray.init(num_cpus=min(16, num_parallel))

# run with Ray in parallel
for j in range(num_runs):

    start = time.time()

    result_ids = []
    for i in range(num_parallel):
        result_ids.append(f.remote(frame))
    results = ray.get(result_ids)

    end = time.time()
    print(end-start)

# benchmark without Ray: Sequentially call OpenCV function
for j in range(num_runs):

    start_noray = time.time()
    
    for i in range(num_parallel):
        a, b, c = cv.detect_common_objects(frame)

    end_noray = time.time()
    print(end_noray-start_noray)


I’m running on a 16-core CPU. After warm-up, the runtimes look as follows:

| num_parallel | Ray | No Ray, sequential |
|---|---|---|
| 1 | 1900 ms | 350 ms |
| 8 | 2000 ms | 2900 ms |
| 16 | 4000 ms | 6000 ms |

If I run top in another shell, it shows Ray using 100% of the CPU in all 3 cases (even when num_parallel = 1). I am now trying to make sense of these runtimes:

The OpenCV function itself allows for parallel execution. The only explanation I can find for these runtimes is that each Ray worker is always placed on 2 CPU cores. But shouldn’t the CPU utilization in top be lower then? Also, is there a way to place a worker on more CPU cores, like what happens when the function is called directly without Ray?

PS: I also omitted the num_cpus=min(16, num_parallel) in ray.init() and this didn’t change the runtimes. I keep it now to make sure idle Ray processes don’t push up the CPU utilization.

If I understand correctly, you are asking why Ray is always using 100% of the CPU in all 3 cases? @ferdi_kossmann
cc: @jjyao

@ferdi_kossmann thanks for the question

A few things to consider:

  • Each remote call creates a new task that runs in its own worker process. In the example, this allows num_parallel tasks to run in parallel, each in a separate process.
  • num_cpus is only used by the scheduler to decide how many tasks to run concurrently; there is no explicit pinning of a worker process to particular CPU cores.

To throttle the number of parallel tasks, the recommendation is to set num_cpus either for the whole cluster (ray.init) or for individual tasks (ray.remote).

Depending on what the input (frame) is: if it is a large argument, passing it to every task may also impact performance. One way to avoid that is to first write the argument to the Ray object store with ray.put (per ‘Tips’ in Ray Core API — Ray 2.1.0).

Let me know if this helps