Hi,
I am a new Ray user and have a problem in which I have a list of combinations that I need to process. This list can contain 10,000 elements, and instead of going through them one by one in a traditional for loop, I thought I could do it in parallel. E.g., one worker processes the first chunk of 100 elements, another processes the next chunk of 100, and so on until all chunks are covered.
I have illustrated my Python flow below. With this flow, I am not noticing any significant speed-up compared to running without Ray.
I am probably using Ray the wrong way, and I am hoping someone in this forum can point out what I should do to use Ray in a correct and efficient manner.
import ray
# Create chunks of size n
def chunk(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i+n]
# The function that executes computations
def process_combos(combo, dataframe1, dataframe2):
    ...
    res_df = some_other_func(combo, dataframe1, dataframe2)
    ...
    temp_df = ...
    return temp_df
# Process one chunk of combos (renamed the loop variable so it no longer
# shadows the chunk() helper above)
@ray.remote
def run_chunks(combos, dataframe1, dataframe2):
    return [process_combos(combo, dataframe1, dataframe2) for combo in combos]
ray.init()

# All the combinations I have to process (the list might contain 10,000 elements)
all_combos = [(1, 2), (3, 2), (1, 9), ...]

chunk_size = 100
chunked_list = chunk(all_combos, chunk_size)

# One remote task per chunk of 100 combos; dataframe1 and dataframe2 are defined elsewhere
output = ray.get([run_chunks.remote(combo_chunk, dataframe1, dataframe2) for combo_chunk in chunked_list])
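For what it's worth, the chunk() helper on its own behaves as intended; here is a minimal standalone check (the 10-element list and chunk size of 3 are just made-up values for illustration):

```python
# Standalone check of the chunk() helper: it splits a list into
# fixed-size pieces, with the last piece possibly shorter.
def chunk(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i+n]

pieces = list(chunk(list(range(10)), 3))
print(pieces)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

So with chunk_size = 100 and 10,000 combos, the list comprehension above should launch 100 remote tasks, one per chunk.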