Ray not utilized efficiently

Hi,

I am a new user to Ray and have a problem in which I have a list of combinations that I need to process. This list can contain 10000 elements, and instead of going through them one by one in a traditional for loop, I thought that I could do it in parallel. E.g., one worker handles the first chunk of 100 elements, another handles the next 100, and so on until all chunks are covered.

I have illustrated my Python flow below. With this flow, I am not noticing any significant speedup compared to running it without Ray.

I am probably utilizing Ray in the wrong way and thought that perhaps someone in this forum could point out what I could do in order to utilize Ray in a correct and efficient manner.

import ray


# Create chunks of size n
def chunk(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i+n]

# The function that executes the computation for a single combo
def process_combos(combo, dataframe1, dataframe2):
    ...
    res_df = some_other_func(combo, dataframe1, dataframe2)
    ...
    temp_df = ...
    return temp_df

@ray.remote
def run_chunks(combos, dataframe1, dataframe2):
    # Process one chunk (a list of combos) in a single Ray task
    return [process_combos(combo, dataframe1, dataframe2) for combo in combos]

ray.init()

# all the combinations that I have to process (list might contain 10000 elements)
all_combos = [(1,2), (3,2), (1,9), ...]

chunk_size = 100
chunked_list = chunk(all_combos, chunk_size)
output = ray.get([run_chunks.remote(combos, dataframe1, dataframe2) for combos in chunked_list])

A few things to check:

  • Do you know how much computation happens in run_chunks? As a rule of thumb, you will probably want these tasks to take at least tens of milliseconds each to see a speedup from parallelism.
  • What is the parallelism on this machine? If the tasks are CPU-intensive, you will need more physical cores or additional machines to see a speedup.
  • How large is a single temp_df? If you are memory-bound, you can use ray.wait to wait for the results and process them one-by-one instead of calling ray.get, which pulls all outputs into driver memory simultaneously.
1 Like

Thanks for your reply Stephanie.

  • As for your first point, a single process_combos() call already seems to take about 0.5 s to 1 s. That is probably why I do not see any speedup. I am fitting a regression by calling a regression function defined in statsmodels.api; preparing the data for the regression takes half of the time, and calling the regression function takes the other half.

  • Not too sure how to answer your second point; I have a 4-core/8-thread processor (Ryzen 3500U). Something that takes about 25 minutes without Ray now takes ~13 minutes with it.

  • The temp_df is not large; computing the results is simply what takes the time.

It sounds like you’re already doing the right things to get parallelism. With 4 cores, you can expect at most a 4x speedup, but it will likely be less than that since not all of the work in a program can be parallelized. You could check a shell utility like top to see whether all of your cores are fully utilized during execution.
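The numbers in this thread are consistent with that: 25 minutes down to ~13 minutes is roughly a 1.9x speedup, and a back-of-the-envelope Amdahl's law calculation (assuming the 4 physical cores are the limit) suggests about 64% of the runtime is being parallelized:

```python
# Rough Amdahl's law check for the numbers reported in this thread.
serial_minutes = 25.0
parallel_minutes = 13.0
cores = 4

speedup = serial_minutes / parallel_minutes  # observed speedup
# Amdahl: speedup = 1 / ((1 - p) + p / n)  =>  solve for parallel fraction p
p = (1 - 1 / speedup) / (1 - 1 / cores)

print(f"speedup: {speedup:.2f}x, implied parallel fraction: {p:.0%}")
```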

Another thing to check: if the tasks are themselves multithreaded, you can actually get interference from contention between the threads.