Ray not parallelising function

yudhiesh · May 28, 2022, 1:47am

How severely does this issue affect your experience of using Ray?

Medium: It makes completing my task significantly more difficult, but I can work around it.

I have the following function that extracts the text out of HTML files, which I am trying to parallelise:

import psutil
import ray

def extract_text_from_html(html_files):
    """
    Extract relevant text data from the HTML file
    """
    return [get_html_text(file=file) for file in html_files]

# Convert to remote function
extract_text_from_html_ = ray.remote(extract_text_from_html)

I am running it like so:

ray.init(num_cpus=psutil.cpu_count()) # 12 cores
html_files = load_html_files()
text_ids = extract_text_from_html_.remote(html_files)
texts = ray.get(text_ids)

Checking the Ray Dashboard while the function runs, I notice that only a single process is running instead of 12. Why aren’t all the cores being used?

yudhiesh · May 28, 2022, 2:18am

I managed to fix the issue, but the fix seems to run counter to what the documentation says about Avoiding Tiny Tasks.
I had to convert the inner function to a remote function to utilise all the cores. Could I get an explanation of why this is the case?

html_files = load_html_files()
get_html_text_ = ray.remote(get_html_text)
text_ids = [get_html_text_.remote(file) for file in html_files]
print(ray.get(text_ids))

Note: get_html_text() loads the file and extracts the text out of it.

zhz · May 28, 2022, 3:19am

Thanks for the question @yudhiesh

Basically, you should use a granularity that is neither too big (like in your original post, with just one Task) nor too small (like in your second post, with one Task per file).

Depending on how long a single get_html_text takes, you might want to batch a few of them into a Task. E.g. you can break html_files into a number of smaller lists.
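
As a rough sketch of how a batch size could be chosen: aim for a few batches per core, so every worker stays busy without creating one Task per file. The helper name `pick_batch_size` and the default of 4 batches per core are assumptions for illustration, not Ray recommendations; the right value depends on how long a single get_html_text call takes.

```python
import math


def pick_batch_size(num_items, num_cpus, batches_per_core=4):
    """Return a batch size giving roughly num_cpus * batches_per_core batches.

    Hypothetical helper: batches_per_core=4 is an arbitrary starting point to tune.
    """
    if num_items <= 0:
        return 1
    target_batches = max(1, num_cpus * batches_per_core)
    return max(1, math.ceil(num_items / target_batches))


# Example: 10,000 files on 12 cores gives 48 target batches.
print(pick_batch_size(10_000, 12))  # 209
```

Fewer, larger batches reduce scheduling overhead; more, smaller batches balance the load better when files vary a lot in size.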

yudhiesh · May 28, 2022, 6:19am

Thank you. Yes, batching the files solved it, but I needed to make extract_text_from_html() the remote function instead of get_html_text().

def chunk_list(elements, batch_size):
    for i in range(0, len(elements), batch_size):
        yield elements[i : i + batch_size]

# Convert the function that handled a list of files into a remote function
extract_text_from_html_ = ray.remote(extract_text_from_html)

# Each element of html_batches is a list of up to 100 files
html_batches = list(chunk_list(load_html_files(), batch_size=100))
text_ids = [extract_text_from_html_.remote(batch) for batch in html_batches]
print(ray.get(text_ids))  # one list of texts per batch

All cores are being utilised now. Timings for both approaches were almost identical.
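
One thing to watch with the batched version: ray.get(text_ids) returns one list per batch, in the same order as the submitted tasks, so the result is a list of lists rather than a flat list of texts. Below is a minimal sketch of flattening it back into input order. It uses plain Python so it runs without Ray; `fake_extract` is a hypothetical stand-in for extract_text_from_html, and the file names are made up.

```python
from itertools import chain


def chunk_list(elements, batch_size):
    for i in range(0, len(elements), batch_size):
        yield elements[i : i + batch_size]


def fake_extract(batch):
    # Hypothetical stand-in for extract_text_from_html: one result per file.
    return [name.upper() for name in batch]


files = ["a.html", "b.html", "c.html", "d.html", "e.html"]

# Shape of the batched result: one inner list per batch.
per_batch = [fake_extract(batch) for batch in chunk_list(files, batch_size=2)]

# Flatten to one entry per input file, preserving order.
texts = list(chain.from_iterable(per_batch))

print(per_batch)
print(texts)
```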

rliaw · June 2, 2022, 9:26pm

It seems like the fundamental confusion is about what ray.remote does. It does not automatically “parallelize” the work inside your function. Instead, each .remote() call runs the function once, asynchronously, in a background process.

If you want a “parallel application” of the function, you have to launch multiple calls of it asynchronously:

remote_func = ray.remote(func)
remote_func.remote()  # runs func once as a background task

[remote_func.remote() for i in range(100)]  # runs func as 100 background tasks
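
The same distinction can be seen without Ray. The sketch below is only an analogy using Python's standard concurrent.futures thread pool, not Ray's API: submitting one call that loops internally creates a single background task, while submitting one call per item creates many tasks the pool can run concurrently. The function `func` and the pool size are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def func(i):
    return i * i


with ThreadPoolExecutor(max_workers=4) as pool:
    # One submission: a single background task, however much work it contains.
    single = pool.submit(lambda: [func(i) for i in range(8)])

    # Eight submissions: eight independent tasks the pool can run concurrently.
    many = [pool.submit(func, i) for i in range(8)]

    single_result = single.result()
    many_results = [future.result() for future in many]

# Both give the same values; only the second form spreads work across workers.
print(single_result)
print(many_results)
```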

Does that make sense? Let me know if I captured your confusion and question properly.

yudhiesh · June 3, 2022, 1:00am

Thanks for clearing up the confusion, yes that answers it.
