Ray not parallelising function

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have the following function that extracts the text out of HTML files, which I am trying to parallelise:

def extract_text_from_html(html_files):
    """Extract relevant text data from each HTML file."""
    return [get_html_text(file=file) for file in html_files]

# Convert to remote function
extract_text_from_html_ = ray.remote(extract_text_from_html)

I am running it like so:

import psutil
import ray

ray.init(num_cpus=psutil.cpu_count())  # 12 cores
html_files = load_html_files()
text_ids = extract_text_from_html_.remote(html_files)
texts = ray.get(text_ids)

When I check the Ray Dashboard while the function is running, I see only a single process instead of 12. Why aren't all the cores being used?

I managed to fix the issue, but the fix seems to contradict what the documentation says about Avoiding Tiny Tasks.
I had to convert the inner function into a remote function to utilise all the cores. Could I get an explanation of why this is the case?

html_files = load_html_files()
get_html_text_ = ray.remote(get_html_text)
text_ids = [get_html_text_.remote(file) for file in html_files]

Note: get_html_text() loads the file and extracts the text out of it.

Thanks for the question @yudhiesh

Basically, you should use a granularity that is neither too big (like in your original post, where everything runs as just one Task) nor too small (like in your second post, where every file is its own Task).

Depending on how long a single get_html_text takes, you might want to batch a few of them into a Task. E.g. you can break html_files into a number of smaller lists.
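
One rough way to pick a batch size is to aim for a few batches per core, so every worker stays busy without creating thousands of tiny Tasks. The sketch below is only a heuristic under that assumption: pick_batch_size and batches_per_cpu are hypothetical names, not part of Ray, and the right value still depends on how long a single get_html_text call takes.

```python
import math


def pick_batch_size(num_items, num_cpus, batches_per_cpu=4):
    """Return a batch size that yields roughly num_cpus * batches_per_cpu batches.

    Hypothetical helper (not a Ray API); batches_per_cpu=4 is an assumed default.
    """
    if num_items <= 0:
        return 1
    target_batches = max(1, num_cpus * batches_per_cpu)
    # Round up so that every item lands in some batch.
    return max(1, math.ceil(num_items / target_batches))


# Example: 10,000 files on 12 cores gives a target of 48 batches.
batch_size = pick_batch_size(num_items=10_000, num_cpus=12)
print(batch_size)  # prints 209
```

If individual files vary a lot in size, a larger batches_per_cpu may balance the load better, at the cost of more scheduling overhead.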

Thank you, yes, batching the files solved it. I needed to apply the batching to extract_text_from_html(), which already takes a list of files, rather than to get_html_text().

def chunk_list(elements, batch_size):
    for i in range(0, len(elements), batch_size):
        yield elements[i : i + batch_size]

# Convert the function that handled a list of files into a remote function
extract_text_from_html_ = ray.remote(extract_text_from_html)

# Each element is a batch (list) of up to 100 files, and each batch becomes one Task
html_file_batches = list(chunk_list(load_html_files(), batch_size=100))
text_ids = [extract_text_from_html_.remote(batch) for batch in html_file_batches]

All cores are being utilised now. Timings for both approaches were almost identical.
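
One thing to keep in mind with the batched version: since each Task returns a list, ray.get(text_ids) gives back one list per batch rather than a single flat list of texts. A minimal sketch of flattening it is below. It uses stand-in data in place of the real ray.get result, and batched_texts is a hypothetical name.

```python
from itertools import chain

# Stand-in for ray.get(text_ids): one list of texts per batch, in submission order.
batched_texts = [["text 1", "text 2"], ["text 3"], ["text 4", "text 5"]]

# Flatten into a single list while preserving the original file order.
texts = list(chain.from_iterable(batched_texts))
print(texts)
```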

It seems like the fundamental confusion is as to what ray.remote does. It does not automatically “parallelize” your function call. Instead, it runs the function asynchronously on a background process.

If you want a “parallel application” of the function, you will have to launch multiple of these functions asynchronously:

remote_func = ray.remote(func)
remote_func.remote()  # runs func once as a single background task

[remote_func.remote() for _ in range(100)]  # runs func as 100 background tasks

Does that make sense? Let me know if I captured your confusion and question properly.


Thanks for clearing up the confusion, yes that answers it.