The use of Python multiprocessing along with Ray

Hi, I’d like to find out whether there are any issues with using Python multiprocessing along with Ray (nested parallelism), e.g., regarding performance or serializing/deserializing objects?

import pandas
from multiprocessing import Pool

import ray

ray.init()

df = pandas.DataFrame([1])
o_ref = ray.put(df)  # put the DataFrame into Ray's shared object store

def f(obj):
    # Fetch the shared DataFrame from the object store in the worker process.
    local_df = ray.get(o_ref)
    # Return a new DataFrame instead of mutating in place; objects returned
    # by ray.get can be backed by read-only buffers.
    return local_df + obj

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1]))

I’d appreciate it if you could point me to some docs on this, if any exist.

Thanks in advance!

@sangcho, any comments/thoughts?

The normal Python multiprocessing can cause conflicts through nesting and excessive resource use (i.e., launching duplicated Ray clusters on the same machine), but you can use the Ray-integrated multiprocessing instead: Distributed multiprocessing.Pool — Ray v1.9.0
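For reference, here is a minimal sketch of the example above rewritten against the Ray-integrated pool. ray.util.multiprocessing.Pool mirrors the standard-library Pool API, but its workers are Ray actors scheduled by the existing Ray cluster rather than raw OS processes:

import pandas

import ray
from ray.util.multiprocessing import Pool

ray.init()

df = pandas.DataFrame([1])
o_ref = ray.put(df)

def f(obj):
    # Runs inside a Ray actor, so ray.get talks to the same cluster
    # rather than trying to start a new one in a forked process.
    local_df = ray.get(o_ref)
    return local_df + obj

if __name__ == '__main__':
    # Pool "processes" here are Ray actors managed by the cluster.
    with Pool(5) as p:
        print(p.map(f, [1]))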

I don’t quite get it. What do you mean by duplicated Ray clusters? Will we end up with multiple driver processes? That would not be acceptable.

Yes, Ray should be managing all the parallelism of your program. Trying to mix raw processes with Ray is an anti-pattern. Instead, use Ray tasks for parallelism, or the Ray multiprocessing library.
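As an example, here is a minimal sketch of the same workload expressed as a Ray task instead of a multiprocessing pool. Passing the ObjectRef as a task argument lets Ray resolve it to the DataFrame before the function runs:

import pandas

import ray

ray.init()

df = pandas.DataFrame([1])
o_ref = ray.put(df)

@ray.remote
def f(local_df, obj):
    # Ray resolves the ObjectRef argument to the actual DataFrame
    # before invoking the task on one of its worker processes.
    return local_df + obj

# Launch one task per input; Ray handles scheduling and serialization.
print(ray.get([f.remote(o_ref, x) for x in [1]]))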

@ericl, thanks a lot! By the way, did you encounter any issues when using Python multiprocessing along with Ray? The reason I’m asking is that the example I posted works without any error.

@ericl, just a friendly reminder.

@YarShev, Ray is designed to manage the entire distributed application. If your application is launching multiple Ray clusters internally, that’s quite strange and can cause issues like running out of memory, not to mention that the separate Ray clusters won’t be able to communicate with each other.

@ericl, thank you for the answer!