Hi, I’d like to find out whether there are any issues with using Python multiprocessing together with Ray (nested parallelism), e.g., regarding performance or serializing/deserializing objects?
import ray
import pandas
from multiprocessing import Pool

ray.init()

df = pandas.DataFrame([1])
o_ref = ray.put(df)  # store the DataFrame in the Ray object store

def f(obj):
    # fetch the shared DataFrame from the object store inside the worker
    local_df = ray.get(o_ref)
    return local_df + obj

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1]))
I’d appreciate it if you could point me to some docs on this, if they exist.
Plain Python multiprocessing can conflict with Ray when nested and cause excessive resource use (e.g., launching duplicate Ray clusters on the same machine), but you can use the Ray-integrated multiprocessing pool instead: Distributed multiprocessing.Pool — Ray v1.9.0
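For illustration, here is a minimal sketch of that drop-in replacement: ray.util.multiprocessing.Pool has the same interface as multiprocessing.Pool but schedules the work onto the Ray cluster instead of forking raw OS processes (the function and inputs here are just placeholders):

from ray.util.multiprocessing import Pool  # Ray's drop-in Pool replacement

def square(x):
    # runs inside a Ray worker process, not a locally forked child
    return x * x

if __name__ == '__main__':
    # Ray's Pool connects to a running Ray cluster, or starts one if needed,
    # so no duplicate clusters are launched
    with Pool(5) as p:
        print(p.map(square, [1, 2, 3]))  # -> [1, 4, 9]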
Yes, Ray should be managing all of the parallelism in your program. Mixing raw processes with Ray is an anti-pattern. Instead, use Ray tasks for parallelism, or the Ray multiprocessing library; see the sketch below.
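As a minimal sketch (assuming the same DataFrame setup as in your snippet), the example rewritten with a Ray task instead of a multiprocessing pool might look like this:

import ray
import pandas

ray.init()

df = pandas.DataFrame([1])
o_ref = ray.put(df)

@ray.remote
def f(obj):
    # the captured ObjectRef resolves against the shared object store
    local_df = ray.get(o_ref)
    return local_df + obj

# launch the tasks in parallel and gather their results
print(ray.get([f.remote(i) for i in [1, 2, 3]]))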
@ericl, thanks a lot! By the way, did you run into any issues using Python multiprocessing together with Ray? The reason I’m asking is that the example I posted works without any error.
@YarShev, Ray is designed to manage the entire distributed application. If your application launches multiple Ray clusters internally, that’s quite strange and can cause issues like running out of memory, not to mention that the Ray clusters won’t be able to communicate with each other.