Creating pool using cluster started with runtime_env

Looking for explanations of a couple of behaviors that do not appear to be fully explained in the available documentation/issues. I wasn’t sure of the best way to ask, so I have included an example of the behavior. Ideally, I would like to pass runtime_env arguments to ray.init and have a Pool use the processes already started in that cluster, rather than starting more (the pool workers do not need the runtime_env, though it would not be a problem if they had it). Thanks in advance for any thoughts!

When a worker_process_setup_hook (and/or env_vars) is specified in a ray.init call through runtime_env, extra processes are started when a multiprocessing Pool is subsequently created. This complicates resource management, since it allows more active processes than there are processors. Manageable…sure, annoying, absolutely! If runtime_env arguments are not provided to ray.init, the subsequent Pool call connects to the processes already initialized in the cluster.

  1. There does not appear to be a way to add runtime_env arguments to the pool startup, such that existing cluster processes would be used (as is done when runtime_env arguments are not specified/included during ray.init). Is there a way to do so?

  2. env_vars can be specified under runtime_env arguments within a ray.remote header. However, worker_process_setup_hook cannot; it raises TypeError: Object of type function is not JSON serializable. If there is a way to do so, this would be a suitable workaround: using runtime_env arguments per-actor rather than globally through ray.init.

  3. A less important side query: Running ray.init with num_cpus=2 opens 3 additional python processes? What is the 3rd process? Manager for the other 2? Subsequently, starting a pool (with runtime_env having been specified during the ray.init call) doubles the number of processes from 4 to 8…would appreciate some insights into what is going on here.
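The error message in point 2 can be reproduced with the standard library alone. The sketch below only shows that a dict holding a function object cannot be encoded by Python's json module, while a dict holding only env_vars can. That Ray's ray.remote path fails for this same reason is an assumption based on the matching message, and MY_SETTING is a made-up variable name.

```python
import json


def hook():
    return None


# A runtime_env-style dict that carries a function object.
with_hook = {"worker_process_setup_hook": hook, "env_vars": {"MY_SETTING": "1"}}
# The same dict without the function (MY_SETTING is a hypothetical name).
env_only = {"env_vars": {"MY_SETTING": "1"}}

try:
    json.dumps(with_hook)
    error_message = None
except TypeError as exc:
    # json cannot encode function objects, so this branch is taken.
    error_message = str(exc)

env_only_json = json.dumps(env_only)

print(error_message)
print(env_only_json)
```

This matches the observation that env_vars works per-actor while a function-valued hook does not.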

Example

#Windows 11 system with 32 threads
#Python process count from task manager

import ray
from ray.util.multiprocessing import Pool
def hook(): return None

#Running without a setup_hook
### 1 Python process running here
cxt = ray.init(num_cpus=2, include_dashboard=False)
### 4 Python processes running here; 1 original, 2 
pool = Pool(ray_address='auto')
### 4 Python processes running here

#Reset
ray.shutdown()

#With setup hook
### 1 Python process running here
cxt = ray.init(num_cpus=2, include_dashboard=False, runtime_env={"worker_process_setup_hook": hook})
### 4 Python processes running here
pool = Pool(ray_address='auto')
### 8 Python processes running here

It seems likely that when you specify a runtime_env (such as worker_process_setup_hook or env_vars) in ray.init, Ray creates a new runtime environment for the driver and all workers started for that job. When you then create a Pool, it starts a set of actors (one per pool worker) that do not reuse the existing worker processes, but instead launch additional ones, each with the specified runtime environment. This would explain why the number of Python processes doubles after creating the pool: Ray is not reusing the original workers, but starting new processes for the pool's actors. This is consistent with the Ray multiprocessing Pool being implemented on top of actors.

There does not appear to be a supported way to have the Pool reuse already-initialized cluster workers with a specific runtime_env; the pool always creates its own actors. Also, worker_process_setup_hook cannot be passed as a function per-actor via runtime_env because functions are not JSON serializable, as you observed. As for the extra process: with num_cpus=2, Ray starts two worker processes in addition to the driver. The third additional Python process is likely one of Ray's Python support processes rather than the raylet (node manager) itself, since the raylet runs as a native binary and would not normally be listed as a Python process. When you start a pool, it creates additional actors, further increasing the process count.
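One possible workaround, sketched below, is to leave runtime_env out of ray.init and run the per-process setup through the Pool's own initializer argument, which mirrors the standard multiprocessing.Pool parameter. The names init_worker, read_setting, run_pool_demo, and MY_SETTING are hypothetical, and whether this keeps the process count at 4 on the system described above has not been verified here.

```python
import os


def init_worker(value="1"):
    """Per-process setup, standing in for the setup hook (hypothetical)."""
    os.environ["MY_SETTING"] = value  # MY_SETTING is a made-up variable name


def read_setting(_):
    """Task that reports what the initializer configured in this process."""
    return os.environ.get("MY_SETTING")


def run_pool_demo():
    # Ray is imported lazily so the helpers above work without it installed.
    import ray
    from ray.util.multiprocessing import Pool

    # No runtime_env here, so the pool may reuse the workers already started.
    ray.init(num_cpus=2, include_dashboard=False)
    # initializer/initargs run once in each pool actor at startup.
    pool = Pool(processes=2, initializer=init_worker, initargs=("1",))
    try:
        return pool.map(read_setting, range(4))
    finally:
        pool.terminate()
        ray.shutdown()
```

Calling run_pool_demo() from a script starts the cluster, runs the tasks, and shuts everything down again. This keeps the setup scoped to the pool workers instead of every worker in the job.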

Would you like a more detailed breakdown of the process lifecycle or resource management in this scenario?


Opened a related issue regarding the use of runtime_env with ActorPool: #62442