Calling an application that relies on Ray inside a remote function

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I want to run AutoGluon (which uses Ray internally) on multiple datasets, so I basically have a for loop over the datasets and call AutoGluon on each one. I have several identical VMs (8 CPUs, 16 GB of RAM each), and my idea was to create a Ray cluster with all these VMs, have each VM run one task (one dataset), and let Ray handle the scheduling, catch out-of-memory errors, etc. It was important for me to have one task per VM so that I was sure each AutoGluon run was confined to a single VM with similar resources. I know I can limit the resources of each task, but I wanted to avoid data exchange between different VMs within a single AutoGluon run. I had two issues:

  1. As far as I understood, Ray is not really meant to enforce this kind of isolation (allocate one whole VM per task) but rather to specify the amount of resources for each task (I sketch a possible custom-resource workaround right after this list).
  2. I create my global Ray cluster with the different VMs, but AutoGluon also uses Ray internally and I do not control the AutoGluon code. I wanted AutoGluon to use only the resources of the VM it runs on, not the whole cluster's resources.
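For point 1, the closest I found to "one whole VM per task" is to tag every node with a custom resource when starting it and then require that resource in the task. This is only a sketch of an idea, not something I have fully validated, and the resource name vm_slot is just a name I made up:

import ray

ray.init()

# Assumes each VM was started with a custom resource of capacity 1, e.g.:
#   ray start --address=<head-node-ip>:6379 --resources='{"vm_slot": 1}'
# Requiring one "vm_slot" per task then caps scheduling at one task per VM.
@ray.remote(num_cpus=8, resources={"vm_slot": 1})
def run_one(dataset):
    return dataset  # placeholder body

# futures = [run_one.remote(d) for d in datasets]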

It seems I more or less succeeded in achieving what I wanted, but I would be happy to get any feedback or suggestions on what I could do better, and also some clarification on why this solution does what I want. It especially seems strange to me that I cannot explicitly force AutoGluon to use a given Ray cluster (from a given ray.init).

I have a main script that calls ray.init(), which connects to the global Ray cluster I create from the command line (ray start on the head and worker nodes, each node being a VM here). The remote function for each dataset calls AutoGluon from a subprocess (so that I can isolate the Ray clusters, but I am not sure about the behavior):

import subprocess

import ray

# One task per dataset; num_cpus=8 matches a full VM so at most one task
# runs per node, and max_retries=0 avoids re-running a failed dataset.
@ray.remote(num_cpus=8, max_retries=0)
def run_dataset(dataset):
    # Run AutoGluon in a subprocess so that its internal Ray usage is
    # isolated from the global cluster this task runs on.
    result = subprocess.run(
        ["python", "autogluon_script.py", dataset],
        capture_output=True,
        text=True,
    )

    if result.returncode != 0:
        print(f"Error calling script: {result.stderr}")
        return None

    return result.stdout

ray.init()  # connects to the global cluster started with `ray start`
futures = [run_dataset.remote(dataset) for dataset in datasets]
results = ray.get(futures)
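As a side note, a small sketch I could use to double-check that each dataset really lands on its own VM is to return the node id from inside a task (which_node is just a throwaway helper name):

import ray

@ray.remote(num_cpus=8, max_retries=0)
def which_node(dataset):
    # Report which Ray node this task was scheduled on, so I can verify
    # that different datasets end up on different VMs.
    return (dataset, ray.get_runtime_context().get_node_id())

ray.init()
print(ray.get([which_node.remote(d) for d in ["dataset_a", "dataset_b"]]))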

Then autogluon_script.py does the following:

import sys

import ray
from autogluon.tabular import TabularDataset, TabularPredictor

# Make sure AutoGluon's internal Ray starts a fresh local instance on this VM
# instead of attaching to the global cluster.
ray.init(address="local")

# The dataset path is passed as a command-line argument by the parent task
# (assuming it points at e.g. a CSV file).
dataset = TabularDataset(sys.argv[1])

predictor = TabularPredictor(label='y').fit(
    dataset,
    num_cpus=8,
)

If I don't use a subprocess, it seems that for a given dataset/task AutoGluon uses the global cluster's resources (only 8 CPUs in total, but taken from different VMs). If I use a subprocess but not ray.init(address="local"), I have the same issue. My current solution seems to work, but it is also not clear to me why AutoGluon picks up the local cluster without me explicitly passing this local cluster to the AutoGluon call.
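One sanity check I can add right after the ray.init(address="local") call in autogluon_script.py is to print the visible resources; if only this VM's 8 CPUs show up, I assume AutoGluon's internal Ray is indeed on its own local instance:

import ray

ray.init(address="local")

# If the subprocess got its own local Ray instance, this should report only
# this VM's resources (e.g. 8 CPUs), not the sum over the whole cluster.
print(ray.cluster_resources())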

This is related to my question here: Understanding "Lifetimes of a User-Spawn Process" (I need to set RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true for the subprocess to be killed when the worker is killed).