Calling an application that relies on Ray inside a remote function

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I want to run AutoGluon (which uses Ray internally) on multiple datasets, so I basically have a for loop over the datasets and call AutoGluon on each one. I have several identical VMs (8 CPUs, 16 GB of RAM each), and my idea was to create a Ray cluster with all these VMs, have each VM run one task (one dataset), and let Ray handle the scheduling, catch out-of-memory errors, etc. It was important for me to have one task per VM so that I was sure each AutoGluon run was confined to a single VM with similar resources. I know I can limit the resources of each task, but I wanted to avoid data exchange between different VMs within a single AutoGluon run. I had two issues:

  1. As far as I understood, Ray is not really meant to enforce this kind of isolation (allocate one whole VM per task) but rather to specify the amount of resources for each task (I sketch a possible custom-resource workaround right after this list).
  2. I create my global Ray cluster with the different VMs, but AutoGluon also uses Ray internally and I do not control the AutoGluon code. I wanted AutoGluon to use only the resources of the VM it runs on, not the whole cluster's resources.
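For point 1, the closest I found to "one whole VM per task" is to tag every node with a custom resource when starting it and then require that resource in the task. This is only a sketch of an idea, not something I have fully validated, and the resource name vm_slot is just a name I made up:

import ray

ray.init()

# Assumes each VM was started with a custom resource of capacity 1, e.g.:
#   ray start --address=<head-node-ip>:6379 --resources='{"vm_slot": 1}'
# Requiring one "vm_slot" per task then caps scheduling at one task per VM.
@ray.remote(num_cpus=8, resources={"vm_slot": 1})
def run_one(dataset):
    return dataset  # placeholder body

# futures = [run_one.remote(d) for d in datasets]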

It seems I more or less succeeded in achieving what I wanted, but I would be happy to get any feedback or suggestions on what I could do better, and also some clarification on why this solution does what I want. It especially seems strange to me that I cannot explicitly force AutoGluon to use a given Ray cluster (from a given ray.init).

I have a main script that calls ray.init(), which connects to the global Ray cluster I create from the command line (ray start on the head and worker nodes, each node being a VM here). The remote function for each dataset calls AutoGluon from a subprocess (so that I can isolate the Ray clusters, but I am not sure about the behavior):

import subprocess

import ray

# One task per dataset; num_cpus=8 matches a full VM so at most one task
# runs per node, and max_retries=0 avoids re-running a failed dataset.
@ray.remote(num_cpus=8, max_retries=0)
def run_dataset(dataset):
    # Run AutoGluon in a subprocess so that its internal Ray usage is
    # isolated from the global cluster this task runs on.
    result = subprocess.run(
        ["python", "autogluon_script.py", dataset],
        capture_output=True,
        text=True,
    )

    if result.returncode != 0:
        print(f"Error calling script: {result.stderr}")
        return None

    return result.stdout

ray.init()  # connects to the global cluster started with `ray start`
futures = [run_dataset.remote(dataset) for dataset in datasets]
results = ray.get(futures)
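As a side note, a small sketch I could use to double-check that each dataset really lands on its own VM is to return the node id from inside a task (which_node is just a throwaway helper name):

import ray

@ray.remote(num_cpus=8, max_retries=0)
def which_node(dataset):
    # Report which Ray node this task was scheduled on, so I can verify
    # that different datasets end up on different VMs.
    return (dataset, ray.get_runtime_context().get_node_id())

ray.init()
print(ray.get([which_node.remote(d) for d in ["dataset_a", "dataset_b"]]))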

Then autogluon_script.py does the following:

import sys

import ray
from autogluon.tabular import TabularDataset, TabularPredictor

# Make sure AutoGluon's internal Ray starts a fresh local instance on this VM
# instead of attaching to the global cluster.
ray.init(address="local")

# The dataset path is passed as a command-line argument by the parent task
# (assuming it points at e.g. a CSV file).
dataset = TabularDataset(sys.argv[1])

predictor = TabularPredictor(label='y').fit(
    dataset,
    num_cpus=8,
)

If I don't use a subprocess, it seems that for a given dataset/task AutoGluon uses the global cluster's resources (only 8 CPUs in total, but taken from different VMs). If I use a subprocess but not ray.init(address="local"), I have the same issue. My current solution seems to work, but it is also not clear to me why AutoGluon picks up the local cluster without me explicitly passing this local cluster to the AutoGluon call.
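One sanity check I can add right after the ray.init(address="local") call in autogluon_script.py is to print the visible resources; if only this VM's 8 CPUs show up, I assume AutoGluon's internal Ray is indeed on its own local instance:

import ray

ray.init(address="local")

# If the subprocess got its own local Ray instance, this should report only
# this VM's resources (e.g. 8 CPUs), not the sum over the whole cluster.
print(ray.cluster_resources())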

This is related to my question here: Understanding "Lifetimes of a User-Spawn Process" (I need to set RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true for the subprocess to be killed when the worker is killed).