Running Tune within a remote function

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello,

We are in the process of upgrading Ray to version 2.6. We are using Ray Tune in a K8s cluster.

We are facing issues when trying to run Ray Tune within a remote function. When running the code below (a minimal example adapted from the tutorials), the head node of the cluster systematically crashes.

If we use a standard function instead of a remote one, everything works fine (autoscaling, etc.); see the sketch after the snippet below.

It used to work with Ray 2.2.

import time

import ray
from ray import tune
from ray.air import session


# The entire Tune experiment is launched from inside this Ray task.
@ray.remote
def run():
    def evaluation_fn(step, width, height):
        time.sleep(0.1)
        return (0.1 + width * step / 100) ** (-1) + height * 0.1

    def easy_objective(config):
        width, height = config["width"], config["height"]

        for step in range(config["steps"]):
            intermediate_score = evaluation_fn(step, width, height)
            session.report(
                {"iterations": step, "mean_loss": intermediate_score}
            )

    tuner = tune.Tuner(
        easy_objective,
        tune_config=tune.TuneConfig(
            metric="mean_loss",
            mode="min",
            num_samples=50,
        ),
        param_space={
            "steps": 50,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            # The grid is repeated num_samples times, so 2 x 50 = 100 trials.
            "activation": tune.grid_search(["relu", "tanh"]),
        },
    )
    results = tuner.fit()
    return results.get_dataframe()


if __name__ == "__main__":
    # Connect to the cluster via Ray Client, then launch Tune inside
    # a remote task on the cluster.
    ray.init("ray://test-kuberay-head-svc:10001")
    res = ray.get(run.remote())
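
For reference, the variant that works is the same code without the remote wrapper, along these lines (a sketch; run_plain is hypothetical shorthand for the undecorated function above):

def run_plain():
    # Identical body to run() above, just executed in the driver
    # process instead of inside a Ray task.
    ...


if __name__ == "__main__":
    ray.init("ray://test-kuberay-head-svc:10001")
    res = run_plain()  # completes fine; autoscaling works as expected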

Hi @Cedric, do you have any logs from the failure? You’re seeing that the head node crashes and shuts down?

Have you also tried submitting the training task as a Ray Job?
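
For example, something along these lines (a sketch; the dashboard address, script name, and working directory are placeholders for your setup):

from ray.job_submission import JobSubmissionClient

# Target the job server on the head service (dashboard port, 8265 by default).
client = JobSubmissionClient("http://test-kuberay-head-svc:8265")

job_id = client.submit_job(
    # tune_script.py would hold the Tuner code above, run as the driver:
    # no @ray.remote wrapper and no ray.init("ray://...") needed.
    entrypoint="python tune_script.py",
    runtime_env={"working_dir": "."},
)
print(f"Submitted job: {job_id}")

That way the Tune driver runs as its own process on the cluster rather than inside a task, which is closer to the recommended deployment pattern.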