Resource deadlock in TorchTrainer?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,

Is it possible to run remote tasks inside the training function of a TorchTrainer? Here is a minimal example that never reaches the print("end") call. So I wonder how the train function can obtain enough resources to run the remote task f().

import time

import ray
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


@ray.remote
def f():
    time.sleep(1)
    print("check")


def train_func():
    print("start")
    # This hangs: the tasks stay pending forever and "end" is never printed.
    ray.get([f.remote() for _ in range(10)])
    print("end")


def main():
    ray.init()
    scaling_config = ScalingConfig(num_workers=2)
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=scaling_config,
    )
    trainer.fit()


if __name__ == "__main__":
    main()
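
To show what I mean by "enough resources", here is a rough diagnostic sketch (assuming ray.available_resources() and ray.util.get_current_placement_group() report what I think they do): passing this as train_loop_per_worker prints the free cluster resources and the placement group the train worker runs in, so you can see whether there is any spare CPU left for f.

import ray


def train_func_debug():
    # Cluster-wide resources that are still unreserved at this point.
    print("available:", ray.available_resources())
    # Placement group the train worker runs in; tasks launched from here may be
    # captured into it and then wait for CPUs the group does not have.
    pg = ray.util.get_current_placement_group()
    if pg is not None:
        print("placement group bundles:", pg.bundle_specs)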

Huaiwei (Ray team) has confirmed this issue in the Ray Slack channel.

@justinvyu @matthewdeng
I’m not sure whether this is a bug or expected behavior.
I can run the script to completion by specifying num_cpus=0 for the task f:

import time


@ray.remote(num_cpus=0)  # the task requests no CPU, so it is not stuck waiting for one
def f():
    time.sleep(1)
    print("check")

The author created a GitHub issue here: [core][AIR] Deadlock in TorchTrainer? · Issue #32856 · ray-project/ray · GitHub

I think the default behavior doesn’t work, and the workaround of setting num_cpus=0 is tantamount to asking a hybrid car driver to turn off the electric mode so the car runs on gas without problems.

IMHO, this should be considered an issue. :slight_smile:

See [core][AIR] Deadlock in TorchTrainer? · Issue #32856 · ray-project/ray · GitHub for the workaround and clarification.

Could you also mark this post as resolved?

@kersten Oh oops, I meant marking this Discuss post as resolved, not the GitHub issue.

That makes sense! Thanks for re-opening.