Resource deadlock in TorchTrainer?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,

Is it possible to run remote tasks inside the training function of a TorchTrainer? Here is a minimal example that never reaches the print("end") call. So I wonder how the train function can obtain enough resources to run the remote task f().

import time

import ray
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


@ray.remote
def f():
    time.sleep(1)
    print("check")


def train_func():
    print("start")
    # This hangs: the tasks stay pending forever and "end" is never printed.
    ray.get([f.remote() for _ in range(10)])
    print("end")


def main():
    ray.init()
    scaling_config = ScalingConfig(num_workers=2)
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=scaling_config,
    )
    trainer.fit()


if __name__ == "__main__":
    main()
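
To show what I mean by "enough resources", here is a rough diagnostic sketch (assuming ray.available_resources() and ray.util.get_current_placement_group() report what I think they do): passing this as train_loop_per_worker prints the free cluster resources and the placement group the train worker runs in, so you can see whether there is any spare CPU left for f.

import ray


def train_func_debug():
    # Cluster-wide resources that are still unreserved at this point.
    print("available:", ray.available_resources())
    # Placement group the train worker runs in; tasks launched from here may be
    # captured into it and then wait for CPUs the group does not have.
    pg = ray.util.get_current_placement_group()
    if pg is not None:
        print("placement group bundles:", pg.bundle_specs)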

Huaiwei (Ray team) has confirmed this issue in the Ray Slack channel.

@justinvyu @matthewdeng
I’m not sure whether this is a bug or expected behavior.
I can run the script to completion by specifying num_cpus=0 for the task f:

import time


@ray.remote(num_cpus=0)  # the task requests no CPU, so it is not stuck waiting for one
def f():
    time.sleep(1)
    print("check")

The author created a GitHub issue here: [core][AIR] Deadlock in TorchTrainer? · Issue #32856 · ray-project/ray · GitHub

I think the default behavior doesn’t work, and the workaround of setting num_cpus=0 is tantamount to asking a hybrid car driver to turn off the electric mode so the car runs on gas without problems.

IMHO, this should be considered an issue. :slight_smile:

See [core][AIR] Deadlock in TorchTrainer? · Issue #32856 · ray-project/ray · GitHub for the workaround and clarification.

Could you also mark this post as resolved?

@kersten Oh oops, I meant marking this Discuss post as resolved, not the GitHub issue.

That makes sense! Thanks for re-opening.