Deadlock with Ray Remote Function + Tune

Hello,

We’re trying to submit multiple tune runs concurrently using Remote Functions. Our current implementation looks like the following:

    import ray
    from ray import tune

    # args and get_model_identifier are defined elsewhere in our code
    @ray.remote(num_cpus=1)
    def submit_remote_tune(config):
        model_identifier = get_model_identifier(config['env_config'])
        print(f"Running tune job for {model_identifier}...")
        return tune.run(
            args.algo,
            stop={"timesteps_total": args.train_steps},
            config=config,
            verbose=2,
            local_dir='./ray_results',
            metric="timesteps_total",
            mode="max",
            name=model_identifier,
            checkpoint_at_end=True,
            num_samples=1,
            sync_config=None
        )
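
We then call this wrapper once per model config from the driver, roughly like this (a sketch; how `configs` is built is omitted):

    # Driver side (sketch): fan out one wrapped tune.run per config and wait for all of them
    ray.init(address="auto")  # connect to the already-running head node

    futures = [submit_remote_tune.remote(config) for config in configs]
    results = ray.get(futures)  # blocks until every tune run finishes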

This approach works as expected roughly 50% of the time. When it works, all of the tune trials move through the expected life-cycle stages (pending, running, completed, etc.). When it doesn’t, the trials sit in “PENDING” for 5+ hours, even though the job should take less than 20 minutes when it works correctly. I’ve also noticed that while the trials are stuck in PENDING, CPU usage climbs to around 75%, but according to the logs no meaningful work is being done. We call the remote function 12 times on a 16-CPU VM, so there is no resource limitation.

Here is a snippet from the end of the logs (after roughly 5 hours):

    2021-06-16 04:55:08.644667: CPU Usage: 70.0%, Memory Usage: 62.1%, Memory Total/Available: 67551481856/25619558400
    (pid=382) == Status ==
    (pid=382) Memory usage on this node: 39.1/62.9 GiB
    (pid=382) Using FIFO scheduling algorithm.
    (pid=382) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=382) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_595973273
    (pid=382) Number of trials: 1/1 (1 PENDING)
    (pid=382)
    (pid=350) == Status ==
    (pid=350) Memory usage on this node: 39.1/62.9 GiB
    (pid=350) Using FIFO scheduling algorithm.
    (pid=350) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=350) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_393423198528
    (pid=350) Number of trials: 1/1 (1 PENDING)
    (pid=350)
    (pid=380) == Status ==
    (pid=380) Memory usage on this node: 39.0/62.9 GiB
    (pid=380) Using FIFO scheduling algorithm.
    (pid=380) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=380) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_5362341204
    (pid=380) Number of trials: 1/1 (1 PENDING)
    (pid=380)
    (pid=377) == Status ==
    (pid=377) Memory usage on this node: 39.0/62.9 GiB
    (pid=377) Using FIFO scheduling algorithm.
    (pid=377) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=377) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_4246157445
    (pid=377) Number of trials: 1/1 (1 PENDING)
    (pid=377)

We’re running on Azure ML with a single VM (16 CPUs, 64 GB RAM). There are no worker nodes, just the single head node:

    Starting Ray head...
    Running with Ray version 1.4.0
    Command: ray start --head --redis-shard-ports=6380,6381 --object-manager-port=12345 --node-manager-port=12346 --node-ip-address=10.0.0.14 --port=6379 --dashboard-host=0.0.0.0
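
The Tune status output above shows 0/16 CPUs requested, so the node does register all 16 CPUs. If it helps, this can be double-checked from any driver connected to this head using the standard resource APIs (nothing specific to our setup):

    import ray

    ray.init(address="auto")
    print(ray.cluster_resources())    # total resources the node advertises, e.g. {'CPU': 16.0, ...}
    print(ray.available_resources())  # resources currently free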

Does anyone know why this method works sometimes and gets stuck other times? Is this way of submitting tune jobs concurrently with remote functions risky or not recommended? Any help on how to resolve this issue would be greatly appreciated. Also, if there are other reliable methods for submitting multiple tune jobs concurrently, please let me know.

Thank you!

I had the same issue recently with PPO on ray==2.0.0dev; it never happened with DQN, for instance, though in my case training would run for ~6 hours before getting stuck. I’m trying this on the most recent release, ray==1.4.0, to see whether it happens again before creating another issue.

Hmm, this does seem a little odd. Can you post what happens when you have just 1 (not 12) parallel run, still wrapping the tune.run call in ray.remote?
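
I.e., something along these lines (a sketch, reusing your submit_remote_tune and one of your configs):

    # Launch a single wrapped run and block on it, to see whether one wrapper alone already gets stuck
    ref = submit_remote_tune.remote(configs[0])
    analysis = ray.get(ref)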

@NumberChiffre were you doing the same thing as OP? If not, it would be awesome to post an issue (or even try on the nightly wheels!).