Deadlock with Ray Remote Function + Tune

Hello,

We’re trying to submit multiple tune runs concurrently using Remote Functions. Our current implementation looks like the following:

    import ray
    from ray import tune

    # args and get_model_identifier are defined elsewhere in our code
    @ray.remote(num_cpus=1)
    def submit_remote_tune(config):
        model_identifier = get_model_identifier(config['env_config'])
        print(f"Running tune job for {model_identifier}...")
        return tune.run(
            args.algo,
            stop={"timesteps_total": args.train_steps},
            config=config,
            verbose=2,
            local_dir='./ray_results',
            metric="timesteps_total",
            mode="max",
            name=model_identifier,
            checkpoint_at_end=True,
            num_samples=1,
            sync_config=None
        )
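
We then call this wrapper once per model config from the driver, roughly like this (a sketch; how `configs` is built is omitted):

    # Driver side (sketch): fan out one wrapped tune.run per config and wait for all of them
    ray.init(address="auto")  # connect to the already-running head node

    futures = [submit_remote_tune.remote(config) for config in configs]
    results = ray.get(futures)  # blocks until every tune run finishes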

This approach works as expected roughly 50% of the time. When it works, all of the tune trials move through the expected life-cycle stages (pending, running, completed, etc.). When it doesn’t, the trials sit in “PENDING” for 5+ hours, even though the job should take less than 20 minutes when it works correctly. I’ve also noticed that while the trials are stuck in PENDING, CPU usage climbs to around 75%, but according to the logs no meaningful work is being done. We call the remote function 12 times on a 16-CPU VM, so there is no resource limitation.

Here is a snippet from the end of the logs (after roughly 5 hours):

    2021-06-16 04:55:08.644667: CPU Usage: 70.0%, Memory Usage: 62.1%, Memory Total/Available: 67551481856/25619558400
    (pid=382) == Status ==
    (pid=382) Memory usage on this node: 39.1/62.9 GiB
    (pid=382) Using FIFO scheduling algorithm.
    (pid=382) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=382) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_595973273
    (pid=382) Number of trials: 1/1 (1 PENDING)
    (pid=382)
    (pid=350) == Status ==
    (pid=350) Memory usage on this node: 39.1/62.9 GiB
    (pid=350) Using FIFO scheduling algorithm.
    (pid=350) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=350) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_393423198528
    (pid=350) Number of trials: 1/1 (1 PENDING)
    (pid=350)
    (pid=380) == Status ==
    (pid=380) Memory usage on this node: 39.0/62.9 GiB
    (pid=380) Using FIFO scheduling algorithm.
    (pid=380) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=380) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_5362341204
    (pid=380) Number of trials: 1/1 (1 PENDING)
    (pid=380)
    (pid=377) == Status ==
    (pid=377) Memory usage on this node: 39.0/62.9 GiB
    (pid=377) Using FIFO scheduling algorithm.
    (pid=377) Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/37.44 GiB heap, 0.0/18.72 GiB objects
    (pid=377) Result logdir: /mnt/batch/tasks/shared/LS_root/jobs/workspace/azureml/training_20210615_1651_1623801124_ac66888a_head/mounts/workspaceblobstore/azureml/training_20210615_1651_1623801124_ac66888a_head/ray_results/39992_4246157445
    (pid=377) Number of trials: 1/1 (1 PENDING)
    (pid=377)

We’re running on Azure ML with a single VM (16 CPUs, 64 GB RAM). There are no worker nodes, just the single head node:

    Starting Ray head...
    Running with Ray version 1.4.0
    Command: ray start --head --redis-shard-ports=6380,6381 --object-manager-port=12345 --node-manager-port=12346 --node-ip-address=10.0.0.14 --port=6379 --dashboard-host=0.0.0.0
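
The Tune status output above shows 0/16 CPUs requested, so the node does register all 16 CPUs. If it helps, this can be double-checked from any driver connected to this head using the standard resource APIs (nothing specific to our setup):

    import ray

    ray.init(address="auto")
    print(ray.cluster_resources())    # total resources the node advertises, e.g. {'CPU': 16.0, ...}
    print(ray.available_resources())  # resources currently free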

Does anyone know why this method works sometimes and gets stuck other times? Is this way of submitting tune jobs concurrently with remote functions risky or not recommended? Any help on how to resolve this issue would be greatly appreciated. Also, if there are other reliable methods for submitting multiple tune jobs concurrently, please let me know.

Thank you!

I had the same issue recently with PPO on ray==2.0.0dev; it never happened with DQN, for instance, though in my case training would run for ~6 hours before getting stuck. I’m trying this on the most recent release, ray==1.4.0, to see whether it happens again before creating another issue.

Hmm, this does seem a little odd. Can you post what happens when you have just 1 (not 12) parallel run, still wrapping the tune.run call in ray.remote?
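
I.e., something along these lines (a sketch, reusing your submit_remote_tune and one of your configs):

    # Launch a single wrapped run and block on it, to see whether one wrapper alone already gets stuck
    ref = submit_remote_tune.remote(configs[0])
    analysis = ray.get(ref)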

@NumberChiffre were you doing the same thing as OP? If not, it would be awesome to post an issue (or even try on the nightly wheels!).