Resource allocation issue with Ray Tune + Horovod on K8s

Hi team,

I’m trying to run a PoC of Ray Tune with Horovod + TF on k8s. My job cannot be scheduled due to a resource allocation issue, even though I think I have more than enough resources.

Here’s the Ray cluster I launched:

  • 1 head node: {CPU: 4, GPU: 2, Memory: 64G}
  • 6 worker nodes: {CPU: 16, GPU: 2, Memory: 64G}

Here’s my Horovod trainable:

trainable = DistributedTrainableCreator(
    train_func,
    use_gpu=config["use_gpu"],
    num_workers=X,
    num_cpus_per_worker=16,
    replicate_pem=False,
    timeout_s=100,
)

Here are the failed cases I observed:

  1. If num_workers > 3 and num_samples in ray.tune is 1, my job cannot be scheduled, and the error message is as below:
(WrappedHorovodTrainable pid=1744)     pg_timeout=self.settings.placement_group_timeout_s)
(WrappedHorovodTrainable pid=1744)   File "/home/jobuser/.shiv/cloudflow-dsl-example_1ad723c3c778eff40262addc5b4c0995164e49a795f5d2652d6e3c00bd8e6981/site-packages/horovod/ray/strategy.py", line 27, in create_placement_group
(WrappedHorovodTrainable pid=1744)     ray.available_resources(), pg.bundle_specs))
(WrappedHorovodTrainable pid=1744) TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:V100': 7.0, 'CPU': 36.0, 'GPU': 14.0, 'memory': 448000000000.0, 'object_store_memory': 130857013413.0, 'node:100.96.111.57': 1.0, 'node:100.96.168.69': 1.0, 'bundle_group_5c2c2753cce32fa8b80ff6c4b2706a34': 4000.0, 'CPU_group_5c2c2753cce32fa8b80ff6c4b2706a34': 64.0, 'CPU_group_2_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'bundle_group_2_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_3_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.153.61': 1.0, 'bundle_group_3_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'bundle_group_1_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_1_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.111.56': 1.0, 'bundle_group_0_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_0_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.182.47': 1.0, 'node:100.96.97.193': 0.99, 'node:100.96.182.46': 1.0}, resources requested by the placement group: [{'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}]
  2. As above, if I set num_workers <= 3 and num_samples = 1, it works. However, if I set num_samples = 2, the error happens again with a similar message:
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47)   File "/home/jobuser/.shiv/cloudflow-dsl-example_1ad723c3c778eff40262addc5b4c0995164e49a795f5d2652d6e3c00bd8e6981/site-packages/horovod/ray/strategy.py", line 27, in create_placement_group
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47)     ray.available_resources(), pg.bundle_specs))
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47) TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 448000000000.0, 'GPU': 14.0, 'CPU_group_098e09e0654a0e5a144c2a3de206d28d': 48.0, 'accelerator_type:V100': 7.0, 'object_store_memory': 130857013413.0, 'bundle_group_1_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'bundle_group_098e09e0654a0e5a144c2a3de206d28d': 3000.0, 'CPU_group_1_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'node:100.96.111.57': 1.0, 'CPU_group_2_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'bundle_group_da96cefefb6a0d86f9c37d809d50f9ec': 3000.0, 'node:100.96.168.69': 1.0, 'CPU_group_da96cefefb6a0d86f9c37d809d50f9ec': 48.0, 'bundle_group_2_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0, 'node:100.96.153.61': 1.0, 'CPU_group_0_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'bundle_group_0_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'bundle_group_1_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0, 'CPU_group_1_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'node:100.96.111.56': 1.0, 'node:100.96.182.47': 1.0, 'CPU_group_2_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'bundle_group_2_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'CPU': 4.0, 'node:100.96.97.193': 0.99, 'node:100.96.182.46': 1.0, 'CPU_group_0_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'bundle_group_0_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0}, resources requested by the placement group: [{'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}]

Does this mean num_samples affects resource allocation? I originally thought the minimum resource requirement would be the resources needed for a single trial.
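Here’s the back-of-the-envelope check I did. This is just my own sketch: I’m assuming each Tune trial needs num_workers × num_cpus_per_worker CPUs for its placement group, plus roughly one extra CPU for the wrapped trainable/coordinator (that overhead is a guess on my part, not something from the Ray docs):

```python
# Rough capacity check: does the total CPU demand of all concurrent
# trials fit into the cluster's worker CPUs?
def fits(num_samples, num_workers, cpus_per_worker, cluster_cpus,
         coordinator_cpus=1):
    # coordinator_cpus is my assumed per-trial overhead for the
    # wrapped trainable actor itself.
    per_trial = num_workers * cpus_per_worker + coordinator_cpus
    return num_samples * per_trial <= cluster_cpus

worker_cpus = 6 * 16  # 6 worker nodes x 16 CPUs each = 96

print(fits(1, 3, 16, worker_cpus))  # 1 x (3*16 + 1) = 49 <= 96 -> True
print(fits(2, 3, 16, worker_cpus))  # 2 x 49 = 98 > 96        -> False
print(fits(1, 4, 16, worker_cpus))  # 1 x 65 <= 96            -> True
```

By this math the num_workers = 4, num_samples = 1 case should fit, yet it timed out, which is part of what confuses me.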

Here are the library versions I used:

  • ray: 1.12.0
  • horovod: 0.22.1

Are all six worker nodes online? In the first error it looks like only two worker nodes are online, so the request for 4 × 16 = 64 CPUs can’t be placed.

How are you calling Ray Tune? Are you passing the correct resource requirements on to tune?

Generally speaking, DistributedTrainableCreator is deprecated. As of Ray 2.0.0, we use Ray AIR’s HorovodTrainer instead, which has a simpler API and handles all resource requests automatically.
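For reference, a minimal sketch of what that looks like on Ray 2.0.0 (using the Ray AIR API; your existing train_func would still use Horovod internally, and the exact resource numbers here are just placeholders matching your setup):

```python
from ray.air.config import ScalingConfig
from ray.train.horovod import HorovodTrainer

trainer = HorovodTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(
        num_workers=3,            # one Horovod worker per bundle
        use_gpu=True,
        resources_per_worker={"CPU": 16},
    ),
)
result = trainer.fit()
```

To run multiple samples, you would wrap the trainer in a ray.tune.Tuner; Tune then creates one placement group per trial using the resources declared in the ScalingConfig.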

See e.g. the Deep Learning User Guide (Ray 2.0.0 docs) and the Ray Train API reference (Ray 3.0.0.dev0 docs).

Is upgrading Ray an option for you?

Hi Kai,

Thanks for the reply. I upgraded Horovod to 0.23.0 and the issue is fixed.