Resource allocation issue for ray tune with horovod on k8s

WEI-YU_YEN · September 13, 2022, 6:48pm

Hi team,

I’m trying to run a PoC for Ray Tune with Horovod + TensorFlow on k8s. My job cannot be scheduled because of a resource allocation issue, even though I think I have more than enough resources.

Here’s the ray cluster I launched:
1 head node: {CPU: 4, GPU: 2, Memory: 64G} and 6 worker nodes: {CPU: 16, GPU: 2, Memory: 64G}

Here’s my Horovod trainable object:

# X is the number of Horovod workers; it is varied in the cases below.
trainable = DistributedTrainableCreator(
    train_func,
    use_gpu=config["use_gpu"],
    num_workers=X,
    num_cpus_per_worker=16,
    replicate_pem=False,
    timeout_s=100,
)

Here are the failing cases I observed.

If the number of workers is > 3 and num_samples in Ray Tune = 1, my jobs cannot be scheduled, and the error message is as below:

(WrappedHorovodTrainable pid=1744)     pg_timeout=self.settings.placement_group_timeout_s)
(WrappedHorovodTrainable pid=1744)   File "/home/jobuser/.shiv/cloudflow-dsl-example_1ad723c3c778eff40262addc5b4c0995164e49a795f5d2652d6e3c00bd8e6981/site-packages/horovod/ray/strategy.py", line 27, in create_placement_group
(WrappedHorovodTrainable pid=1744)     ray.available_resources(), pg.bundle_specs))
(WrappedHorovodTrainable pid=1744) TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:V100': 7.0, 'CPU': 36.0, 'GPU': 14.0, 'memory': 448000000000.0, 'object_store_memory': 130857013413.0, 'node:100.96.111.57': 1.0, 'node:100.96.168.69': 1.0, 'bundle_group_5c2c2753cce32fa8b80ff6c4b2706a34': 4000.0, 'CPU_group_5c2c2753cce32fa8b80ff6c4b2706a34': 64.0, 'CPU_group_2_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'bundle_group_2_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_3_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.153.61': 1.0, 'bundle_group_3_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'bundle_group_1_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_1_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.111.56': 1.0, 'bundle_group_0_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_0_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.182.47': 1.0, 'node:100.96.97.193': 0.99, 'node:100.96.182.46': 1.0}, resources requested by the placement group: [{'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}]
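One way to read the numbers in this error is to add them up. The sketch below is only bookkeeping on values copied from the message and the cluster description. The interpretation that the `CPU_group_<id>` entry represents CPUs already reserved by an existing placement group is an assumption about how Ray reports reservations, and the variable names are made up for illustration.

```python
# Values copied from the first TimeoutError above. Only the un-indexed
# CPU_group_<id> entry is kept; the indexed CPU_group_0..3 entries describe
# the same 64 CPUs bundle by bundle and would double count if summed.
available = {
    "CPU": 36.0,
    "CPU_group_5c2c2753cce32fa8b80ff6c4b2706a34": 64.0,
}
requested_bundles = [{"CPU": 16.0}] * 4

# Cluster as described in the post: 1 head node (4 CPUs) + 6 workers (16 CPUs each).
total_cpus = 4 + 6 * 16

# Assumption: CPU_group_<id> is capacity already held by an earlier placement group.
reserved = sum(v for k, v in available.items() if k.startswith("CPU_group_"))
free = available["CPU"]
needed = sum(bundle["CPU"] for bundle in requested_bundles)

print("total:", total_cpus, "reserved:", reserved, "free:", free, "needed:", needed)
print("new request fits in free CPUs:", needed <= free)
```

Under that reading, reserved plus free equals the full 100 CPUs of the cluster, so all nodes appear to be counted, and the second request for 64 CPUs is competing with a reservation of the same size rather than with missing nodes.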

In contrast, if I set the number of workers <= 3 and num_samples = 1, it works. However, if I set num_samples = 2, the error happens again, with a similar message:

(WrappedHorovodTrainable pid=2863, ip=100.96.182.47)   File "/home/jobuser/.shiv/cloudflow-dsl-example_1ad723c3c778eff40262addc5b4c0995164e49a795f5d2652d6e3c00bd8e6981/site-packages/horovod/ray/strategy.py", line 27, in create_placement_group
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47)     ray.available_resources(), pg.bundle_specs))
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47) TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 448000000000.0, 'GPU': 14.0, 'CPU_group_098e09e0654a0e5a144c2a3de206d28d': 48.0, 'accelerator_type:V100': 7.0, 'object_store_memory': 130857013413.0, 'bundle_group_1_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'bundle_group_098e09e0654a0e5a144c2a3de206d28d': 3000.0, 'CPU_group_1_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'node:100.96.111.57': 1.0, 'CPU_group_2_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'bundle_group_da96cefefb6a0d86f9c37d809d50f9ec': 3000.0, 'node:100.96.168.69': 1.0, 'CPU_group_da96cefefb6a0d86f9c37d809d50f9ec': 48.0, 'bundle_group_2_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0, 'node:100.96.153.61': 1.0, 'CPU_group_0_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'bundle_group_0_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'bundle_group_1_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0, 'CPU_group_1_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'node:100.96.111.56': 1.0, 'node:100.96.182.47': 1.0, 'CPU_group_2_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'bundle_group_2_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'CPU': 4.0, 'node:100.96.97.193': 0.99, 'node:100.96.182.46': 1.0, 'CPU_group_0_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'bundle_group_0_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0}, resources requested by the placement group: [{'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}]

Does this mean num_samples may affect the resource allocation? I originally thought the minimum resource requirement would be the resources for 1 trial.

Here are the library versions I used:
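The three observations (4 workers with 1 sample fails, 3 workers with 1 sample works, 3 workers with 2 samples fails) are consistent with a simple model in which every trial's bundles get reserved twice and all sampled trials start concurrently. This is a hypothesis that fits the logs, not a confirmed description of Ray or Horovod internals. The function name and the `reservations_per_trial` parameter are illustrative.

```python
WORKER_NODES = 6       # from the cluster description
CPUS_PER_NODE = 16     # CPUs on each worker node
CPUS_PER_BUNDLE = 16   # num_cpus_per_worker in the post

def can_schedule(num_workers, num_samples, reservations_per_trial=2):
    """Check whether all bundles fit on the worker nodes.

    Assumption: each trial's bundles are reserved `reservations_per_trial`
    times, and all `num_samples` trials start at once. The 4-CPU head node
    is ignored because a 16-CPU bundle cannot fit on it.
    """
    bundles_per_node = CPUS_PER_NODE // CPUS_PER_BUNDLE
    capacity = WORKER_NODES * bundles_per_node
    demand = num_samples * reservations_per_trial * num_workers
    return demand <= capacity

print(can_schedule(4, 1))  # the first failing case
print(can_schedule(3, 1))  # the working case
print(can_schedule(3, 2))  # the second failing case
print(can_schedule(3, 2, reservations_per_trial=1))  # same case, single reservation
```

With a single reservation per trial, which is what the question expects, 2 samples of 3 workers would fit exactly on the 6 worker nodes.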

ray: 1.12.0
horovod: 0.22.1

kai · September 14, 2022, 11:31am

Are all six nodes online? In the first error it looks like only 2 worker nodes are online, and thus the request for 4*16 = 64 CPUs can’t be placed.

How are you calling Ray Tune? Are you passing the correct resource requirements on to Tune?

Generally, the DistributedTrainableCreator is deprecated. In Ray 2.0.0, we use Ray AIR’s HorovodTrainer instead, which has a simpler API and takes care of all resource requests automatically.

See e.g. Deep Learning User Guide — Ray 2.0.0
and Ray Train API — Ray 3.0.0.dev0
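For orientation, a minimal sketch of what the HorovodTrainer route could look like on Ray 2.0 is shown here. The worker count, the 16 CPUs per worker, and `num_samples` mirror the original post and are assumptions, `build_tuner` is a hypothetical helper, and `train_func` stands for the user's own training loop. The Ray imports sit inside the function only so the snippet loads on machines without Ray installed.

```python
def build_tuner(train_func, num_workers=3, num_samples=2):
    """Wrap a Horovod training loop in a Tuner (sketch for Ray 2.0.x)."""
    # Imported lazily so this file can be loaded without Ray present.
    from ray import tune
    from ray.air.config import ScalingConfig
    from ray.train.horovod import HorovodTrainer

    trainer = HorovodTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(
            num_workers=num_workers,
            use_gpu=True,
            # Assumed value, taken from num_cpus_per_worker in the post.
            resources_per_worker={"CPU": 16},
        ),
    )
    # Tune derives each trial's resource request from the trainer's ScalingConfig.
    return tune.Tuner(
        trainer,
        tune_config=tune.TuneConfig(num_samples=num_samples),
    )
```

Calling `build_tuner(train_func).fit()` would then launch the trials. Exact parameter names should be checked against the documentation for the installed Ray version.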

Is upgrading Ray an option for you?

WEI-YU_YEN · September 15, 2022, 10:16pm

Hi Kai,

Thanks for the reply. I upgraded Horovod to 0.23.0 and the issue is fixed. The relevant change seems to be this PR:

github.com/horovod/horovod

Make RayExecutor use the current placement group if one exists

horovod:master ← Yard1:horovod_ray_inherit_pg
opened 04:29PM - 27 Aug 21 UTC by Yard1 (+93 -21)

Description: Adds a `PGStrategy` to `RayExecutor`, which will automatically capture the placement group should one be currently present, and use it for Horovod.
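Since the fix reported in this thread came from upgrading Horovod, a job could check the installed version before starting Tune. The sketch below assumes, based only on this thread, that 0.23.0 is the first release where the problem no longer appears. The helper names are illustrative.

```python
import re
from importlib import metadata

# Assumption from this thread: the issue went away after upgrading to 0.23.0.
MIN_HOROVOD = (0, 23, 0)

def parse_version(text):
    """Turn a string like '0.23.0' into (0, 23, 0), using the first three numbers."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])

def horovod_is_new_enough(installed=None):
    """Compare a version string (or the installed package) against MIN_HOROVOD."""
    if installed is None:
        # Raises PackageNotFoundError if horovod is not installed.
        installed = metadata.version("horovod")
    return parse_version(installed) >= MIN_HOROVOD

print(horovod_is_new_enough("0.22.1"))  # version from the original post
print(horovod_is_new_enough("0.23.0"))  # version after the upgrade
```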
