Hi team,
I’m trying to run a PoC for Ray Tune with Horovod + TensorFlow on Kubernetes. My job cannot be scheduled because of a resource allocation issue, even though I believe the cluster has more than enough resources.
Here’s the ray cluster I launched:
1 head node: {CPU: 4, GPU: 2, Memory: 64G} and 6 worker nodes, each with {CPU: 16, GPU: 2, Memory: 64G}
Here’s my Horovod trainable object:
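from ray.tune.integration.horovod import DistributedTrainableCreator

trainable = DistributedTrainableCreator(
    train_func,
    use_gpu=config["use_gpu"],
    num_workers=X,  # X is the number of Horovod workers, varied in the cases below
    num_cpus_per_worker=16,
    replicate_pem=False,
    timeout_s=100,
)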
Here are the failing cases I observed:
- If num_workers > 3 and num_samples in ray.tune = 1, the job cannot be scheduled and fails with the error below:
(WrappedHorovodTrainable pid=1744) pg_timeout=self.settings.placement_group_timeout_s)
(WrappedHorovodTrainable pid=1744) File "/home/jobuser/.shiv/cloudflow-dsl-example_1ad723c3c778eff40262addc5b4c0995164e49a795f5d2652d6e3c00bd8e6981/site-packages/horovod/ray/strategy.py", line 27, in create_placement_group
(WrappedHorovodTrainable pid=1744) ray.available_resources(), pg.bundle_specs))
(WrappedHorovodTrainable pid=1744) TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'accelerator_type:V100': 7.0, 'CPU': 36.0, 'GPU': 14.0, 'memory': 448000000000.0, 'object_store_memory': 130857013413.0, 'node:100.96.111.57': 1.0, 'node:100.96.168.69': 1.0, 'bundle_group_5c2c2753cce32fa8b80ff6c4b2706a34': 4000.0, 'CPU_group_5c2c2753cce32fa8b80ff6c4b2706a34': 64.0, 'CPU_group_2_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'bundle_group_2_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_3_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.153.61': 1.0, 'bundle_group_3_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'bundle_group_1_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_1_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.111.56': 1.0, 'bundle_group_0_5c2c2753cce32fa8b80ff6c4b2706a34': 1000.0, 'CPU_group_0_5c2c2753cce32fa8b80ff6c4b2706a34': 16.0, 'node:100.96.182.47': 1.0, 'node:100.96.97.193': 0.99, 'node:100.96.182.46': 1.0}, resources requested by the placement group: [{'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}]
- If num_workers <= 3 and num_samples in ray.tune = 1, the job runs. However, if I set num_samples = 2, the error happens again, with a message similar to the one above:
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47) File "/home/jobuser/.shiv/cloudflow-dsl-example_1ad723c3c778eff40262addc5b4c0995164e49a795f5d2652d6e3c00bd8e6981/site-packages/horovod/ray/strategy.py", line 27, in create_placement_group
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47) ray.available_resources(), pg.bundle_specs))
(WrappedHorovodTrainable pid=2863, ip=100.96.182.47) TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 448000000000.0, 'GPU': 14.0, 'CPU_group_098e09e0654a0e5a144c2a3de206d28d': 48.0, 'accelerator_type:V100': 7.0, 'object_store_memory': 130857013413.0, 'bundle_group_1_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'bundle_group_098e09e0654a0e5a144c2a3de206d28d': 3000.0, 'CPU_group_1_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'node:100.96.111.57': 1.0, 'CPU_group_2_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'bundle_group_da96cefefb6a0d86f9c37d809d50f9ec': 3000.0, 'node:100.96.168.69': 1.0, 'CPU_group_da96cefefb6a0d86f9c37d809d50f9ec': 48.0, 'bundle_group_2_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0, 'node:100.96.153.61': 1.0, 'CPU_group_0_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'bundle_group_0_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'bundle_group_1_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0, 'CPU_group_1_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'node:100.96.111.56': 1.0, 'node:100.96.182.47': 1.0, 'CPU_group_2_098e09e0654a0e5a144c2a3de206d28d': 16.0, 'bundle_group_2_098e09e0654a0e5a144c2a3de206d28d': 1000.0, 'CPU': 4.0, 'node:100.96.97.193': 0.99, 'node:100.96.182.46': 1.0, 'CPU_group_0_da96cefefb6a0d86f9c37d809d50f9ec': 16.0, 'bundle_group_0_da96cefefb6a0d86f9c37d809d50f9ec': 1000.0}, resources requested by the placement group: [{'CPU': 16.0}, {'CPU': 16.0}, {'CPU': 16.0}]
Does this mean that num_samples affects resource allocation? I had assumed the minimum resource requirement would be the resources for a single trial.
Here are the library versions I used:
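To make the numbers in the two logs easier to compare, here is a rough tally of the CPU counts. It is only a sketch: the values are copied from the cluster description and the logs above, `summarize` is a hypothetical helper, and the check is an aggregate one that ignores per-node packing. It is not a statement about how Ray Tune or Horovod schedule internally.

```python
# CPU counts taken from the cluster description above.
HEAD_CPUS = 4
WORKER_NODES = 6
CPUS_PER_WORKER_NODE = 16
TOTAL_CPUS = HEAD_CPUS + WORKER_NODES * CPUS_PER_WORKER_NODE  # 100


def summarize(reserved_cpus, requested_bundles):
    """Hypothetical helper: compare free CPUs with a new placement group request.

    Aggregate check only; it does not model which node each bundle lands on.
    """
    free = TOTAL_CPUS - reserved_cpus
    requested = sum(bundle["CPU"] for bundle in requested_bundles)
    return {"free": free, "requested": requested, "fits": requested <= free}


# Case 1 (fails): num_workers=4, num_samples=1.
# The log shows one existing group holding 64.0 CPUs and 'CPU': 36.0 free.
case1 = summarize(64, [{"CPU": 16.0}] * 4)

# Case 2 (fails): num_workers=3, num_samples=2.
# The log shows two existing groups holding 48.0 CPUs each and 'CPU': 4.0 free.
case2 = summarize(48 + 48, [{"CPU": 16.0}] * 3)

print(case1)
print(case2)
```

By this tally, the free CPU counts match the `'CPU'` values in the logs (36.0 and 4.0), and in both failing cases the new request is larger than what is still free, because existing placement groups with the same bundle shape already hold the rest.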
- ray: 1.12.0
- horovod: 0.22.1