Is there a way to remove/switch off the autoscaler? I have a cluster workflow that starts a head node and runs a fixed number of models. Each model is run on its own worker node, and as each worker completes I would ideally like it to be removed. In the YAML config I have tried setting:
upscaling_speed: 0.0  # I know the docs say not to, but this seemed like the logical way to do it
I have also tried setting initial_workers, max_workers and min_workers to the exact number of workers required, and setting target_utilization_fraction to 1.0.
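For reference, the relevant part of my YAML currently looks roughly like this (the counts are illustrative; the point is that everything is pinned to the exact number of workers I need):

```yaml
# Sketch of the autoscaler settings I have been experimenting with
min_workers: 3
max_workers: 3
initial_workers: 3
upscaling_speed: 0.0              # the docs warn against this, but it seemed logical
target_utilization_fraction: 1.0
```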
The problem is that the process running on each worker is (by design) using almost all of the CPU capacity, which triggers autoscaling. I would like to switch this off. Many thanks.
How many extra instances are you seeing (# of worker nodes - # of models running)? It sounds like the desired behavior should be possible just by setting target_utilization_fraction to 1
Thanks for your suggestion, though I have already tried this and it still autoscales. I have tried setting target_utilization_fraction to 1.1 but it errors with:
Failed validating 'maximum' in schema['properties']['target_utilization_fraction']:

    {'description': 'DEPRECATED. Use upscaling_speed instead.',
     'maximum': 1,
     'minimum': 0,
     'type': 'number'}

On instance['target_utilization_fraction']:

    1.1
I have also tried setting upscaling_speed to a negative value, but that fails with:
Failed validating 'minimum' in schema['properties']['upscaling_speed']:
But autoscaling still takes place up to the maximum number of workers allowed by the cluster, and it is not clear why. I have capped it at 10 workers, but 7 of them sit idle, because when I call:
object_ids = [f.options(name=model).remote(model) for model in models]
models is a list of only 3; the cluster should be able to accept dynamic list sizes and scale appropriately. It makes sense that autoscaling is desirable when CPU usage exceeds a set threshold, but there should also be an option to turn it off for cases where full CPU usage is a design feature of the process.
p.s. I have tried setting upscaling_speed: 0.0001, which does 'kind of' work, but it always starts one more worker than I need, which just sits idle.
cc @Alex, is there any way to get the desired behavior here? I suspect it might be possible with custom resources, but I'm wondering if there's a simpler workaround.
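Something along these lines might work, as an untested sketch (model_slot is just a made-up resource name): give each worker node one unit of a custom resource, e.g. by adding --resources='{"model_slot": 1}' to the worker's ray start command in the cluster YAML, then have each model task request a whole unit, so the autoscaler only sees demand for as many nodes as there are models:

```python
import ray

ray.init(address="auto")

@ray.remote
def f(model):
    # placeholder for the real per-model work
    return model

models = ["model_a", "model_b", "model_c"]  # illustrative

# Each task requests one whole "model_slot", so at most one model runs per
# worker node and only len(models) worker nodes are ever asked for.
object_ids = [
    f.options(name=model, resources={"model_slot": 1}).remote(model)
    for model in models
]
results = ray.get(object_ids)
```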
I have set num_cpus=cpus-1 on the nested @ray.remote as a workaround for now, so a cluster of n 16-core boxes will only use 15 cores each for nested Ray processes. This is not ideal, though, as I am not using the full CPU capacity.
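Roughly, the workaround looks like this (run_model is an illustrative name; each of my boxes has 16 cores):

```python
import ray

CPUS_PER_NODE = 16  # cores per box in my cluster

# Workaround sketch: the nested task reserves one CPU fewer than the node
# has, so a node never looks fully saturated to the autoscaler.
@ray.remote(num_cpus=CPUS_PER_NODE - 1)
def run_model(model):
    # the model itself then uses up to 15 of the 16 cores
    ...
```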
I notice that this actually still triggers autoscaling briefly, so I really do need a proper solution that switches off the autoscaler and lets Ray natively fix the cluster size to the number of remote function calls. If anyone can help with this it would be really appreciated. Thanks