Is it possible for a Ray cluster to scale up from 0 workers? If I start a Kubernetes cluster with minWorkers: 0
and then submit a job, no workers are launched. If I change that to minWorkers: 1,
everything scales up and works fine. The autoscaler doesn't seem to recognize the demand when there are no workers.
======== Autoscaler status: 2021-07-13 05:56:40.044341 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.00/1.367 GiB memory
0.00/0.571 GiB object_store_memory
Demands:
(no resource demands)
ray,ray:2021-07-13 05:56:40,046 DEBUG legacy_info_string.py:24 -- Cluster status: 0 nodes
- MostDelayedHeartbeats: {'10.16.202.163': 0.14696931838989258}
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- ResourceUsage: 0.0 GiB/1.37 GiB memory, 0.0 GiB/0.57 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
ray,ray:2021-07-13 05:56:45,170 DEBUG resource_demand_scheduler.py:160 -- Cluster resources: [{'memory': 1468006400.0, 'object_store_memory': 612928463.0, 'node:10.16.202.163': 1.0}]
ray,ray:2021-07-13 05:56:45,170 DEBUG resource_demand_scheduler.py:161 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1})
ray,ray:2021-07-13 05:56:45,170 DEBUG resource_demand_scheduler.py:172 -- Placement group demands: []
ray,ray:2021-07-13 05:56:45,171 DEBUG resource_demand_scheduler.py:218 -- Resource demands: []
ray,ray:2021-07-13 05:56:45,171 DEBUG resource_demand_scheduler.py:219 -- Unfulfilled demands: []
ray,ray:2021-07-13 05:56:45,190 DEBUG resource_demand_scheduler.py:241 -- Node requests: {}
ray,ray:2021-07-13 05:56:45,218 INFO autoscaler.py:354 --
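For what it's worth, the status dump above shows `Resource demands: []`, so the autoscaler sees no work to pack onto new nodes. Scale-up from zero hinges entirely on that demand list. As a rough mental model of what the demand scheduler does (a simplified sketch, not Ray's actual implementation; `workers_to_launch` is an invented name):

```python
# Simplified sketch of demand-driven scale-up. Illustration only --
# not Ray's actual autoscaler code.

def workers_to_launch(demands, worker_cpu, current_workers, max_workers):
    """Return how many workers to add so every pending demand fits somewhere."""
    free = [worker_cpu] * current_workers  # CPUs free on existing workers
    to_add = 0
    for need in demands:  # each demand is a CPU count, e.g. 1 for num_cpus=1
        for i, cap in enumerate(free):
            if cap >= need:
                free[i] -= need  # fits on an existing/planned node
                break
        else:  # nothing can host it: plan a new worker
            if current_workers + to_add < max_workers:
                to_add += 1
                free.append(worker_cpu - need)
    return to_add

# With minWorkers: 0 there are no workers at all, yet ten num_cpus=1 tasks
# should still trigger launches (capped at maxWorkers: 6) -- provided the
# demand is actually reported, which the log above suggests it is not.
print(workers_to_launch([1] * 10, worker_cpu=1, current_workers=0, max_workers=6))
```

The point is that an empty cluster is not special in this logic; if `Resource demands` stays empty, the bug is in demand reporting, not in the scale-up math.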
will
July 14, 2021, 11:31pm
2
Got it. Can you provide the Python script you run after the cluster is launched?
Dmitri
July 15, 2021, 4:29am
3
This seems to be a recent regression – are you deploying with helm and the default images?
[GitHub issue] opened 05:06PM - 07 Jul 21 UTC · labels: P1, bug, k8s
### What is the problem?
I'm trying to avoid task scheduling on the head node by setting `.values.rayResources: { "CPU": 0 }`, but Ray keeps trying to schedule tasks on the head pod, failing due to lack of resources and thereby getting stuck without scaling up any worker pods.
- Python version: 3.8.5
- Linux: Ubuntu 18.04
- Kubernetes version: 1.18.14 (AKS)
### Reproduction (REQUIRED)
- Install k8s ray cluster using helm
- values.yaml
```
# RayCluster settings:
image: rayproject/ray:nightly-py38
headPodType: rayHeadType
podTypes:
  rayHeadType:
    minWorkers: 0
    maxWorkers: 0
    CPU: 1
    memory: 512Mi
    GPU: 0
    rayResources: { "CPU": 0, "GPU": 0 }
    nodeSelector: { use: development }
  rayWorkerType:
    minWorkers: 0
    maxWorkers: 6
    memory: 512Mi
    CPU: 1
    GPU: 0
    rayResources: { "GPU": 0 }
    nodeSelector: { use: development }
# Operator settings:
operatorOnly: false
clusterOnly: false
namespacedOperator: false
operatorNamespace: default
operatorImage: rayproject/ray:nightly-py38
```
- task.py
```
import time

import ray

LOCAL_PORT = 10001

@ray.remote(num_cpus=1)
def f(i):
    print(f"Try number {i}")
    time.sleep(60)

if __name__ == "__main__":
    ray.util.connect(f"127.0.0.1:{LOCAL_PORT}")
    ray.get([f.remote(i + 1) for i in range(10)])
```
- task log
```
The actor or task with ID ehd9rsa79fe5783b8bbc6e3fba23srd094a7q98c7axec7 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
```
- [x] I have verified my script runs in a clean environment and reproduces the issue.
- [x] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).
```
import time

import ray

LOCAL_PORT = 10001

@ray.remote
def do_some_work(x):
    print('doing some work')
    time.sleep(1)  # Replace this with the work you need to do.
    return x * x

def main():
    start = time.time()
    results = [ray.get(do_some_work.remote(x)) for x in range(4)]
    print("duration =", time.time() - start)
    print("results = ", results)

if __name__ == '__main__':
    ray.client(f'127.0.0.1:{LOCAL_PORT}').connect()
    main()
```
I am deploying with helm and the head node is using the default image. The worker nodes use a custom image derived from rayproject/ray:latest-py37-cpu.
Dmitri
July 15, 2021, 2:23pm
6
Got it.
Would you mind pasting the values.yaml to make sure we’re not missing anything?
```
podTypes:
  rayHeadType:
    memory: 2000Mi
    nodeSelector:
      role: head
    rayResources:
      CPU: 0
  rayWorkerType:
    CPU: 62
    maxWorkers: 800
    memory: 244000Mi
    minWorkers: 1
    nodeSelector:
      role: worker
```
Dmitri
July 15, 2021, 2:44pm
8
Thanks! The 0 CPU annotation is an important detail.
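One way to see why the 0 CPU annotation matters: `rayResources: { CPU: 0 }` overrides what the head advertises to Ray, so a `num_cpus=1` task can never be placed on the head and must surface as demand for a worker type. A toy feasibility check along those lines (illustrative only; `advertised_cpus` is an invented helper, not a Ray API):

```python
# Toy feasibility check: which pod types can ever host a given task?
# Illustrative only -- not the autoscaler's real code.

def advertised_cpus(pod_type):
    """rayResources overrides the pod's physical CPU count."""
    return pod_type.get("rayResources", {}).get("CPU", pod_type["CPU"])

pod_types = {
    "rayHeadType":   {"CPU": 1, "rayResources": {"CPU": 0}},
    "rayWorkerType": {"CPU": 1},
}

task_cpus = 1
feasible = [name for name, pt in pod_types.items()
            if advertised_cpus(pt) >= task_cpus]
print(feasible)  # only the worker type can host the task
```

If a bug causes the override to be ignored (or the resulting demand to be dropped), the head looks schedulable, the task queues there, and no worker is ever requested, which matches the symptoms in the issue above.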