Scale up from 0

Is it possible for a Ray cluster to scale up from 0 workers? If I start a Kubernetes cluster with minWorkers: 0 and submit a job, no workers are launched. If I change that to minWorkers: 1, everything scales up and works fine. The autoscaler doesn't seem to recognize the demand when there are no workers.

======== Autoscaler status: 2021-07-13 05:56:40.044341 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.00/1.367 GiB memory
 0.00/0.571 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2021-07-13 05:56:40,046	DEBUG legacy_info_string.py:24 -- Cluster status: 0 nodes
 - MostDelayedHeartbeats: {'10.16.202.163': 0.14696931838989258}
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - ResourceUsage: 0.0 GiB/1.37 GiB memory, 0.0 GiB/0.57 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
ray,ray:2021-07-13 05:56:45,170	DEBUG resource_demand_scheduler.py:160 -- Cluster resources: [{'memory': 1468006400.0, 'object_store_memory': 612928463.0, 'node:10.16.202.163': 1.0}]
ray,ray:2021-07-13 05:56:45,170	DEBUG resource_demand_scheduler.py:161 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1})
ray,ray:2021-07-13 05:56:45,170	DEBUG resource_demand_scheduler.py:172 -- Placement group demands: []
ray,ray:2021-07-13 05:56:45,171	DEBUG resource_demand_scheduler.py:218 -- Resource demands: []
ray,ray:2021-07-13 05:56:45,171	DEBUG resource_demand_scheduler.py:219 -- Unfulfilled demands: []
ray,ray:2021-07-13 05:56:45,190	DEBUG resource_demand_scheduler.py:241 -- Node requests: {}
ray,ray:2021-07-13 05:56:45,218	INFO autoscaler.py:354 -- 

Got it. Can you provide the Python script you run after the cluster is launched?

This seems to be a recent regression. Are you deploying with helm and the default images?

import time

import ray

LOCAL_PORT = 10001


@ray.remote
def do_some_work(x):
    print('doing some work')
    time.sleep(1)  # Replace this with the work you need to do.
    return x * x


def main():
    start = time.time()
    results = [ray.get(do_some_work.remote(x)) for x in range(4)]
    print("duration =", time.time() - start)
    print("results = ", results)


if __name__ == '__main__':
    ray.client(f'127.0.0.1:{LOCAL_PORT}').connect()
    main()

I am deploying with helm; the head node uses the default image, and the worker nodes use a custom image derived from rayproject/ray:latest-py37-cpu.

Got it.
Would you mind pasting the values.yaml to make sure we’re not missing anything?

podTypes:
  rayHeadType:
    memory: 2000Mi
    nodeSelector:
      role: head
    rayResources:
      CPU: 0
  rayWorkerType:
    CPU: 62
    maxWorkers: 800
    memory: 244000Mi
    minWorkers: 1
    nodeSelector:
      role: worker

Thanks! The CPU: 0 annotation on the head node is an important detail.