Cluster usage is not 100% but rather 57%

Hi,

I have successfully deployed a cluster on GCP. The initial setup was 1 head and 6 workers (each with 8 CPUs), and through ray status I was able to observe that all of the CPUs were utilized. When I doubled the number of workers, I noticed that only 60 of the 104 CPUs (1 head + 12 workers at 8 CPUs each; 60/104 ≈ 57%) are utilized. All of the nodes (head + 12 workers) show up as healthy.

I am parallelizing a for loop (N=60 iterations), so I was expecting each worker to be assigned an iteration (as was the case when I had 6 workers). Unfortunately, the dashboard is not connecting, so I am not sure what is happening.
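
For context, the loop is parallelized with plain Ray tasks along these lines (a minimal sketch; run_backtest and its body are placeholders for my actual per-iteration work):

```python
import ray

# Connect to the already-running cluster from the head node.
ray.init(address="auto")

@ray.remote  # each task requests the default of 1 CPU
def run_backtest(i):
    # placeholder for the real per-iteration work
    return i

# N = 60 iterations, submitted as 60 independent tasks and gathered at the end.
futures = [run_backtest.remote(i) for i in range(60)]
results = ray.get(futures)
```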

One thing to note is that the GCP CPU utilization rate shows around 70-80% (not the 100% shown previously), so I don't know whether the nodes are actually being used but just not showing up in the autoscaler report.
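
For what it's worth, the totals can also be cross-checked from the driver; this sketch just prints what Ray itself has registered versus what is currently free:

```python
import ray

ray.init(address="auto")

# Total resources registered with the cluster (should report 104 CPUs here).
print(ray.cluster_resources())

# Resources not currently claimed by running tasks or actors.
print(ray.available_resources())
```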

Any ideas/suggestions as to why the additional nodes are not being utilized?


# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# The maximum number of worker nodes to launch in addition to the head
# node. min_workers defaults to 0.
max_workers: 12


# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: ####

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    
    
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 8}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu

            # Additional options can be found in the compute docs at
            # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

            # If the network interface is specified as below in both head and worker
            # nodes, the manual network config is used.  Otherwise an existing subnet is
            # used.  To use a shared subnet, ask the subnet owner to grant permission
            # for 'compute.subnetworks.use' to the ray autoscaler account...
            # networkInterfaces:
            #   - kind: compute#networkInterface
            #     subnetwork: path/to/subnet
            #     aliasIpRanges: []
    ray_worker_small:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 12
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 12
        # The resources provided by this node type.
        resources: {"CPU": 8}
        # Provider-specific config for worker nodes of this type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            # Uncomment the scheduling block below to run workers on preemptible
            # instances; it is left commented out here so workers run on-demand.
#             scheduling:
#               - preemptible: true
            # Un-Comment this to launch workers with the Service Account of the Head Node
            # serviceAccounts:
            # - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
            #   scopes:
            #   - https://www.googleapis.com/auth/cloud-platform

    # Additional options can be found in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

worker_default_node_type: ray_worker_small

setup_commands:
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" tag (e.g. "rayproject/ray-ml:nightly-gpu") or point the pip install below at a nightly wheel:
    - pip install -U ray[default] 
    - pip install fastparquet ta cvxpy lightgbm tensorflow==2.3.1


file_mounts: {
   "./BackTest_Output/" : "./BackTest_Output/",
   "./Universe/": "./Universe/",
   "./Quandl/" : "./Quandl/",
   "./Backtest_ray/" : "./Backtest_ray/"   
}



# # Command to start ray on the head node. You don't need to change this.
# head_start_ray_commands:
#     - ray stop
#     - >-
#       ray start
#       --head
#       --port=6379
#       --object-manager-port=8076
#       --autoscaling-config=~/ray_bootstrap_config.yaml

# # Command to start ray on worker nodes. You don't need to change this.
# worker_start_ray_commands:
#     - ray stop
#     - >-
#       ray start
#       --address=$RAY_HEAD_IP:6379
#       --object-manager-port=8076