Hi,
I have successfully deployed a cluster on GCP. The initial setup was 1 head node and 6 workers (each with 8 CPUs), and through ray status I was able to observe that all of the CPUs were utilized. When I doubled the number of workers, I noticed that only 60/104 CPUs are utilized. All of the nodes (head + 12 workers) show up as healthy.
I am parallelizing a for loop (N=60 iterations), so I was expecting each worker to get assigned one iteration (as was the case when I had 6 workers). Unfortunately, the dashboard is not connecting, so I am not sure what is happening.
One thing to note is that the GCP CPU utilization rate shows up at around 70-80% (versus 100% previously), so I don't know if the nodes are actually being used but just not showing up in the autoscaler report.
Any ideas/suggestions as to why the additional nodes are not being utilized?
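For context, the loop is parallelized roughly like the sketch below (simplified; run_iteration and its argument stand in for my actual backtest code):

import ray

ray.init(address="auto")  # connect to the running cluster from the head node

# Each remote task requests 1 CPU (Ray's default), so N in-flight tasks occupy N CPU slots.
@ray.remote(num_cpus=1)
def run_iteration(i):
    # placeholder for the real per-iteration work
    return i

N = 60
futures = [run_iteration.remote(i) for i in range(N)]  # submit all 60 tasks
results = ray.get(futures)                              # block until they all finish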
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# The maximum number of worker nodes to launch in addition to the head
# node. min_workers defaults to 0.
max_workers: 12

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: ####

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 8}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            # Additional options can be found in the compute docs at
            # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
            # If the network interface is specified as below in both head and worker
            # nodes, the manual network config is used. Otherwise an existing subnet is
            # used. To use a shared subnet, ask the subnet owner to grant permission
            # for 'compute.subnetworks.use' to the ray autoscaler account...
            # networkInterfaces:
            #   - kind: compute#networkInterface
            #     subnetwork: path/to/subnet
            #     aliasIpRanges: []
    ray_worker_small:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 12
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 12
        # The resources provided by this node type.
        resources: {"CPU": 8}
        # Provider-specific config for worker nodes, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            # scheduling:
            #   - preemptible: true
            # Uncomment this to launch workers with the Service Account of the Head Node
            # serviceAccounts:
            #   - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
            #     scopes:
            #       - https://www.googleapis.com/auth/cloud-platform
            # Additional options can be found in the compute docs at
            # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default
worker_default_node_type: ray_worker_small

setup_commands:
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject
    # docker image that has the "nightly" tag (e.g. "rayproject/ray-ml:nightly-gpu") or
    # uncomment the following line:
    - pip install -U ray[default]
    - pip install fastparquet ta cvxpy lightgbm tensorflow==2.3.1

file_mounts: {
    "./BackTest_Output/": "./BackTest_Output/",
    "./Universe/": "./Universe/",
    "./Quandl/": "./Quandl/",
    "./Backtest_ray/": "./Backtest_ray/"
}
# # Command to start ray on the head node. You don't need to change this.
# head_start_ray_commands:
# - ray stop
# - >-
# ray start
# --head
# --port=6379
# --object-manager-port=8076
# --autoscaling-config=~/ray_bootstrap_config.yaml
# # Command to start ray on worker nodes. You don't need to change this.
# worker_start_ray_commands:
# - ray stop
# - >-
# ray start
# --address=$RAY_HEAD_IP:6379
# --object-manager-port=8076
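In case it is useful, this is roughly how I inspect what the cluster reports from a Python driver, in addition to running ray status on the head node (just a diagnostic sketch):

import ray

ray.init(address="auto")
print(ray.cluster_resources())    # total resources registered across all nodes
print(ray.available_resources())  # resources currently unclaimed by tasks/actors
print(ray.nodes())                # per-node entries, including whether each node is alive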