Ray cluster's worker node is pending

plum9 · February 8, 2022, 12:43am

Hi,
I’m trying to spin up a Ray cluster on AWS EC2 using a YAML file. Running `ray up config.yaml` successfully creates two EC2 instances, but when I try to submit a job or look at `ray status`, it only shows one node and

Pending:
 <ip>: ray.worker.default, uninitialized

with no failures. I’ve removed the security group IDs and file paths from the config file below:


```yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: ray-test

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 2.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 10

# Cloud-provider specific configuration.
provider:
   type: aws
   region: us-east-1
   availability_zone: us-east-1a,us-east-1b


auth:
    ssh_user: ubuntu
    ssh_private_key: /path/to/key/.pem

available_node_types:
    ray.head.default:
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        #resources: { "CPU": 4, "GPU": 1}
        node_config:
            # IamInstanceProfile:
            #     Name: "ray-autoscaler-v1"
            InstanceType: p3.2xlarge #g4dn.2xlarge  p3.2xlarge
            ImageId: ami-029510cec6d69f121 #ami-029510cec6d69f121 # Deep Learning AMI (Ubuntu) Version 30
            KeyName: <key-name>
            SecurityGroupIds: [sg1, sg2, sg3] #See above for group IDS
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 200 # GB
    ray.worker.default:
        min_workers: 1
        max_workers: 2
        resources: {}
        node_config:
            # IamInstanceProfile:
            #     Name: "ray-autoscaler-v1"
            InstanceType: p3.2xlarge #g4dn.2xlarge  p3.2xlarge
            ImageId: ami-029510cec6d69f121 #ami-029510cec6d69f121 # Deep Learning AMI (Ubuntu) Version 30
            KeyName: <key-name>
            #InstanceMarketOptions:
            #    MarketType: spot
            SecurityGroupIds: [<sg1>]


head_node_type: ray.head.default
file_mounts: {
#    "/path2/on/remote/machine": "/path2/on/local/machine", #/home/ray
}


cluster_synced_files: []
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip install -U ninja
    - pip install -U lpips
    - pip install tblib
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}
```

plum9 · February 8, 2022, 12:45am

`ray.cluster_resources()` returns {'memory': 37903803188.0, 'node:<ip>': 1.0, 'GPU': 1.0, 'accelerator_type:V100': 1.0, 'object_store_memory': 18951901593.0, 'CPU': 8.0}

It should show 2 nodes instead.
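
For a more direct node count than reading the resource dict, `ray.nodes()` lists every node the cluster knows about, with an `Alive` flag and a `NodeManagerAddress` per entry. The following is only a rough sketch: the helper names `alive_node_addresses` and `report` are made up for illustration, and the sample data is a stand-in, not output from this cluster.

```python
def alive_node_addresses(nodes):
    """Return the addresses of nodes reporting Alive, given ray.nodes()-style dicts."""
    return [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]


def report():
    """Hypothetical helper: connect to the running cluster and print its alive nodes."""
    # Imported here so the rest of the sketch runs without Ray installed.
    import ray

    ray.init(address="auto")
    addresses = alive_node_addresses(ray.nodes())
    print(f"{len(addresses)} alive node(s): {addresses}")


# Stand-in data with only the two fields used above; real entries carry more.
sample = [
    {"NodeManagerAddress": "10.0.0.1", "Alive": True},
    {"NodeManagerAddress": "10.0.0.2", "Alive": False},
]
print(alive_node_addresses(sample))  # prints ['10.0.0.1']
```

Calling `report()` on the head node should list two addresses once the worker has actually joined.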

plum9 · February 8, 2022, 3:48am

I figured out that the problem was the private key I specified: it was not accessible on the worker nodes. I commented out those lines and the worker spun up (ref: https://github.com/ray-project/ray/issues/18529)
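
To see which key-related settings a config carries before commenting any of them out, something like the sketch below might help. It assumes the config has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`), and the function names `key_settings` and `missing_private_key` are invented for illustration. It only checks whether the private key file exists on the machine where it runs, which is not the same as checking access from the worker nodes.

```python
import os


def key_settings(config):
    """Collect SSH key settings from an already-parsed cluster config dict."""
    found = {}
    private_key = config.get("auth", {}).get("ssh_private_key")
    if private_key is not None:
        found["auth.ssh_private_key"] = private_key
    for name, node_type in config.get("available_node_types", {}).items():
        key_name = node_type.get("node_config", {}).get("KeyName")
        if key_name is not None:
            found[f"available_node_types.{name}.node_config.KeyName"] = key_name
    return found


def missing_private_key(config):
    """True if auth.ssh_private_key is set but is not a file on this machine."""
    path = config.get("auth", {}).get("ssh_private_key")
    if path is None:
        return False
    return not os.path.isfile(os.path.expanduser(path))


# Trimmed-down stand-in for the config posted above (placeholders kept as-is).
example = {
    "auth": {"ssh_user": "ubuntu", "ssh_private_key": "/path/to/key/.pem"},
    "available_node_types": {
        "ray.head.default": {"node_config": {"KeyName": "<key-name>"}},
        "ray.worker.default": {"node_config": {"KeyName": "<key-name>"}},
    },
}

for setting, value in key_settings(example).items():
    print(f"{setting} = {value}")
print("private key missing locally:", missing_private_key(example))
```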

But after a few minutes of sitting idle, the worker goes back to uninitialized.
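
To pin down when the worker drops out, one option is to poll the node list and note each time the alive count changes. The config above sets `idle_timeout_minutes: 10`, so the timing might show whether the two are related, though that link is only a guess. This is a minimal sketch: `watch_alive_count` is a made-up helper, and the snapshots below are stand-in data rather than real cluster output.

```python
import time


def watch_alive_count(fetch_nodes, rounds, interval_s=30.0, sleep=time.sleep):
    """Poll fetch_nodes() and record (elapsed_seconds, alive_count) on each change."""
    changes = []
    last = None
    start = time.monotonic()
    for i in range(rounds):
        alive = sum(1 for n in fetch_nodes() if n.get("Alive"))
        if alive != last:
            changes.append((round(time.monotonic() - start), alive))
            last = alive
        if i < rounds - 1:
            sleep(interval_s)
    return changes


# Stand-in snapshots: two alive nodes, then one of them drops out.
snapshots = iter([
    [{"Alive": True}, {"Alive": True}],
    [{"Alive": True}, {"Alive": True}],
    [{"Alive": True}, {"Alive": False}],
])
changes = watch_alive_count(lambda: next(snapshots), rounds=3, sleep=lambda s: None)
print([count for _, count in changes])  # prints [2, 1]
```

On the head node, passing `ray.nodes` as `fetch_nodes` after `ray.init(address="auto")` would poll the real cluster instead of the stand-in snapshots.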
