Starting up a Ray cluster on an AWS EC2 instance

Hello all,

I am completely new to this, and I got stuck starting up the Ray cluster.

Here is my .yaml configuration:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker: {}

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes will be launched in the first listed availability zone and will
    # be tried in the subsequent availability zones if launching fails.
    availability_zone: us-east-1f
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True # If not present, the default is True.
    # security_group:
    #     GroupName: default

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu #ec2-user
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem
    ssh_private_key: C:/Users/Stefan/.aws/yt_dl_ray_keypair.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: t2.nano
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 256
            # Additional options in the boto docs.
            KeyName: yt_dl_ray_keypair
    ray.worker.default:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: t2.nano
            # Run workers on spot by default. Comment this out to use on-demand.
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.
            KeyName: yt_dl_ray_keypair
# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude: []

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter: []

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - >-
        (stat $HOME/anaconda3/envs/tensorflow2_p38/ &> /dev/null &&
        echo 'export PATH="$HOME/anaconda3/envs/tensorflow2_p38/bin:$PATH"' >> ~/.bashrc) || true
    - which ray || pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install 'boto3>=1.4.8'  # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

It successfully starts an EC2 instance on AWS and passes initialization, but in the terminal I get the following error:

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 3.236.16.235
getsockname failed: Not a socket
Timeout, server 3.236.16.235 not responding.
    SSH still not available (SSH command failed.), retrying in 5 seconds.

Does anyone see what I am doing wrong?

Thank you!

I took your YAML file, updated the SSH key, and gave it a shot.

In my case I can see:

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 52.37.104.58
ssh: connect to host 52.37.104.58 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 52.37.104.58 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 52.37.104.58 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 52.37.104.58 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 52.37.104.58 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 52.37.104.58 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '52.37.104.58' (ED25519) to the list of known hosts.
 03:27:07 up 1 min,  1 user,  load average: 2.18, 0.64, 0.22
Shared connection to 52.37.104.58 closed.
    Success.
  Updating cluster configuration. [hash=577fb2637b455b6ec60b56516bf2a2e543686bdb]
  New status: syncing-files
  [2/7] Processing file mounts

It does take a while, since the instance needs some time to become ready. After that, it failed with the following error:

ValueError: Attempting to cap object store memory usage at 35806003 bytes, but the minimum allowed is 78643200 bytes.

Then changing the instance type from t2.nano to t2.medium fixed the issue (the t2.nano only has 512 MB of RAM, so the share Ray reserves for the object store falls below the 78643200-byte minimum), and I was able to run python -c 'import ray; ray.init()' inside the head node.
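
If you want to stay on a very small instance, I believe you can also set the object store size explicitly instead of letting Ray reserve a fraction of RAM, as long as it is at least 78643200 bytes (75 MiB). In the cluster YAML that would mean adding --object-store-memory to the ray start commands; for a plain local ray.init() it is a keyword argument. A minimal sketch, assuming the machine actually has that much memory to spare:

import ray

# Explicitly size the object store just above Ray's 78643200-byte (75 MiB) minimum.
# This assumes the node has enough free memory to back it.
ray.init(object_store_memory=100 * 1024 * 1024)
print(ray.cluster_resources())  # the object_store_memory entry should roughly reflect the value set above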

How many connection attempts does it make in your case? I think your file (apart from the t2.nano) looks fine, so maybe there is an issue somewhere else in your environment.

Thanks, @fordaz, for helping me out with this. When I ran ray up <file-name>.yaml on WSL I got the same error as you, and after changing the instance type as you did I managed to get the cluster up and running.

Currently I have a head node (and worker nodes) up and running on AWS, launched from WSL on Windows. The problem I am facing now is how to access them from my local Windows machine. I tried ray.init(), but it initializes a new local cluster. I also tried running ray start --head with the ports set in the YAML file, but that starts a new cluster as well (checked with ray.nodes(); a new node is created, not the one from the YAML file). I guess I am missing some port forwarding from WSL to the local machine. Maybe I have just put myself down a rabbit hole and it is possible to avoid this with a different solution, but I am not sure how to proceed.
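
For context, my understanding is that ray.init() with no address always starts a fresh local cluster, so from my local machine I would probably need something like the Ray Client, which the head node serves on port 10001 by default, assuming that port is reachable from Windows (open security group or an SSH tunnel) and my local Ray/Python versions match the cluster's. A rough sketch of what I have in mind (the IP is a placeholder for the head node's public address):

import ray

# Placeholder address; the Ray Client server listens on port 10001 of the head node by default.
ray.init("ray://<head-node-public-ip>:10001")
print(ray.nodes())  # should list the AWS nodes, not a freshly started local one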

I am trying to make a YouTube scraper that will be parallelized with Ray on AWS EC2 instances and will store its data in AWS S3.

With the cluster up and running, I am not sure how to continue with my project, or more precisely how to get my module/package (since it won't be just one Python script) running on AWS EC2 with Ray.

Can you provide me some suggestions please? I would really appreciate your help!

I'm glad to hear your cluster is up, @Stefan_Borkovski. Here is one relatively easy way (probably not for production) to start submitting jobs to your remote cluster.

In one terminal session run:

ray dashboard <your-config-file>.yaml

At this point there is an SSH tunnel that maps port 8265 on your remote cluster to localhost:8265 on your machine.

In a second terminal session you can then submit jobs.

For a simple test:

ray job submit --address http://localhost:8265 \
-- python -c "import ray; ray.init(); print(ray.cluster_resources())"

Or for running a specific simple python script:

RAY_ADDRESS='http://localhost:8265' ray job submit \
--working-dir . -- python <your-python-script>.py

Or for running a script with some dependencies:

RAY_ADDRESS='http://localhost:8265' ray job submit \
--runtime-env-json='{"pip": ["numpy","torch","torchvision","torchaudio","matplotlib","pandas","ray[train]"]}' \
--working-dir . \
-- python <your-python-script>.py

One last thing: I've had issues when the local working directory contained large files, since all of these need to be uploaded to the remote server. You can either 1) remove the unnecessary files, or 2) download them directly within the cluster.
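
If neither of those is practical, I believe the runtime environment also accepts an excludes list of gitignore-style patterns, so large local artifacts are skipped when the working directory is uploaded (the same key should work inside --runtime-env-json). A rough sketch using the Python job submission client, with placeholder patterns and script name:

from ray.job_submission import JobSubmissionClient

# Assumes the `ray dashboard` tunnel from above is still running on localhost:8265.
client = JobSubmissionClient("http://localhost:8265")

job_id = client.submit_job(
    entrypoint="python your_script.py",  # placeholder entrypoint
    runtime_env={
        "working_dir": ".",
        # gitignore-style patterns for files that should not be uploaded (examples)
        "excludes": ["*.mp4", "data/", ".git/"],
    },
)
print(job_id)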

Hello @fordaz, thanks for the instructions, but on WSL nothing from the above connects to the existing cluster. I also tried everything from the documentation.

I will have to search for a different solution, since even running a single script without S3 as the storage backend won't help.

If you can suggest a solution with Dockerization of the whole project and S3 support, that would be helpful.

Thanks.