Should I be concerned about this message "The object store is using /tmp instead of /dev/shm"?

Hello!

I am using the autoscaler with Docker; the YAML is pasted below. When running ray up, I get the WARNING message below, and I am not sure whether I should be concerned about it. The message also doesn’t really make sense to me: it says that /dev/shm has 200GB, but also that I should pass about 200GB to the Docker container, so I am not sure what the suggested --shm-size=204.89gb does. I did try it, and it didn’t make the WARNING message go away.

2021-02-10 05:19:58,150 WARNING services.py:1662 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 200000000000 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=204.89gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

# A unique identifier for the head node and workers of this cluster.
cluster_name: jkkwon_ray

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 34

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 34

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr"
    container_name: "miami_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: ["-v", "/home/ubuntu/efs:/efs"]

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d
    cache_stopped_nodes: False

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
    ssh_private_key: miami_dev_dask_emr_key_pair.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: r5n.24xlarge
    ImageId: ami-0f92e9d2b63bc61a2
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-02876545b671b57b0"
    ]
    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 300
    KeyName: "miami_dev_dask_emr_key_pair"


# Provider-specific config for worker nodes, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
worker_nodes:
    InstanceType: r5n.24xlarge
    ImageId: ami-0f92e9d2b63bc61a2
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    KeyName: "miami_dev_dask_emr_key_pair"
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 300

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
      - aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 048211272910.dkr.ecr.us-west-2.amazonaws.com;
      - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
        sudo pkill -9 apt-get;
        sudo pkill -9 dpkg;
        sudo dpkg --configure -a;
        sudo apt-get -y install binutils;
        cd $HOME;
        git clone https://github.com/aws/efs-utils;
        cd $HOME/efs-utils;
        ./build-deb.sh;
        sudo apt-get -y install ./build/amazon-efs-utils*deb;
        cd $HOME;
        mkdir efs;
        sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-098309a3.efs.us-west-2.amazonaws.com:/ efs;
        sudo chmod 777 efs;

setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Ray natively supports a distributed shared-memory object store for large objects (objects bigger than 100KB) so that it can manage the cluster's memory usage efficiently (Memory Management — Ray v1.1.0). On Linux, Ray tries to use /dev/shm for the shared memory. If /dev/shm is too small, Ray automatically falls back to the /tmp folder, which usually performs worse than shared memory.
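One way to see how much space /dev/shm actually offers inside the container is to query the filesystem from Python. This is a minimal sketch using only the standard library; `filesystem_usage_gib` is a hypothetical helper name, not part of Ray.

```python
import os
import shutil


def filesystem_usage_gib(path):
    """Return total and free space of the filesystem holding `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {"total_gib": usage.total / gib, "free_gib": usage.free / gib}


if __name__ == "__main__":
    # Compare shared memory with the /tmp fallback location.
    for candidate in ("/dev/shm", "/tmp"):
        if os.path.exists(candidate):
            print(candidate, filesystem_usage_gib(candidate))
```

Running this inside the container (rather than on the host) is what matters here, since the container gets its own /dev/shm mount.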

The warning indicates that your Docker container doesn’t have enough shared memory, so Ray will use the /tmp folder instead. This shouldn’t cause any correctness issues, but it can hurt performance (for example, we observed a 100%+ performance regression when we used /tmp instead of /dev/shm). Passing the flag reserves more shared memory for the container.
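The two numbers in the warning are easier to compare once they are in the same unit. The sketch below assumes that Docker reads the "gb" suffix of --shm-size as binary gigabytes (1024**3 bytes); the variable names are illustrative only, and whether Ray derives its suggestion this way is not confirmed anywhere in this thread.

```python
GIB = 1024 ** 3  # assumption: "gb" in --shm-size means binary gigabytes

available_bytes = 200_000_000_000  # "/dev/shm has only 200000000000 bytes available"
suggested_gib = 204.89             # "--shm-size=204.89gb"

available_gib = available_bytes / GIB
suggested_bytes = suggested_gib * GIB
ratio = suggested_bytes / available_bytes

print(f"available: {available_gib:.2f} GiB")
print(f"suggested: {suggested_bytes / 1e9:.1f} GB")
print(f"suggested / available: {ratio:.2f}")
```

Under that assumption, the 200000000000 bytes are only about 186 GiB, and the suggested value is roughly 1.1 times larger, so the message is asking for somewhat more shared memory than is currently available rather than for the same amount.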

@ijrsvt Do you know why she’s still seeing the warning after setting the correct shared memory?

@jennakwon06
How exactly were you setting it? The best way would be something like:

docker:
    run_options:
    - --shm-size=204.89gb

Also, what version of the cluster launcher are you using?

This is my YAML file with the option set:

docker:
    image: "048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr"
    container_name: "miami_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: ["-v", "/home/ubuntu/efs:/efs", "--shm-size=204.89gb"]

Is there a separate Ray autoscaler version? If not, my Ray version is 2.0.0.dev0.

That looks like the correct way to input it! And yes, Ray version == Ray Autoscaler version.

What are the results of running something like:

import ray
ray.init(address="auto")
ray.available_resources()
ray.nodes()

So I am using 34 r5.24xlarge EC2 instances (pretty beefy; each with 768GB memory and 96 cores). The requirement is to process petabytes of data. The output is below.

ray.available_resources()

{'memory': 402099.0, 'object_store_memory': 92120.0, 'node:10.0.237.118': 1.0, 'CPU': 3336.0, 'node:10.0.203.248': 1.0, 'node:10.0.219.244': 1.0, 'node:10.0.235.227': 1.0, 'node:10.0.216.36': 1.0, 'node:10.0.198.166': 1.0, 'node:10.0.236.239': 1.0, 'node:10.0.251.63': 1.0, 'node:10.0.213.40': 1.0, 'node:10.0.197.14': 1.0, 'node:10.0.249.190': 1.0, 'node:10.0.252.166': 1.0, 'node:10.0.186.87': 1.0, 'node:10.0.213.141': 1.0, 'node:10.0.197.170': 1.0, 'node:10.0.251.190': 1.0, 'node:10.0.196.46': 1.0, 'node:10.0.172.62': 1.0, 'node:10.0.0.233': 1.0, 'node:10.0.195.29': 1.0, 'node:10.0.209.174': 1.0, 'node:10.0.194.75': 1.0, 'node:10.0.213.78': 1.0, 'node:10.0.210.241': 1.0, 'node:10.0.241.223': 1.0, 'node:10.0.220.87': 1.0, 'node:10.0.163.15': 1.0, 'node:10.0.250.150': 1.0, 'node:10.0.220.246': 1.0, 'node:10.0.237.108': 1.0, 'node:10.0.211.58': 1.0, 'node:10.0.194.11': 1.0, 'node:10.0.222.230': 1.0, 'node:10.0.221.102': 1.0, 'node:10.0.216.160': 1.0}

ray.nodes()


[{'NodeID': 'cbf1775655cfd5a5d59b0e4c323ae3b007f007428d2963da77b7b363', 'Alive': True, 'NodeManagerAddress': '10.0.210.241', 'NodeManagerHostname': 'ip-10-0-210-241', 'NodeManagerPort': 58518, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 59982, 'alive': True, 'Resources': {'CPU': 96.0, 'object_store_memory': 2632.0, 'node:10.0.210.241': 1.0, 'memory': 11494.0}}, {'NodeID': '210536ea05701759f5fb1733bf8d2141ca9288f4d39e6f47bac4a51e', 'Alive': True, 'NodeManagerAddress': '10.0.197.14', 'NodeManagerHostname': 'ip-10-0-197-14', 'NodeManagerPort': 63457, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 62208, 'alive': True, 'Resources': {'node:10.0.197.14': 1.0, 'object_store_memory': 2632.0, 'memory': 11494.0, 'CPU': 96.0}}, {'NodeID': '9a9020520e2a9fadef1c22e90150425935969347115a6947d424ddf1', 'Alive': True, 'NodeManagerAddress': '10.0.213.40', 'NodeManagerHostname': 'ip-10-0-213-40', 'NodeManagerPort': 57187, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 65141, 'alive': True, 'Resources': {'CPU': 96.0, 'node:10.0.213.40': 1.0, 'object_store_memory': 2632.0, 'memory': 11494.0}}, {'NodeID': 'ccf1c632ea3c2bbcc9886377b18ac8dfd78b73d2bbf74404cd07c9ac', 'Alive': True, 'NodeManagerAddress': '10.0.249.190', 'NodeManagerHostname': 'ip-10-0-249-190', 'NodeManagerPort': 41452, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': 
'/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 65027, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'CPU': 96.0, 'memory': 11494.0, 'node:10.0.249.190': 1.0}}, {'NodeID': '36b8726f98b197dbc8392c0e690ea81e5329ab7ba5c3983946f341b4', 'Alive': True, 'NodeManagerAddress': '10.0.213.141', 'NodeManagerHostname': 'ip-10-0-213-141', 'NodeManagerPort': 63680, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 56423, 'alive': True, 'Resources': {'CPU': 96.0, 'object_store_memory': 2632.0, 'memory': 11494.0, 'node:10.0.213.141': 1.0}}, {'NodeID': 'e20e7971d05ebc877a18d77a055785966c50deef060709980d7c573e', 'Alive': True, 'NodeManagerAddress': '10.0.252.166', 'NodeManagerHostname': 'ip-10-0-252-166', 'NodeManagerPort': 61391, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 51786, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'memory': 11494.0, 'node:10.0.252.166': 1.0, 'CPU': 96.0}}, {'NodeID': '2a6b428ca6a73eeb7bd81fa9540bf9d1bff744b0102310fff06a003d', 'Alive': True, 'NodeManagerAddress': '10.0.186.87', 'NodeManagerHostname': 'ip-10-0-186-87', 'NodeManagerPort': 45721, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 64649, 'alive': True, 'Resources': {'node:10.0.186.87': 1.0, 'memory': 11494.0, 'CPU': 96.0, 'object_store_memory': 2632.0}}, {'NodeID': 'a65e4a95110b3fd96073b30dda22634485d29ec08daaa7349ed8a8a9', 'Alive': True, 'NodeManagerAddress': '10.0.251.190', 
'NodeManagerHostname': 'ip-10-0-251-190', 'NodeManagerPort': 60401, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 58081, 'alive': True, 'Resources': {'node:10.0.251.190': 1.0, 'object_store_memory': 2632.0, 'CPU': 96.0, 'memory': 11494.0}}, {'NodeID': 'f73f85f686be509562b235c6c2e31ea27dc9e202aec107e079bf130d', 'Alive': True, 'NodeManagerAddress': '10.0.196.46', 'NodeManagerHostname': 'ip-10-0-196-46', 'NodeManagerPort': 60196, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 48833, 'alive': True, 'Resources': {'node:10.0.196.46': 1.0, 'CPU': 96.0, 'memory': 11494.0, 'object_store_memory': 2632.0}}, {'NodeID': '6bf0d763ca13d5230cc6cbb02ea300fcb22a7bca9916f27b1f4a74ce', 'Alive': True, 'NodeManagerAddress': '10.0.197.170', 'NodeManagerHostname': 'ip-10-0-197-170', 'NodeManagerPort': 52487, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 59670, 'alive': True, 'Resources': {'memory': 11494.0, 'object_store_memory': 2632.0, 'node:10.0.197.170': 1.0, 'CPU': 96.0}}, {'NodeID': '64ae3ff487cd618ddadcac4ec70d3bf103331eb1a726b28b7ddfbb54', 'Alive': True, 'NodeManagerAddress': '10.0.172.62', 'NodeManagerHostname': 'ip-10-0-172-62', 'NodeManagerPort': 57966, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 53109, 'alive': True, 'Resources': 
{'object_store_memory': 2632.0, 'memory': 11494.0, 'CPU': 96.0, 'node:10.0.172.62': 1.0}}, {'NodeID': '94f42eb1c1d045fcbe1f8b19f96267d7afe19e4e5cb93a0104262750', 'Alive': True, 'NodeManagerAddress': '10.0.203.248', 'NodeManagerHostname': 'ip-10-0-203-248', 'NodeManagerPort': 61440, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 44975, 'alive': True, 'Resources': {'node:10.0.203.248': 1.0, 'memory': 11494.0, 'CPU': 96.0, 'object_store_memory': 2632.0}}, {'NodeID': '59ffb0c1264facfa0ed4576a1bc3efa9d6dad0e600c7aef7b2c6812c', 'Alive': True, 'NodeManagerAddress': '10.0.237.118', 'NodeManagerHostname': 'ip-10-0-237-118', 'NodeManagerPort': 63555, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 57761, 'alive': True, 'Resources': {'node:10.0.237.118': 1.0, 'CPU': 96.0, 'object_store_memory': 2632.0, 'memory': 11494.0}}, {'NodeID': 'a00d7b057bb66dac8270fbab470fc2e3b60679d13e9a1ef07d39df06', 'Alive': True, 'NodeManagerAddress': '10.0.216.36', 'NodeManagerHostname': 'ip-10-0-216-36', 'NodeManagerPort': 57480, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 41108, 'alive': True, 'Resources': {'node:10.0.216.36': 1.0, 'memory': 11494.0, 'CPU': 96.0, 'object_store_memory': 2632.0}}, {'NodeID': 'c2238601daa0413edec4c209d8e9cac7c9b85905cfbefdc42d1631f5', 'Alive': True, 'NodeManagerAddress': '10.0.235.227', 'NodeManagerHostname': 'ip-10-0-235-227', 'NodeManagerPort': 64381, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': 
'/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 46653, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'memory': 11494.0, 'node:10.0.235.227': 1.0, 'CPU': 96.0}}, {'NodeID': '90d36b1365781cf72ec8d8f920223b85e0b85d9e1c91e299e00faea9', 'Alive': True, 'NodeManagerAddress': '10.0.219.244', 'NodeManagerHostname': 'ip-10-0-219-244', 'NodeManagerPort': 61465, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 59357, 'alive': True, 'Resources': {'memory': 11494.0, 'object_store_memory': 2632.0, 'node:10.0.219.244': 1.0, 'CPU': 96.0}}, {'NodeID': '243f2531be3643f251c73e4ac8b0cbf6052aa12910dd052247bbef39', 'Alive': True, 'NodeManagerAddress': '10.0.198.166', 'NodeManagerHostname': 'ip-10-0-198-166', 'NodeManagerPort': 61830, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 64140, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'CPU': 96.0, 'node:10.0.198.166': 1.0, 'memory': 11494.0}}, {'NodeID': '8afd925fd6b26c7e382854f16ceae4b75c9a6e35e6c6ffe30a4cab59', 'Alive': True, 'NodeManagerAddress': '10.0.236.239', 'NodeManagerHostname': 'ip-10-0-236-239', 'NodeManagerPort': 63863, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 62039, 'alive': True, 'Resources': {'memory': 11494.0, 'CPU': 96.0, 'node:10.0.236.239': 1.0, 'object_store_memory': 2632.0}}, {'NodeID': 
'58373ba7d37b28f92d354afd61e301b2cdafec00595871f610d7bee7', 'Alive': True, 'NodeManagerAddress': '10.0.251.63', 'NodeManagerHostname': 'ip-10-0-251-63', 'NodeManagerPort': 63739, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 55297, 'alive': True, 'Resources': {'memory': 11494.0, 'object_store_memory': 2632.0, 'node:10.0.251.63': 1.0, 'CPU': 96.0}}, {'NodeID': '127ec4a2e3852292d4fdda40aa84c32fb442da6efe94d82e050ac7fb', 'Alive': True, 'NodeManagerAddress': '10.0.163.15', 'NodeManagerHostname': 'ip-10-0-163-15', 'NodeManagerPort': 51164, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 35572, 'alive': True, 'Resources': {'CPU': 96.0, 'memory': 11494.0, 'node:10.0.163.15': 1.0, 'object_store_memory': 2632.0}}, {'NodeID': 'd8a01cebffaf49f1dfa3f25f98c270ddb28a636b8198821b740d449f', 'Alive': True, 'NodeManagerAddress': '10.0.220.87', 'NodeManagerHostname': 'ip-10-0-220-87', 'NodeManagerPort': 64764, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 65083, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'CPU': 96.0, 'node:10.0.220.87': 1.0, 'memory': 11494.0}}, {'NodeID': 'd9f629fbbfe9563894a447fd2e02dd58f600a3f6a292a048ebca3075', 'Alive': True, 'NodeManagerAddress': '10.0.237.108', 'NodeManagerHostname': 'ip-10-0-237-108', 'NodeManagerPort': 64911, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': 
'/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 47240, 'alive': True, 'Resources': {'memory': 11494.0, 'node:10.0.237.108': 1.0, 'object_store_memory': 2632.0, 'CPU': 96.0}}, {'NodeID': '9715717dea169ab98af586bc8fed4bc7a994f01573bba0f8d5e215cb', 'Alive': True, 'NodeManagerAddress': '10.0.250.150', 'NodeManagerHostname': 'ip-10-0-250-150', 'NodeManagerPort': 45665, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 62854, 'alive': True, 'Resources': {'CPU': 96.0, 'node:10.0.250.150': 1.0, 'memory': 11494.0, 'object_store_memory': 2632.0}}, {'NodeID': 'cd4447454d1ba9ebb34057175b2fc1f030ca06fd6453ab44de67eae1', 'Alive': True, 'NodeManagerAddress': '10.0.220.246', 'NodeManagerHostname': 'ip-10-0-220-246', 'NodeManagerPort': 39424, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 44045, 'alive': True, 'Resources': {'node:10.0.220.246': 1.0, 'CPU': 96.0, 'object_store_memory': 2632.0, 'memory': 11494.0}}, {'NodeID': '69727bd66b0a2a9b18ec39cde8ddfa227412634f0a7244cac797b07a', 'Alive': True, 'NodeManagerAddress': '10.0.222.230', 'NodeManagerHostname': 'ip-10-0-222-230', 'NodeManagerPort': 55977, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 65269, 'alive': True, 'Resources': {'CPU': 96.0, 'node:10.0.222.230': 1.0, 'memory': 11494.0, 'object_store_memory': 2632.0}}, {'NodeID': '14dfdbd10a0c2d3324f98cb1a8cf3c96235043690510b54aa4448bf3', 'Alive': True, 'NodeManagerAddress': '10.0.211.58', 
'NodeManagerHostname': 'ip-10-0-211-58', 'NodeManagerPort': 52430, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 52835, 'alive': True, 'Resources': {'node:10.0.211.58': 1.0, 'CPU': 96.0, 'object_store_memory': 2632.0, 'memory': 11494.0}}, {'NodeID': '617bac1b4619c19b2c58ca86b8b3cdc066b463b75b70ccd6a0416f81', 'Alive': True, 'NodeManagerAddress': '10.0.194.11', 'NodeManagerHostname': 'ip-10-0-194-11', 'NodeManagerPort': 50406, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 60847, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'node:10.0.194.11': 1.0, 'CPU': 96.0, 'memory': 11494.0}}, {'NodeID': '09456ed25dc56623f14c22990e9e07fea1613f1fc8c36e9b5ff9e09a', 'Alive': True, 'NodeManagerAddress': '10.0.216.160', 'NodeManagerHostname': 'ip-10-0-216-160', 'NodeManagerPort': 62506, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 59842, 'alive': True, 'Resources': {'node:10.0.216.160': 1.0, 'object_store_memory': 2632.0, 'memory': 11494.0, 'CPU': 96.0}}, {'NodeID': '175f87acf7f90727e414e1f88f4859ebc1ef8a0d98f87ef4ddcbb93b', 'Alive': True, 'NodeManagerAddress': '10.0.221.102', 'NodeManagerHostname': 'ip-10-0-221-102', 'NodeManagerPort': 53819, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 59901, 'alive': True, 'Resources': 
{'object_store_memory': 2632.0, 'node:10.0.221.102': 1.0, 'memory': 11494.0, 'CPU': 96.0}}, {'NodeID': 'a892eeb8ad0d49265bc33e3993ba8dfdf66f6e688fc1050dfaa61203', 'Alive': True, 'NodeManagerAddress': '10.0.0.233', 'NodeManagerHostname': 'ip-10-0-0-233', 'NodeManagerPort': 61312, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 59870, 'alive': True, 'Resources': {'node:10.0.0.233': 1.0, 'memory': 11303.0, 'object_store_memory': 2632.0, 'CPU': 96.0}}, {'NodeID': '74e0bed34d140c1ebeebe63cc3e80dbc6249fd1a826cdd97ca02b6d4', 'Alive': True, 'NodeManagerAddress': '10.0.209.174', 'NodeManagerHostname': 'ip-10-0-209-174', 'NodeManagerPort': 60183, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 64626, 'alive': True, 'Resources': {'memory': 11494.0, 'CPU': 96.0, 'node:10.0.209.174': 1.0, 'object_store_memory': 2632.0}}, {'NodeID': '1eb6f8b1a30665a9efab8bc8c8a14c24fcc09ad326546a4dd387ecc8', 'Alive': True, 'NodeManagerAddress': '10.0.195.29', 'NodeManagerHostname': 'ip-10-0-195-29', 'NodeManagerPort': 63885, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 57645, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'memory': 11494.0, 'node:10.0.195.29': 1.0, 'CPU': 96.0}}, {'NodeID': '07bec76dae9fab084b5bff0067651ca7b95e25f6e75f98dfd6e89401', 'Alive': True, 'NodeManagerAddress': '10.0.213.78', 'NodeManagerHostname': 'ip-10-0-213-78', 'NodeManagerPort': 56853, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': 
'/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 54194, 'alive': True, 'Resources': {'memory': 11494.0, 'object_store_memory': 2632.0, 'node:10.0.213.78': 1.0, 'CPU': 96.0}}, {'NodeID': 'eed4ef33c5d5ffd27583b781cb416c7b2970efe0d4b8d8da89436036', 'Alive': True, 'NodeManagerAddress': '10.0.194.75', 'NodeManagerHostname': 'ip-10-0-194-75', 'NodeManagerPort': 53817, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 47826, 'alive': True, 'Resources': {'CPU': 96.0, 'object_store_memory': 2632.0, 'memory': 11494.0, 'node:10.0.194.75': 1.0}}, {'NodeID': '233968ea9c21f80ce5b6aadec2b515d7148161776a6589572bc0ece9', 'Alive': True, 'NodeManagerAddress': '10.0.241.223', 'NodeManagerHostname': 'ip-10-0-241-223', 'NodeManagerPort': 64550, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-11_15-25-08_269647_108/sockets/raylet', 'MetricsExportPort': 60861, 'alive': True, 'Resources': {'object_store_memory': 2632.0, 'memory': 11494.0, 'node:10.0.241.223': 1.0, 'CPU': 96.0}}]
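Since the ray.nodes() output above is long, a small helper can total one resource across nodes. This is only a sketch: `total_resource` is a hypothetical helper, and the two sample entries are abbreviated copies of entries from the output above, so it runs without a cluster.

```python
def total_resource(nodes, name):
    """Sum one resource over all alive nodes in a ray.nodes()-style list."""
    return sum(
        node["Resources"].get(name, 0.0)
        for node in nodes
        if node.get("Alive")
    )


# Two abbreviated entries copied from the output above (other keys omitted).
sample_nodes = [
    {"NodeManagerAddress": "10.0.210.241", "Alive": True,
     "Resources": {"CPU": 96.0, "object_store_memory": 2632.0, "memory": 11494.0}},
    {"NodeManagerAddress": "10.0.0.233", "Alive": True,
     "Resources": {"CPU": 96.0, "object_store_memory": 2632.0, "memory": 11303.0}},
]

print(total_resource(sample_nodes, "object_store_memory"))
print(total_resource(sample_nodes, "memory"))
```

On a live cluster the call would be total_resource(ray.nodes(), "object_store_memory"). With the 35 node entries listed by ray.available_resources() at 2632.0 each, the total is 92120.0, which matches the object_store_memory value reported there.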

One more thing: I was watching the logs on the head node (monitor.out) while a worker was being set up, and I saw the line below.

The notable thing is that shm-size is getting passed twice: --shm-size=204gb --shm-size='"'"'200000000000b'"'"'

Any idea why --shm-size='"'"'200000000000b'"'"' might be getting passed? I didn’t specify that.

2021-02-11 16:30:10,670 VVINFO command_runner.py:511 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/f5fc0b9633/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.0.170.123 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker run --rm --name miami_container -d -it  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 -v /home/ubuntu/efs:/efs --shm-size=204gb --shm-size='"'"'200000000000b'"'"' --net=host 048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr bash)'`
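To compare the two --shm-size values in that command, they can be converted to bytes. The sketch below rests on two assumptions that are not confirmed in this thread: that Docker-style size suffixes are binary (1g = 1024**3 bytes), and that when a single-valued flag is repeated, the last occurrence takes effect. `shm_size_to_bytes` is a hypothetical helper, not a Docker or Ray function.

```python
import re

# Assumption: Docker-style size suffixes are binary (1g = 1024**3 bytes).
_UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def shm_size_to_bytes(text):
    """Convert a size such as '204gb' or '200000000000b' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([bkmg])b?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


# The two values from the logged docker run command, in order.
flags = ["204gb", "200000000000b"]
sizes = [shm_size_to_bytes(flag) for flag in flags]
effective = sizes[-1]  # assumption: the last repeated flag takes effect

for flag, size in zip(flags, sizes):
    print(f"--shm-size={flag}: {size} bytes")
print(f"effective under the last-wins assumption: {effective} bytes")
```

If both assumptions hold, /dev/shm would end up at 200000000000 bytes no matter how large the first value is, which would be consistent with the figure in the warning and with the warning not going away.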

I’m getting the same issue.

…LANG=C.UTF-8 --shm-size=250.00gb --shm-size='"'"'200000000000b'"'"'…

@jennakwon06 did you manage to find out what the problem was?
@ijrsvt