[Docker] [Multi-Node] ray submit fails due to permissions inside rayproject/ray:latest-cpu container

phesse001 · July 29, 2021, 8:16pm

Hello - I posted this as an issue but have gotten no feedback, so I'm putting it here and crossing my fingers…

After I start a cluster on my school's on-premise cluster with ray up <cluster-config.yaml>, I try to run a test Python file on it to make sure everything is working. I do this with ray submit <cluster-config.yaml> test_cluster_remote.py.

The error I get is:
python: can't open file '/home/ray/test_cluster_remote.py': [Errno 13] Permission denied

I verified that this is just an ownership issue: I attached to the node afterwards, ran ls -l in /home/ray, and got:

total 12
drwxr-xr-x 1 ray      users      28 Jun 29 18:14 anaconda3
-rw------- 1 ray      users    1259 Jul 21 14:24 ray_bootstrap_config.yaml
-rw------- 1 ray      users    1679 Nov 17  2020 ray_bootstrap_key.pem
-rwx------ 1 16302264 16200513  566 Jul 20 11:41 test_cluster_remote.py

Inside this node, once I change the owner of the file to ray, the script runs as expected.
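To illustrate why the ray user is locked out, here is a minimal sketch of the standard POSIX read check applied to the listing above. It assumes the ray user inside the container has UID 1000 and GID 100; I have not confirmed these values (running id ray inside the container would show the real ones). readable_by is a hypothetical helper written for this post, not part of Ray.

```python
import stat


def readable_by(st_mode, st_uid, st_gid, uid, gids):
    """Approximate the POSIX read-permission check for a regular file.

    Exactly one permission class applies: owner, then group, then other.
    ACLs and capabilities are ignored; root is treated as always allowed.
    """
    if uid == 0:
        return True
    if uid == st_uid:
        return bool(st_mode & stat.S_IRUSR)
    if st_gid in gids:
        return bool(st_mode & stat.S_IRGRP)
    return bool(st_mode & stat.S_IROTH)


# Values taken from the ls -l output for test_cluster_remote.py.
MODE = stat.S_IFREG | 0o700            # -rwx------
HOST_UID, HOST_GID = 16302264, 16200513

# ASSUMPTION: UID/GID of the ray user inside the container.
RAY_UID, RAY_GIDS = 1000, {100}

# As synced: the file keeps my host UID, so ray falls into "other".
before = readable_by(MODE, HOST_UID, HOST_GID, RAY_UID, RAY_GIDS)

# After chown to ray: ray is the owner and the owner read bit is set.
after = readable_by(MODE, RAY_UID, HOST_GID, RAY_UID, RAY_GIDS)

print("readable before chown:", before)
print("readable after chown: ", after)
```

With mode 0700, only the owner class has any access, so a file that keeps my host UID is unreadable to any other non-root user in the container. That matches the Errno 13 I see.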

Reproduction (REQUIRED)

The code I am trying to run is:

from collections import Counter
import socket
import time

import ray

ray.init(address='auto')
print('''This cluster consists of
  {} nodes in total
  {} CPU resources in total'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
  time.sleep(0.1)
  # Return IP address
  return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(500)]
ip_addrs = ray.get(object_ids)

print('Tasks executed')
for ip_addr, num_tasks in Counter(ip_addrs).items():
  print('    {} tasks on {}'.format(num_tasks, ip_addr))

The YAML file I use to start the Docker cluster is:

cluster_name: docker-ray-cluster 

docker:
     #image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
     #image: phesse001/ray-container:latest-cpu  # use this one if you don't need ML dependencies, it's faster to pull
     image: rayproject/ray:latest-cpu
    
     container_name: "ray_cluster_container"
     # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
     # if no cached version is present.
     pull_before_run: True
     run_options: []  # Extra options to pass into "docker run"

provider:
    type: local
    head_ip: 10.6.7.3
    worker_ips: [10.6.7.1, 10.6.7.5, 10.6.7.8, 10.6.7.7]
    
auth:
    ssh_user: phesse001
    # Optional if an ssh private key is necessary to ssh to the cluster.
    ssh_private_key: ~/.ssh/id_rsa

min_workers: 4

max_workers: 4

upscaling_speed: 1.0

idle_timeout_minutes: 5

cluster_synced_files: []

setup_commands: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --object-store-memory=1000000000 

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --object-store-memory=1000000000
