Hello - I posted this as an issue but have gotten no feedback, so I'm putting this here and crossing my fingers…
After I start a cluster on my school's on-premise cluster with ray up <cluster-config.yaml>, I try to run a test Python file on the cluster to ensure everything is working. I tried to do this with ray submit <cluster-config.yaml> test_cluster_remote.py.
The error I get is:
python: can't open file '/home/ray/test_cluster_remote.py': [Errno 13] Permission denied
I verified that this is just an ownership issue: I attached to the node afterwards, ran ls -l, and got:
total 12
drwxr-xr-x 1 ray users 28 Jun 29 18:14 anaconda3
-rw------- 1 ray users 1259 Jul 21 14:24 ray_bootstrap_config.yaml
-rw------- 1 ray users 1679 Nov 17 2020 ray_bootstrap_key.pem
-rwx------ 1 16302264 16200513 566 Jul 20 11:41 test_cluster_remote.py
Inside this node, when I change the owner of the file to ray, it works as expected.
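The same ownership check can be done from Python on the node. This is a minimal sketch using only the standard library; the helper name describe_access is made up for illustration, and the path is whatever you pass on the command line.

```python
import os
import stat
import sys


def describe_access(path):
    """Return owner, group, mode and readability of `path` for the current process.

    `describe_access` is a hypothetical helper, not part of Ray.
    """
    info = os.stat(path)
    return {
        "owner_uid": info.st_uid,               # numeric owner, as shown by ls -l
        "group_gid": info.st_gid,               # numeric group
        "mode": stat.filemode(info.st_mode),    # e.g. '-rwx------'
        "current_uid": os.getuid(),             # user the interpreter runs as
        "readable": os.access(path, os.R_OK),   # False is consistent with Errno 13
    }


if __name__ == "__main__":
    # Report on every path given as an argument.
    for path in sys.argv[1:]:
        print(path, describe_access(path))
```

If the diagnosis above is right, running this inside the container against /home/ray/test_cluster_remote.py would be expected to show an owner_uid that differs from current_uid and readable set to False, since the mode only grants access to the owner.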
Reproduction (REQUIRED)
The code I am trying to run is:
from collections import Counter
import socket
import time

import ray

ray.init(address='auto')

print('''This cluster consists of
{} nodes in total
{} CPU resources in total'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.1)
    # Return IP address
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(500)]
ip_addrs = ray.get(object_ids)

print('Tasks executed')
for ip_addr, num_tasks in Counter(ip_addrs).items():
    print(' {} tasks on {}'.format(num_tasks, ip_addr))
And the YAML file I use to start the Docker cluster is:
cluster_name: docker-ray-cluster
docker:
    #image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    #image: phesse001/ray-container:latest-cpu # use this one if you don't need ML dependencies, it's faster to pull
    image: rayproject/ray:latest-cpu
    container_name: "ray_cluster_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: [] # Extra options to pass into "docker run"
provider:
    type: local
    head_ip: 10.6.7.3
    worker_ips: [10.6.7.1, 10.6.7.5, 10.6.7.8, 10.6.7.7]
auth:
    ssh_user: phesse001
    # Optional if an ssh private key is necessary to ssh to the cluster.
    ssh_private_key: ~/.ssh/id_rsa
min_workers: 4
max_workers: 4
upscaling_speed: 1.0
idle_timeout_minutes: 5
cluster_synced_files: []
setup_commands: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --object-store-memory=1000000000
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --object-store-memory=1000000000