How to copy (non-pip) dependencies to cluster nodes

I’m currently trying to move from a single Ray instance to a Ray cluster, and I’m struggling to get my additional Python files onto the nodes.

When I ray submit my code, I get the following error:
ModuleNotFoundError: No module named 'tools'

To solve this I tried to use ray rsync-up cluster.yml tools tools, but the error persists. Where should the files be copied to so they can be loaded?
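
In other words, should the call look more like the following, with an explicit remote target path (the path here is only a guess on my part)?

ray rsync-up cluster.yml ./tools/ /home/ray/tools/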

I believe that ray rsync-up should copy the files to the worker nodes. @Ameer_Haj_Ali, could you please chime in here?

The files were successfully copied, but they were not available on the PYTHONPATH during ray submit. Do they need to be copied to a special directory, or is something else required?

@eoakes, thanks for tagging me. @eicnix, thanks for asking.
@eicnix, this might be happening because ray start is called on the head node in head_start_ray_commands; at that point all the Python paths are already loaded and cached, so for Ray to pick up the new modules you need to call ray stop and then ray start again.
It seems like what you are looking for is file_mounts.
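For example, an entry along these lines in your cluster YAML should sync the directory to every node (the remote path is just an illustration):

file_mounts: {
    "/home/ray/tools": "/path/to/local/tools",
}
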
Please let me know if that works for you.

Thank you for the input!

I’ve got it working using the file_mounts option you mentioned and also setting the additional PYTHONPATH via the Docker environment variable flag, roughly as sketched below.
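
Roughly, the idea is to mount the code onto the nodes with file_mounts and point PYTHONPATH at the mounted directory through a Docker run option; the snippet below is only a sketch with illustrative local paths:

file_mounts: {
    "/home/ray/reinforced-learning": "/path/to/local/reinforced-learning",
}

docker:
    run_options:
        - -e PYTHONPATH=/home/ray/reinforced-learning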

It would be great to have some documentation on how to actually use the cluster feature in a nontrivial example, as the current documentation only provides “hello world”-style examples.

Right now I’m facing another issue: only one of my three nodes is being set up and used in the cluster. Can you give me a pointer as to why?

I even started the cluster with ray up cluster.yml --min-nodes=2

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

## NOTE: Typically for local clusters, min_workers == max_workers == len(worker_ips).

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 2

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:  # Extra options to pass into "docker run"
        - -e PYTHONPATH=/home/ray/reinforced-learning

# Local specific configuration.
provider:
    type: local
    head_ip: 192.168.195.3
    worker_ips: [192.168.195.4, 192.168.195.5]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: root
    # Optional if an ssh private key is necessary to ssh to the cluster.
    # ssh_private_key: ~/.ssh/id_rsa

# Leave this empty.
head_node: {}

# Leave this empty.
worker_nodes: {}

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: True

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: 
  - pip install neptune-client
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[full] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --node-ip-address=192.168.195.3 --dashboard-host=192.168.195.3

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

The nodes are supposed to be started but are hanging in the uninitialized state:

2021-03-26 14:51:11,452 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffffaf0ba4bd9518a2c62fc2131202000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {0.000000/12.000000 CPU, 37.597656 GiB/37.597656 GiB memory, 12.939453 GiB/12.939453 GiB object_store_memory, 1.000000/1.000000 node:192.168.195.3}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2021-03-26 14:51:11,872 DEBUG (unknown file):0 -- gc.collect() freed 27 refs in 0.35091638000449166 seconds
(autoscaler +1m26s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +1m26s) Removing 2 nodes of type ray-legacy-worker-node-type (launch failed).
(autoscaler +1m31s) Adding 2 nodes of type ray-legacy-worker-node-type.
======== Autoscaler status: 2021-03-26 23:02:26.988953 ========
Node status
---------------------------------------------------------------
Healthy:

Pending:
 192.168.195.4: ray-legacy-worker-node-type, uninitialized
 192.168.195.5: ray-legacy-worker-node-type, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.00/67.285 GiB memory
 0.00/21.680 GiB object_store_memory
 16.0/16.0 CPU
 1.0/1.0 GPU

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors