Problems launching GCP cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,

At the moment I am using Ray Core to run several experiments in parallel. The
goal is to submit the script to the GCP cluster and let the autoscaler create
and remove workers as needed.
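
For reference, I bring the cluster up and submit the script with the autoscaler CLI, roughly like this (the config file name is just illustrative):

ray up cluster.yaml
ray submit cluster.yaml my_module/script_to_run.py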

The folder structure in my local machine looks something like this:

github_repository/
├─ my_module/
│  ├─ __init__.py
│  ├─ some_useful_functions.py
│  ├─ script_to_run.py

The file script_to_run.py has the Python code that I want to run and looks
something like this:

import ray

import tensorflow
import numpy as np

import my_module
from my_module.some_useful_functions import useful_function

@ray.remote
def run_experiment_with_ray(args):
    # run the python code for one experiment
    pass

def main():
    # Connect to the running cluster and ship my_module to every worker.
    ray.init(address='auto',
             runtime_env={'py_modules': [my_module]})
    # Launch one remote task per experiment and collect the object refs.
    run_ids = [run_experiment_with_ray.remote(experiment)
               for experiment in all_experiments_to_run]
    # Block until all experiments have finished.
    ray.get(run_ids)

The script runs perfectly on my local machine, and it also runs if I submit it to the GCP cluster with zero workers. That is, the script runs fine on the head node, but it fails whenever a task is sent to a worker node.

To set up the head node I added the following line to the head_setup_commands:

head_setup_commands:
  - git clone my_repo && cd my_repo && pip install -e . && cd ..

The .yaml file is as follows:

cluster_name: hadamard

max_workers: 3

upscaling_speed: 1.0

docker:
  image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
  container_name: "ray_container"
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536

idle_timeout_minutes: 5

provider:
    type: gcp
    region: europe-west1
    availability_zone: europe-west1-c
    project_id: ray-cluster-352608 # Globally unique project id

auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu

    ray_worker_small:
        min_workers: 0
        max_workers: 3
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            scheduling:
              - preemptible: true


head_node_type: ray_head_default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: []


head_setup_commands:
  - sudo apt-get install libpython3.7
  - pip install --upgrade pip
  - pip install google-api-python-client==1.7.8
  - cd my_repository && pip install -e . && cd ..

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

head_node: {}
worker_nodes: {}

And, with this configuration I get the following error:

(scheduler +21m0s) Resized to 4 CPUs.
Traceback (most recent call last):
  File "/home/ray/train_dqn_emp_ray.py", line 269, in <module>
    app.run(main)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/ray/train_dqn_emp_ray.py", line 265, in main
    ray.get(finished)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1831, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::train_with_ray() (pid=197, ip=10.132.0.36)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RaySystemError: System error: No module named 'keras.saving.pickle_utils'
traceback: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 340, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 237, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 180, in _deserialize_pickle5_data
    obj = pickle.loads(in_band, buffers=buffers)
ModuleNotFoundError: No module named 'keras.saving.pickle_utils'

But this only happens after the worker is launched. That is, for a few minutes I have two instances of train_with_ray running on the head node but, when the new worker is launched, I get the previous error.

Versions

  • Ray version: 1.13.0

Looks like some dependencies are not being installed on your worker nodes:

ray.exceptions.RaySystemError: System error: No module named 'keras.saving.pickle_utils'

Try moving your head_setup_commands to setup_commands instead? This will run the set of commands on all nodes instead of just the head node.
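
For example, something along these lines, reusing the commands from your config (repository name abbreviated as in your post):

setup_commands:
  - pip install --upgrade pip
  - pip install google-api-python-client==1.7.8
  - git clone my_repo && cd my_repo && pip install -e . && cd ..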

Thank you so much for your reply. I moved a few things to the setup_commands:

setup_commands:
  - pip install --upgrade pip
  - pip install tensorflow
  - pip install ...remaining libraries...

But I am faced with the error:

Traceback (most recent call last):
  File "/home/ray/train_dqn_emp_ray.py", line 270, in <module>
    app.run(main)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/ray/train_dqn_emp_ray.py", line 266, in main
    ray.get(finished)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1831, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::train_with_ray() (pid=809, ip=10.132.0.29)
RuntimeError: The remote function failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

ray::train_with_ray() (pid=809, ip=10.132.0.29)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py", line 698, in subimport
    __import__(name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/acme/__init__.py", line 35, in <module>
    from acme.environment_loop import EnvironmentLoop
  File "/home/ray/anaconda3/lib/python3.7/site-packages/acme/environment_loop.py", line 26, in <module>
    from acme.utils import signals
  File "/home/ray/anaconda3/lib/python3.7/site-packages/acme/utils/signals.py", line 22, in <module>
    import launchpad
  File "/home/ray/anaconda3/lib/python3.7/site-packages/launchpad/__init__.py", line 36, in <module>
    from launchpad.nodes.courier.node import CourierHandle
  File "/home/ray/anaconda3/lib/python3.7/site-packages/launchpad/nodes/courier/node.py", line 21, in <module>
    import courier
  File "/home/ray/anaconda3/lib/python3.7/site-packages/courier/__init__.py", line 26, in <module>
    from courier.python.client import Client  # pytype: disable=import-error
  File "/home/ray/anaconda3/lib/python3.7/site-packages/courier/python/client.py", line 30, in <module>
    from courier.python import py_client
ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
Shared connection to 34.79.87.40 closed.
Error: Command failed:

This again appears to indicate that some dependencies are not being installed on the worker nodes. For the life of me I cannot think of any more libraries that my code needs and that can be installed with pip. Thus, the only thing that could possibly be raising the error would be a failure to import my own module. Again, the structure of my code is as follows:

github_repository/
├─ my_module/
│  ├─ __init__.py
│  ├─ some_useful_functions.py
│  ├─ script_to_run.py

The file script_to_run.py has the Python code that I want to run and looks
something like this:

import ray

import tensorflow
import numpy as np

import my_module
from my_module.some_useful_functions import useful_function

@ray.remote
def run_experiment_with_ray(args):
    # run the python code for one experiment
    pass

def main():
    # Connect to the running cluster and ship my_module to every worker.
    ray.init(address='auto',
             runtime_env={'py_modules': [my_module]})
    # Launch one remote task per experiment and collect the object refs.
    run_ids = [run_experiment_with_ray.remote(experiment)
               for experiment in all_experiments_to_run]
    # Block until all experiments have finished.
    ray.get(run_ids)

But shouldn’t the line of code:

runtime_env={'py_modules': [my_module]}

handle the setup of my_module in the workers?

Also the error

ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory

has appeared before when I was just trying to run everything on the head node. But the line:

head_setup_commands:
  - sudo apt-get install libpython3.7

solved it. I tried putting this in the setup_commands, but then the worker setup runs forever. I am guessing that this line causes the worker setup to get stuck somewhere.

Figured it out:

It needed to be:

setup_commands:
  - sudo apt-get -y install libpython3.7

to answer with yes to all prompts and prevent the workers from getting stuck.
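
So at this point my setup_commands section looks roughly like this:

setup_commands:
  - sudo apt-get -y install libpython3.7
  - pip install --upgrade pip
  - pip install tensorflow
  - pip install ...remaining libraries...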

Just to clarify, are you still seeing the RuntimeError about the failed import?

If so, I believe the problem still has to do with the setup for worker nodes. You will need to make sure that all worker nodes share the same local file structure as your head node. You can do this by moving the clone and install lines to the general setup_commands or by using file_mounts to sync directories across all nodes.

In general, I would recommend using only setup_commands to make sure all of your nodes are identical, instead of overriding head_setup_commands. Let me know how that goes!
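
One more note: runtime_env py_modules does ship the my_module source to the workers, but it does not install pip dependencies (tensorflow, acme, launchpad, ...) or shared libraries like libpython3.7, which is what the failing imports point at, so those still need to come from setup_commands. If you prefer file_mounts over cloning on every node, a rough sketch (assuming the repository lives at ~/github_repository on the machine you run ray up from; paths are illustrative):

file_mounts:
  "/home/ray/github_repository": "~/github_repository"

setup_commands:
  - sudo apt-get -y install libpython3.7
  - pip install --upgrade pip
  - cd /home/ray/github_repository && pip install -e . && cd ..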