How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi,
At the moment I am using ray-core to run several experiments in parallel. The
goal is to submit the script to a GCP instance and let the autoscaler create and
remove workers as needed.
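For context, I bring the cluster up and submit the script with the standard Ray cluster launcher CLI, roughly like this (a sketch; my exact file names and paths differ):

```bash
# Sketch of my submission workflow (file names and paths are illustrative).
ray up cluster.yaml                                  # start the head node on GCP
ray submit cluster.yaml my_module/script_to_run.py   # run the driver script on the head node
```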
The folder structure on my local machine looks something like this:
```
github_repository/
├─ my_module/
│  ├─ __init__.py
│  ├─ some_useful_functions.py
│  ├─ script_to_run.py
```
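The package is pip-installable from the repository root (this is what the `pip install -e .` in the setup commands below relies on); a minimal sketch of the packaging file, which may differ from the real one:

```python
# setup.py -- minimal sketch; the actual packaging file in my repository may differ.
from setuptools import find_packages, setup

setup(
    name="my_module",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["tensorflow", "numpy"],
)
```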
The file script_to_run.py contains the Python code that I want to run and looks something like this:
```python
import ray
import tensorflow
import numpy as np

import my_module
from my_module.some_useful_functions import useful_function


@ray.remote
def run_experiment_with_ray(args):
    # run python code
    pass


def main():
    ray.init(address='auto',
             runtime_env={'py_modules': [my_module]})
    # all_experiments_to_run is defined elsewhere in the real script
    run_ids = [run_experiment_with_ray.remote(experiment)
               for experiment in all_experiments_to_run]
    ray.get(run_ids)


if __name__ == '__main__':
    main()
```
The script runs perfectly on my local machine, and it also runs if I submit it to the GCP cluster with zero workers; that is, it runs perfectly on the head node. But it fails whenever a task is sent to a worker node.
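For illustration, a small helper like the one below (hypothetical, not part of my actual script) reports which node a task lands on and which TensorFlow/keras versions that worker process can import; on my cluster the failure only shows up once tasks start landing on the freshly launched worker:

```python
import socket

import ray


@ray.remote
def report_environment():
    # Runs inside whichever Ray worker process picks up the task: returns the
    # node's hostname plus the TensorFlow/keras versions importable there.
    import tensorflow as tf
    try:
        import keras
        keras_version = keras.__version__
    except ImportError:
        keras_version = None
    return socket.gethostname(), tf.__version__, keras_version


if __name__ == '__main__':
    ray.init(address='auto')
    print(ray.get([report_environment.remote() for _ in range(4)]))
```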
To set up the head node, I added the following line to the head_setup_commands section:

```yaml
head_setup_commands:
    - git clone my_repo && cd my_repo && pip install -e . && cd ..
```
The .yaml file is as follows:
```yaml
cluster_name: hadamard

max_workers: 3
upscaling_speed: 1.0

docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    container_name: "ray_container"
    pull_before_run: True
    run_options:  # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

idle_timeout_minutes: 5

provider:
    type: gcp
    region: europe-west1
    availability_zone: europe-west1-c
    project_id: ray-cluster-352608 # Globally unique project id

auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 50
                    sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
    ray_worker_small:
        min_workers: 0
        max_workers: 3
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 50
                    sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            scheduling:
              - preemptible: true

head_node_type: ray_head_default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: []

head_setup_commands:
    - sudo apt-get install libpython3.7
    - pip install --upgrade pip
    - pip install google-api-python-client==1.7.8
    - git clone my_repo && cd my_repo && pip install -e . && cd ..
worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
        ray start
        --head
        --port=6379
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
        ray start
        --address=$RAY_HEAD_IP:6379
        --object-manager-port=8076

head_node: {}
worker_nodes: {}
```
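For reference, I watch the autoscaler and check what is installed on the head node with the standard cluster launcher commands (a sketch; my exact invocations may differ):

```bash
# Sketch of how I observe the cluster (exact invocations may differ).
ray monitor cluster.yaml                                           # tail the autoscaler logs ("Resized to N CPUs", worker launches)
ray exec cluster.yaml 'pip freeze | grep -iE "keras|tensorflow"'   # packages visible on the head node
```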
With this configuration, I get the following error:
```
(scheduler +21m0s) Resized to 4 CPUs.
Traceback (most recent call last):
  File "/home/ray/train_dqn_emp_ray.py", line 269, in <module>
    app.run(main)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/ray/train_dqn_emp_ray.py", line 265, in main
    ray.get(finished)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1831, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::train_with_ray() (pid=197, ip=10.132.0.36)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RaySystemError: System error: No module named 'keras.saving.pickle_utils'
traceback: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 340, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 237, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py", line 180, in _deserialize_pickle5_data
    obj = pickle.loads(in_band, buffers=buffers)
ModuleNotFoundError: No module named 'keras.saving.pickle_utils'
```
But this only happens after the worker is launched. That is, for a few minutes I have two instances of train_with_ray running on the head node, but as soon as the new worker is launched, I get the error above.
Versions
- Ray version: 1.13.0