How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello,
I have the following situation: I use this file to start up a Ray cluster:
```yaml
cluster_name: test2
max_workers: 2
upscaling_speed: 1.0
docker:
    image: rayproject/ray:latest-cpu
    container_name: "ray_container"
    pull_before_run: False
    run_options:
        - --ulimit nofile=65536:65536
idle_timeout_minutes: 5
provider:
    type: azure
    location: westeurope
    resource_group: testing-ray-training
    cache_stopped_nodes: False
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    ssh_public_key: ~/.ssh/id_rsa.pub
available_node_types:
    ray.head.default:
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2ads_v5
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
    ray.worker.default:
        min_workers: 0
        max_workers: 2
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2ads_v5
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                priority: Spot
head_node_type: ray.head.default
file_mounts: {
    "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub",
    "~/conda_environment.yaml": "/home/testuser/environment.yaml"
}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful
setup_commands: []
    # - conda env create --name=testing --file=~/conda_environment.yaml
    # - conda activate testing
head_setup_commands:
    - conda env create --name=testing --file=~/conda_environment.yaml
worker_setup_commands:
    - conda env create --name=testing --file=~/conda_environment.yaml
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}
```
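For completeness, I launch the cluster and reach it from my laptop roughly like this (the filename `cluster.yaml` is my choice; `ray dashboard` port-forwards the dashboard to localhost:8265, which is where the `JobSubmissionClient` below connects):

```shell
# bring the cluster up from the autoscaler config above
ray up cluster.yaml
# forward the dashboard (port 8265) to localhost so jobs can be submitted to it
ray dashboard cluster.yaml
```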
This sets up a cluster with one head node and up to 2 optional workers.
Now I submit a job to it using `python submit.py`, where `submit.py` is the following file:
```python
import testing
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    # Entrypoint shell command to execute
    entrypoint='python ray_example.py',
    # Runtime environment for the job, specifying a working directory and py_modules
    runtime_env={
        'py_modules': [testing],
        'working_dir': './'
    }
)
```
where `ray_example.py` is a simple Python script using Ray:
```python
import time
import socket
import ray
from pprint import pprint


@ray.remote
def f(x):
    time.sleep(x)
    return socket.gethostname()


def test(num_operations: int):
    start = time.time()
    run = [f.remote(1) for _ in range(num_operations)]
    res = ray.get(run)
    end = time.time()
    print('duration in seconds (verbose):', end - start)
    counts = {x: res.count(x) for x in res}
    print(counts)


if __name__ == '__main__':
    import os
    os.system('pwd')
    ray.init(address='auto')
    print("Ray was initialized")
    pprint(ray.nodes())
    test(200)
    pprint(ray.nodes())
    ray.shutdown()
    print("Ray was shut down")
```
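(Aside on the script itself: the `{x: res.count(x) for x in res}` tally is quadratic in the number of results; an equivalent using `collections.Counter`, with made-up hostnames for illustration, would be:)

```python
from collections import Counter

# stand-in for the hostnames returned by the remote tasks
res = ['host-a', 'host-b', 'host-a', 'host-a']
counts = dict(Counter(res))
print(counts)  # {'host-a': 3, 'host-b': 1}
```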
All of this works.
Now, when I want to use the conda environment that I set up in the head and worker setup commands, I change the `runtime_env` variable in `submit.py` to:

```python
runtime_env={
    'py_modules': [testing],
    'working_dir': './',
    'conda': 'testing'
}
```
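For context on what I'm trying: as far as I understand the docs, `runtime_env['conda']` can be either the name of an environment that already exists on every node, or an inline conda spec that Ray creates on demand. A sketch of both forms (the package names in the inline spec are illustrative, not my actual environment):

```python
# Form 1: name of a pre-existing conda env on each node (what I use above)
runtime_env_named = {'conda': 'testing'}

# Form 2: inline conda environment spec, created by Ray per job;
# the listed packages are only an example
runtime_env_inline = {
    'conda': {
        'dependencies': ['pip', {'pip': ['requests']}]
    }
}
```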
I did check with `ray attach` that this environment exists on the head node; I cannot check whether it exists on the workers.
If I now submit the job again with that change, the submitted job stays in `status='PENDING'` forever. Running `ray job list --address="http://127.0.0.1:8265"` shows:

```
'raysubmit_SUhJu4BkVnUdrYhx': JobInfo(status='PENDING', entrypoint='python ray_example.py', message='Job has not started yet, likely waiting for the runtime_env to be set up.', error_type=None, start_time=1653399139, end_time=None, metadata={}, runtime_env={'py_modules': ['gcs://_ray_pkg_a37545df4f19625d.zip'], 'working_dir': 'gcs://_ray_pkg_05ca626e61fe7840.zip', 'conda': 'testing', '_ray_commit': '{{RAY_COMMIT_SHA}}'}),
```
Some additional information:
The conda environment contains
- python==3.9.10
- ray-core=1.9.2=py39h714431f_0
- ray-dashboard=1.9.2=py39h9f3bf79_0
- ray-default=1.9.2=py39hf3d152e_0
- ray-tune=1.9.2=py39hf3d152e_0
amongst many other conda packages; these versions are unfortunately pinned, since conda reports conflicts otherwise.
What am I doing wrong? What is the right way to define a custom conda environment (one that also uses a different Ray version than the cluster itself, which I believe runs 1.12.1)?
Thank you!