Custom conda environment does not allow jobs to execute

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,

I have the following situation:

I use this file to start up a Ray cluster:

cluster_name: test2
max_workers: 2
upscaling_speed: 1.0

docker:
    image: rayproject/ray:latest-cpu
    container_name: "ray_container"
    pull_before_run: False
    run_options:
        - --ulimit nofile=65536:65536
idle_timeout_minutes: 5

provider:
    type: azure
    location: westeurope
    resource_group: testing-ray-training
    cache_stopped_nodes: False
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    ssh_public_key: ~/.ssh/id_rsa.pub
available_node_types:
    ray.head.default:
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2ads_v5
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest

    ray.worker.default:
        min_workers: 0
        max_workers: 2
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2ads_v5
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                priority: Spot

head_node_type: ray.head.default

file_mounts: {
     "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub",
     "~/conda_environment.yaml": "/home/testuser/environment.yaml"
}
cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

setup_commands: []
#    - conda env create --name=testing --file=~/conda_environment.yaml
#    - conda activate testing

head_setup_commands:
    - conda env create --name=testing --file=~/conda_environment.yaml

worker_setup_commands:
    - conda env create --name=testing --file=~/conda_environment.yaml


head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

This will set up a cluster with one head node and up to two workers.
Now I submit a job to it using python submit.py, where submit.py is the following file:

import testing
from ray.job_submission import JobSubmissionClient


client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    # Entrypoint shell command to execute
    entrypoint='python ray_example.py',
    # Runtime environment for the job, specifying a working directory and py_modules
    runtime_env={
      'py_modules': [testing],
      'working_dir': './'
    }
)

where ray_example.py is a simple Python script using Ray:

import time
import socket
import ray
from pprint import pprint

@ray.remote
def f(x):
  time.sleep(x)
  return socket.gethostname()


def test(num_operations: int):
  start = time.time()
  run = [f.remote(1) for _ in range(num_operations)]
  res = ray.get(run)
  end = time.time()
  print('duration in seconds (verbose):', end - start)
  counts = {x: res.count(x) for x in res}
  print(counts)


if __name__ == '__main__':
  import os
  os.system('pwd')
  ray.init(address='auto')
  print("Ray was initialized")
  pprint(ray.nodes())
  test(200)
  pprint(ray.nodes())
  ray.shutdown()

  print("Ray was shut down")

All of this works.
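
To watch the job from the submitting machine I also poll its status with roughly the following (a small sketch using the same JobSubmissionClient; the wait_for_job helper and the five-second interval are just my own choices):

import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://127.0.0.1:8265")

def wait_for_job(job_id: str, poll_seconds: float = 5.0) -> JobStatus:
    # Poll the job until it leaves the PENDING/RUNNING states.
    while True:
        status = client.get_job_status(job_id)
        print("status:", status)
        if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
            return status
        time.sleep(poll_seconds)

# Example: after job_id = client.submit_job(...)
# wait_for_job(job_id)
# print(client.get_job_logs(job_id))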

Now, when I want to use my conda environment, which I create in the head_setup_commands and worker_setup_commands of the cluster file above, I change the runtime_env argument in submit.py to

    runtime_env={
      'py_modules': [testing],
      'working_dir': './',
      'conda': 'testing'
    }

I did check that this environment exists on the head node via ray attach; I cannot check whether it exists on the workers.
If I now submit the job again with that change, the job stays at status='PENDING' forever when I run ray job list --address="http://127.0.0.1:8265":

 'raysubmit_SUhJu4BkVnUdrYhx': JobInfo(status='PENDING', entrypoint='python ray_example.py', message='Job has not started yet, likely waiting for the runtime_env to be set up.', error_type=None, start_time=1653399139, end_time=None, metadata={}, runtime_env={'py_modules': ['gcs://_ray_pkg_a37545df4f19625d.zip'], 'working_dir': 'gcs://_ray_pkg_05ca626e61fe7840.zip', 'conda': 'testing', '_ray_commit': '{{RAY_COMMIT_SHA}}'}),
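
Since I cannot easily check the workers directly, my plan is to probe them with a small script run on the cluster (e.g. from the head via ray attach, or submitted without the conda field). This is only a rough sketch: the list_conda_envs task is my own helper, it assumes conda is on the PATH inside the containers, and a handful of tasks is not guaranteed to land on every node:

import subprocess
import socket

import ray

ray.init(address='auto')

@ray.remote
def list_conda_envs():
    # Report which node the task ran on and which conda environments it sees.
    out = subprocess.run(['conda', 'env', 'list'],
                         capture_output=True, text=True)
    return socket.gethostname(), out.stdout

# Fire a handful of tasks; with luck they get scheduled across all nodes.
for host, envs in ray.get([list_conda_envs.remote() for _ in range(8)]):
    print(host)
    print(envs)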

Some additional information:
The conda environment contains

- python==3.9.10
- ray-core=1.9.2=py39h714431f_0
- ray-dashboard=1.9.2=py39h9f3bf79_0
- ray-default=1.9.2=py39hf3d152e_0
- ray-tune=1.9.2=py39hf3d152e_0

among many other conda packages; these versions are unfortunately pinned, because conda reports conflicts otherwise.

What am I doing wrong? What is the right way to define a custom conda environment (also one using a different Ray version than the cluster itself, which I believe is 1.12.1)?
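
One variant I have not tried yet would be to hand the environment file itself to runtime_env and let Ray create the environment per job, instead of pre-creating a named environment in the setup commands. Roughly like this (untested sketch; the path is relative to where submit.py runs, and as far as I understand ray itself must then not be listed in the file):

    runtime_env={
      'py_modules': [testing],
      'working_dir': './',
      # Path to a local conda environment YAML instead of a named env;
      # Ray then creates the environment on the nodes for this job.
      'conda': './conda_environment.yaml'
    }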

Thank you!

Thanks for the great details in this question!

@architkulkarni, I see you implemented conda specification in runtime_env. Do you have any idea what is going wrong here?

cc @eoakes, I think @architkulkarni is out.

Hi there,

@cade @architkulkarni @eoakes I did some more debugging; I am not sure if this information is helpful, but here you go:

EDIT:

If I run my code without the custom conda environment, Ray runs fine.
I also logged into the head node and had a look at ray status, and did docker exec -it XXXXXXXXXXX /bin/bash into a worker to see what is happening there.

The difference between running with and without the custom conda environment as a runtime_env argument is that, with conda, nothing ever reaches the workers.
If the workers are not running yet, they are not even started.
The status of the job always stays PENDING.

So it seems that even though everything needed to run my code is inside the conda environment, its Python and Ray versions must match the Python and Ray versions in the docker image.

I also tried using a docker image that is closer in version to what I need; the closest I could find was python=3.9.5 and ray=1.12.1, but that did not help.

Is there no way to have a conda environment whose versions are completely decoupled from the cluster? I expected the docker image's Ray/Python installation to act only as a management layer and not to interfere with the code executed on the nodes.

Is there a way to have different python/ray versions in the docker image and conda environment?

If not, then I need to build my own docker image. The question then is what is required and what is not. I had a look at Ray's docker images and there is a ton of stuff in there :slight_smile:.
Is there a minimal “DIY” docker image which I could build upon?
I found this: ray/build-docker.sh at master · ray-project/ray · GitHub, but at first glance it's not quite clear how it is supposed to be used.

TL;DR:

  1. Can the docker image and the conda environment I want to use have separate Python/Ray installations? If yes, how? If not, what is the best alternative?
  2. In case building my own docker image is the only way to go, what needs to be in the docker image and what can I leave out?

Thank you!

Update: this seems to come from OpenShift using a restrictive uid-range by default. For future readers: make sure to set your uid-range on your namespace:

kubectl annotate namespace ray --overwrite openshift.io/sa.scc.uid-range=1000/10000

History

I am also unable to use conda with a runtime_env when using the Helm chart to deploy to a Kubernetes cluster. The relevant error message seems to be:

NoWritablePkgsDirError: No writeable pkgs directories configured.
 - /home/ray/anaconda3/pkgs
 - /home/ray/.conda/pkgs

I have tried both with and without a prefix entry in the conda environment.yml, and I have tried both the default Ray base image and the Ray nightly GPU base image.

I have also tried running conda env create -f myenv.yml inside a docker container myself, and it works. Is it possible that Ray is running that command as some other user?
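
To check that, I would run something like the following as a Ray task on the cluster (rough sketch; check_permissions is just my own helper, and the two paths are the ones from the error above):

import os

import ray

ray.init(address='auto')

@ray.remote
def check_permissions():
    # Which uid do the Ray workers run as, and can they write the conda pkgs dirs?
    dirs = ['/home/ray/anaconda3/pkgs', '/home/ray/.conda/pkgs']
    return {
        'uid': os.getuid(),
        'writable': {d: os.access(d, os.W_OK) for d in dirs},
    }

print(ray.get(check_permissions.remote()))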

environment.yml

name: basepython3
channels:
  - pytorch
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=4.5=1_gnu
  - blas=1.0=mkl
  - brotlipy=0.7.0=py36h27cfd23_1003
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2021.10.26=h06a4308_2
  - certifi=2021.5.30=py36h06a4308_0
  - cffi=1.14.6=py36h400218f_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - conda=4.10.3=py36h06a4308_0
  - conda-package-handling=1.7.3=py36h27cfd23_1
  - cryptography=35.0.0=py36hd23ed53_0
  - cudatoolkit=11.3.1=h2bc3f7f_2
  - dataclasses=0.8=pyh4f3eec9_6
  - ffmpeg=4.2.2=h20bf706_0
  - freetype=2.11.0=h70c0345_0
  - gmp=6.2.1=h2531618_2
  - gnutls=3.6.15=he1e5248_0
  - idna=3.3=pyhd3eb1b0_0
  - intel-openmp=2022.0.1=h06a4308_3633
  - jpeg=9d=h7f8727e_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.35.1=h7274673_9
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.3.0=h5101ec6_17
  - libgomp=9.3.0=h5101ec6_17
  - libidn2=2.3.2=h7f8727e_0
  - libopus=1.3.1=h7b6447c_0
  - libpng=1.6.37=hbc83047_0
  - libstdcxx-ng=9.3.0=hd4cf53a_17
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.2.0=h85742a9_0
  - libunistring=0.9.10=h27cfd23_0
  - libuv=1.40.0=h7b6447c_0
  - libvpx=1.7.0=h439df22_0
  - libwebp-base=1.2.2=h7f8727e_0
  - lz4-c=1.9.3=h295c915_1
  - mkl=2020.2=256
  - mkl-service=2.3.0=py36he8ac12f_0
  - mkl_fft=1.3.0=py36h54f3939_0
  - mkl_random=1.1.1=py36h0573a6f_0
  - ncurses=6.3=h7f8727e_2
  - nettle=3.7.3=hbbd107a_1
  - numpy=1.19.2=py36h54aff64_0
  - numpy-base=1.19.2=py36hfa32c7d_0
  - olefile=0.46=py36_0
  - openh264=2.1.1=h4ff587b_0
  - openjpeg=2.4.0=h3ad879b_0
  - openssl=1.1.1m=h7f8727e_0
  - pillow=8.3.1=py36h2c7a002_0
  - pip=21.2.2=py36h06a4308_0
  - pycosat=0.6.3=py36h27cfd23_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=22.0.0=pyhd3eb1b0_0
  - pysocks=1.7.1=py36h06a4308_0
  - python=3.6.13=h12debd9_1
  - pytorch=1.10.2=py3.6_cuda11.3_cudnn8.2.0_0
  - pytorch-mutex=1.0=cuda
  - readline=8.1.2=h7f8727e_1
  - requests=2.27.1=pyhd3eb1b0_0
  - ruamel_yaml=0.15.100=py36h27cfd23_0
  - setuptools=58.0.4=py36h06a4308_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.37.2=hc218d9a_0
  - tk=8.6.11=h1ccaba5_0
  - torchaudio=0.10.2=py36_cu113
  - torchvision=0.11.3=py36_cu113
  - tqdm=4.62.3=pyhd3eb1b0_1
  - typing_extensions=3.10.0.2=pyh06a4308_0
  - urllib3=1.26.8=pyhd3eb1b0_0
  - wheel=0.37.1=pyhd3eb1b0_0
  - x264=1!157.20191217=h7b6447c_0
  - xz=5.2.5=h7b6447c_0
  - yaml=0.2.5=h7b6447c_0
  - zlib=1.2.11=h7f8727e_4
  - zstd=1.4.9=haebb681_0
  - pip:
    - attrs==21.4.0
    - click==8.0.4
    - deprecated==1.2.13
    - filelock==3.4.1
    - grpcio==1.44.0
    - importlib-metadata==4.8.3
    - jsonschema==3.2.0
    - msgpack==1.0.3
    - packaging==21.3
    - protobuf==3.19.4
    - pyparsing==3.0.7
    - pyrsistent==0.18.0
    - pyyaml==6.0
    - ray==1.10.0
    - redis==4.1.4
    - wrapt==1.13.3
    - zipp==3.6.0
prefix: /tmp

@M_S the Python and Ray versions in the conda environment currently need to match those on the cluster. It's possible that this might work in some cases, but we don't make any guarantees.

Sorry for the confusion, I will make sure this is clearly documented to avoid others going down the same path in the future.
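
As a practical note (sketch only): the versions shipped in the cluster image can be printed directly on the head node, e.g. via ray attach, so the conda environment can pin the same ones:

# Print the Python and Ray versions the cluster image ships, so a custom
# conda environment can pin matching versions.
import sys

import ray

print('python:', sys.version.split()[0])
print('ray:', ray.__version__)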

Hello!

@starpit I am not sure this addresses my original issue: I was not running this in Kubernetes, and I had no problem creating the environment; I only had problems using it to run my submitted jobs.

@eoakes Good to know, thank you for clarifying. I had already read that one should not specify Ray/Python versions when specifying a conda or pip environment in runtime_env, for exactly that reason. I thought explicitly creating my own conda environment beforehand would be different, though.
Is this something that will be supported in the future?
Especially when it comes to Python versions, the supported choices seem very specific (e.g. why 3.9.5 and not some newer 3.9.*, or generically any 3.9?).