Runtime env docker image: Check failed: !job_id_env.empty() error

The docker image is created by the following Dockerfile:

FROM ubuntu:20.04
RUN apt-get update -y
RUN apt-get install -y python3 python3-pip
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip3 install --no-cache-dir absl-py "ray[default]"
RUN pip3 install --no-cache-dir "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
ENTRYPOINT ["python3", "--version"]

And when I run the job, I get the following error:

{"command_prefix": ["cd", "/tmp/ray/session_2023-05-04_05-54-58_484700_14533/runtime_resources/working_dir_files/_ray_pkg_6068c19fb3b8530f", "&&"], "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "PYTHONPATH": "/tmp/ray/session_2023-05-04_05-54-58_484700_14533/runtime_resources/working_dir_files/_ray_pkg_6068c19fb3b8530f"}, "py_executable": "docker run -v /tmp/ray:/tmp/ray --network=host --privileged --pid=host --ipc=host --env RAY_RAYLET_PID=15100 -v /home/yejingxin/ray_venv/lib/python3.8/site-packages/ray:/home/yejingxin/ray_venv/lib/python3.8/site-packages/ray --entrypoint python3 gcr.io/cloud-tpu-v2-images/grpc_tpu_worker_v4:yejingxin-debug", "resources_dir": null, "container": {}, "java_jars": []}
[2023-05-04 05:55:10,491 C 15714 15714] core_worker.cc:50:  Check failed: !job_id_env.empty() 
*** StackTrace Information ***
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xd7dcaa) [0x7f3a4b1f6caa] ray::operator<<()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xd7f792) [0x7f3a4b1f8792] ray::SpdLogMessage::Flush()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7f3a4b1f8aa7] ray::RayLog::~RayLog()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core15GetProcessJobIDERKNS0_17CoreWorkerOptionsE+0x10b) [0x7f3a4aafbdfb] ray::core::GetProcessJobID()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core10CoreWorkerC1ERKNS0_17CoreWorkerOptionsERKNS_8WorkerIDE+0x8a) [0x7f3a4aafbf0a] ray::core::CoreWorker::CoreWorker()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImplC2ERKNS0_17CoreWorkerOptionsE+0x587) [0x7f3a4ab03167] ray::core::CoreWorkerProcessImpl::CoreWorkerProcessImpl()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess10InitializeERKNS0_17CoreWorkerOptionsE+0xcf) [0x7f3a4ab041bf] ray::core::CoreWorkerProcess::Initialize()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x53a895) [0x7f3a4a9b3895] __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
/usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x53bf33) [0x7f3a4a9b4f33] __pyx_tp_new_3ray_7_raylet_CoreWorker()
ray::IDLE(_PyObject_MakeTpCall+0x183) [0x5f6f43] _PyObject_MakeTpCall
ray::IDLE(_PyEval_EvalFrameDefault+0x5dae) [0x57107e] _PyEval_EvalFrameDefault
ray::IDLE(_PyEval_EvalCodeWithName+0x26a) [0x569cea] _PyEval_EvalCodeWithName
ray::IDLE(_PyFunction_Vectorcall+0x393) [0x5f6a13] _PyFunction_Vectorcall
ray::IDLE(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1] _PyEval_EvalFrameDefault
ray::IDLE(_PyEval_EvalCodeWithName+0x26a) [0x569cea] _PyEval_EvalCodeWithName
ray::IDLE(PyEval_EvalCode+0x27) [0x68e7b7] PyEval_EvalCode
ray::IDLE() [0x680001]
ray::IDLE() [0x68007f]
ray::IDLE() [0x680121]
ray::IDLE(PyRun_SimpleFileExFlags+0x197) [0x680db7] PyRun_SimpleFileExFlags
ray::IDLE(Py_RunMain+0x212) [0x6b8122] Py_RunMain
ray::IDLE(Py_BytesMain+0x2d) [0x6b84ad] Py_BytesMain
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f3a4bda2083] __libc_start_main
ray::IDLE(_start+0x2e) [0x5fb39e] _start

@GuyangSong Could you help me with the issue?

Can you show me your code?

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

job_ids = []
for _ in range(1):
    job_id = client.submit_job(
        # Entrypoint shell command to execute
        entrypoint=(
            "python3 -c \"import jax; print('num: ', jax.device_count())\""
        ),
        runtime_env={
            # Working dir
            "working_dir": "/home/yejingxin/docker_ray",
            "pip": [],
            "container": {
                "image": "gcr.io/cloud-tpu-v2-images/grpc_tpu_worker_v4:yejingxin-debug",
                # "worker_path": "/usr/local/lib/python3.8/dist-packages/ray/_private/workers/default_worker.py",
                "run_options": [
                    "--env PATH=/home/yejingxin/ray2/bin:$PATH",
                    "-v /home/yejingxin/ray2/:/home/yejingxin/ray2/",
                ],
            },
        },
    )
    job_ids.append(job_id)

print(job_ids)
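For reference, a minimal sketch for polling the submitted jobs and pulling their logs, using the standard JobSubmissionClient API (client and job_ids are the objects defined above):

import time

from ray.job_submission import JobStatus

# Poll each job until it reaches a terminal state, then print its logs.
for job_id in job_ids:
    while client.get_job_status(job_id) not in {
        JobStatus.SUCCEEDED,
        JobStatus.STOPPED,
        JobStatus.FAILED,
    }:
        time.sleep(1)
    print(client.get_job_logs(job_id))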

I also modified the container driver to be docker, since podman does not work well for us:

container_driver = "docker"
container_command = [
    container_driver,
    "run",
    "-v",
    self._ray_tmp_dir + ":" + self._ray_tmp_dir,
    # "--cgroup-manager=cgroupfs",  # podman-only flag, removed for docker
    "--privileged",
    "--network=host",
    "--pid=host",
    "--ipc=host",
    # "--env-host",  # podman-only flag, removed for docker
]

I’m sorry that the container feature has been broken for a long time, and I’m not sure it can work for you now. :rofl:
What’s your Ray version?

It is 2.4.0.

Could you give some instructions on how to debug the issue?

If there is a potential fix, which part of the code should we look at?

What happened after docker run image default_worker.py?

Docker does not support --env-host. Is that the root cause, i.e. Ray cannot get the job ID info from env vars?

Yep, it seems to be an issue with env vars. RAY_JOB_ID should be forwarded to the worker process in docker.
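A rough sketch of that forwarding idea (illustrative only, not the actual raylet code; the exact set of env vars the core worker needs is an assumption, based on RAY_JOB_ID and the RAY_RAYLET_PID visible in the log above):

import os

# Hypothetical sketch: docker, unlike podman's --env-host, does not pass the
# host environment through, so each required variable has to be forwarded
# explicitly with --env when the container command is built.
container_command = ["docker", "run", "--network=host", "--pid=host"]
for var in ("RAY_JOB_ID", "RAY_RAYLET_PID"):
    value = os.environ.get(var)
    if value is not None:
        container_command += ["--env", f"{var}={value}"]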

Do you know where RAY_JOB_ID is set? I am trying to figure out whether it is possible to save that info into files and restore it from a file, instead of relying on env vars.

You can search for it in the Ray repo; it is set by the raylet.
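A bare-bones sketch of that file-based idea (purely hypothetical; the helper names and the path are made up, and nothing like this exists in Ray):

import json
import os

# Hypothetical path inside the /tmp/ray volume that both host and container see.
ENV_FILE = "/tmp/ray/worker_env.json"

def save_env_vars(names):
    # Host side: dump the selected env vars into the shared volume.
    env = {name: os.environ[name] for name in names if name in os.environ}
    with open(ENV_FILE, "w") as f:
        json.dump(env, f)

def restore_env_vars():
    # Container side: restore them before the core worker starts.
    with open(ENV_FILE) as f:
        for name, value in json.load(f).items():
            os.environ.setdefault(name, value)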

I tried to run the following example:

job_id = client.submit_job(
    # Entrypoint shell command to execute
    entrypoint=(
        "python3 -c \"import ray; print('ncsoft: ', ray.__version__)\""
    ),
    runtime_env={
        "container": {
            "image": "docker.io/anyscale/ray-ml:nightly-py38-cpu",
            "run_options": [
                "--tmpfs /tmp:rw",
                "--env PATH=/home//ray/anaconda3/bin:$PATH",
                "--cap-drop SYS_ADMIN",
                "--log-level=debug",
                "-v /home/yejingxin/ray_venv/ray/:/home/yejingxin/ray_venv/ray/",
            ],
        },
    },
)

and got the tmp lock file permission error:

time="2023-06-12T07:20:22Z" level=debug msg="Enabling signal proxying"
Traceback (most recent call last):
  File "/home/yejingxin/ray_venv/ray/python/ray/_private/workers/default_worker.py", line 203, in <module>
    node = ray._private.node.Node(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/node.py", line 255, in __init__
    self.metrics_agent_port = self._get_cached_port(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/node.py", line 851, in _get_cached_port
    with FileLock(file_path + ".lock"):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/filelock/_api.py", line 255, in __enter__
    self.acquire()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/filelock/_api.py", line 213, in acquire
    self._acquire()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/filelock/_unix.py", line 37, in _acquire
    fd = os.open(self.lock_file, open_flags, self._context.mode)
PermissionError: [Errno 13] Permission denied: '/tmp/ray/session_2023-06-12_07-20-11_997308_503491/ports_by_node.json.lock'
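One possible explanation (an assumption, not confirmed here): the container user differs from the host user that owns /tmp/ray, so the worker cannot create the .lock file in the bind-mounted session directory. A hedged workaround sketch is to run the container as the host UID/GID via run_options:

import os

# Hypothetical workaround (untested): match the host user so the in-container
# worker can create lock files under the bind-mounted /tmp/ray.
runtime_env = {
    "container": {
        "image": "docker.io/anyscale/ray-ml:nightly-py38-cpu",
        "run_options": [f"--user {os.getuid()}:{os.getgid()}"],
    },
}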