How does "container" in "runtime_env" work?

In this document, there are a lot of descriptions about runtime_env, but I don’t know how it will work if I specify some details about container.

I submit a job using this command:

ray job submit \
--address='http://192.168.0.192:8265' \
--runtime-env-json='{"working_dir":"./","container":{"image": "anyscale/ray-ml:nightly-py38-cpu", "worker_path": "/root/python/ray/workers/default_worker.py", "run_options": ["--cap-drop SYS_ADMIN","--log-level=debug"]}}' \
-- python ./debug.py

This is debug.py:

import ray

ray.init()

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(2)]
print(ray.get(futures))

Surprisingly, I get some feedback, but the process doesn’t seem to stop and I am not getting the result I want. What’s wrong with it? Is there some preparation I haven’t done?

This is the feedback:

Job submission server address: http://192.168.0.192:8265
2022-08-08 10:08:21,309	INFO dashboard_sdk.py:272 -- Uploading package gcs://_ray_pkg_698a6544fb43c3a9.zip.
2022-08-08 10:08:21,310	INFO packaging.py:479 -- Creating a file package for local directory './'.

-------------------------------------------------------
Job 'raysubmit_YVTpnmAzV7ysFmpk' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_YVTpnmAzV7ysFmpk
  Query the status of the job:
    ray job status raysubmit_YVTpnmAzV7ysFmpk
  Request the job to be stopped:
    ray job stop raysubmit_YVTpnmAzV7ysFmpk

Tailing logs until the job exits (disable with --no-wait):

Hey @jie, have you tried ray job logs raysubmit_YVTpnmAzV7ysFmpk to see if there’s any output?

but the process doesn’t seem to stop. 

Does it hang and get stuck at the feedback you provided above?

If you query the status of the job with ray job status, what’s the output there?

A few more questions:

  1. Which version of Ray are you using?
  2. Do you have SSH access to the head node?

Hi @rickyyx, thanks! Yes, it hangs and gets stuck at that feedback.

The output looks like this:

$ ray job logs raysubmit_SpYXzcr6hVkq973i

Job submission server address: None
2022-08-09 10:55:08,772	INFO dashboard_sdk.py:129 -- No address provided, defaulting to http://localhost:8265.


$ ray job status raysubmit_SpYXzcr6hVkq973i

Job submission server address: None
2022-08-09 10:55:36,855	INFO dashboard_sdk.py:129 -- No address provided, defaulting to http://localhost:8265.
Status for job 'raysubmit_SpYXzcr6hVkq973i': PENDING
Status message: Job has not started yet, likely waiting for the runtime_env to be set up.

And my Ray version is 3.0.0.dev0, and I have SSH access to the head node.

Moreover, I got some messages from raylet.err and runtime_env_setup-01000000.log:

$ cat raylet.err

bash: line 0: exec: podman: not found
[2022-08-09 10:26:25,244 E 494944 494944] (raylet) worker_pool.cc:500: Some workers of the worker process(506291) have not registered within the timeout. The process is dead, probably it crashed during start.


$ cat runtime_env_setup-01000000.log

2022-08-09 10:27:25,248 INFO container.py:47 -- start worker in container with prefix: podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=494944 --cap-drop SYS_ADMIN --log-level=debug --entrypoint python anyscale/ray-ml:nightly-py38-cpu
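
That prefix is essentially what the raylet shells out to before appending the worker command, so it can be replayed by hand to check whether the container itself starts. A sketch, with the options copied verbatim from the log above (the raylet-specific env var omitted) and a trivial command standing in for the worker script:

# Replay Ray's generated podman prefix with a placeholder command instead of the worker script.
podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host \
  --cap-drop SYS_ADMIN --log-level=debug \
  --entrypoint python anyscale/ray-ml:nightly-py38-cpu -c "print('container starts')"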

Could you try to set RAY_worker_register_timeout_seconds env var to a larger time (e.g. 300) and see if it works?
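For example, if the head node is started by hand, the variable has to be set in the environment of the ray start process itself; a sketch, assuming a manual ray start on the head node:

# Hypothetical: raise the worker registration timeout before starting the head node.
RAY_worker_register_timeout_seconds=300 ray start --head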

Also, could you check the worker log to see if it actually crashes?
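
The per-worker stdout/stderr files are written under the session log directory on the node that launched the worker; roughly (paths assume the default /tmp/ray session directory):

# Hypothetical paths: list and tail the most recent worker logs on that node.
ls -lt /tmp/ray/session_latest/logs/worker-*.err | head
tail -n 50 /tmp/ray/session_latest/logs/worker-*.err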

@GuyangSong can you help answer this user’s questions about container?

Thanks @jjyao!

Over the past two days, I tried to solve this problem: I installed Podman locally, manually pulled the image before submitting the job, and so on. But there is one new issue that is holding me back at the moment:

time="2022-08-11T09:36:46+08:00" level=warning msg="Error validating CNI config file /home/wangjie/.config/cni/net.d/87-podman.conflist: [netplugin failed with no error mess    age: fork/exec /opt/cni/bin/bridge: exec format error netplugin failed with no error message: fork/exec /opt/cni/bin/portmap: exec format error netplugin failed with no erro    r message: fork/exec /opt/cni/bin/firewall: exec format error netplugin failed with no error message: fork/exec /opt/cni/bin/tuning: exec format error]"
Error: executable file `python` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
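
Since the failure is about python not resolving inside the container, one quick check is to ask Podman what environment and entrypoint the image actually ships. A sketch; the tag here is the image from the original job, which may differ from the local image being used:

# Hypothetical check: inspect the image's PATH (in Config.Env) and entrypoint configuration.
podman image inspect --format '{{.Config.Env}} {{.Config.Entrypoint}}' anyscale/ray-ml:nightly-py38-cpu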

I’m sorry to say that the container support is still experimental and has unfortunately been broken in the latest Ray versions. We will improve this part in the next few months and provide clear documentation for it.

Hi @GuyangSong, any update on this? I just tried to start a local Ray cluster via ray.init with a local image (for automated testing), and ran into the same issue:

Error: executable file python not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found