Exception when running ray up

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It makes completing my task significantly more difficult, but I can work around it.
  • High: It blocks me from completing my task.
    High

I ran "ray up -y default-full.yaml". An exception occurs while it is initializing the command runner [5/7]: Ray is unable to deserialize image_env to a Python object. The image_env is:
Good morning centos

Hostname …: sh-prod-aigame-gpu-1
Release …: CentOS Linux release 7.9.2009 (Core)
Users …: Currently 2 user(s) logged on
===========================================================================
Current user …: centos
CPU usage …: 0.02, 0.05, 0.05 (1, 5, 15 min)
Memory used …: 1809 MB / 32011 MB
Swap in use …: 0 MB
Processes …: 185 running
System uptime …: 4 days 0 hours 38 minutes 16 seconds
Disk space SYS …: remaining
Disk space DATA …: 499G remaining
===========================================================================

["PATH=/home/ray/anaconda3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","CUDA_VERSION=11.0.3","LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64","NVIDIA_VISIBLE_DEVICES=all","NVIDIA_DRIVER_CAPABILITIES=compute,utility","NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451","NCCL_VERSION=2.7.8","LIBRARY_PATH=/usr/local/cuda/lib64/stubs","CUDNN_VERSION=8.0.4.30","TZ=America/Los_Angeles","HOME=/home/ray","LC_ALL=C.UTF-8","LANG=C.UTF-8"]
2023-08-11 09:32:57,103 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.23.1.175', '172.23.0.224']
New status: update-failed
!!!
Expecting value: line 1 column 1 (char 0)
!!!

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/updater.py", line 153, in run
    self.do_update()
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/updater.py", line 445, in do_update
    sync_run_yet=True,
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/command_runner.py", line 781, in run_init
    raise e
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/command_runner.py", line 772, in run_init
    for env_var in json.loads(image_env):
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Failed to setup head node.

I am also getting this error, any update on this?

I am also getting this error, any updates?

Also getting this error. Updates would be extremely helpful!

Note that:

Good morning centos

Hostname …: sh-prod-aigame-gpu-1
Release …: CentOS Linux release 7.9.2009 (Core)
Users …: Currently 2 user(s) logged on
===========================================================================
Current user …: centos
CPU usage …: 0.02, 0.05, 0.05 (1, 5, 15 min)
Memory used …: 1809 MB / 32011 MB
Swap in use …: 0 MB
Processes …: 185 running
System uptime …: 4 days 0 hours 38 minutes 16 seconds
Disk space SYS …: remaining
Disk space DATA …: 499G remaining
===========================================================================

is stated to be part of image_env. For some reason, the inspection of the environment variables is picking up the CentOS login banner. You want image_env to consist exclusively of

["PATH=/home/ray/anaconda3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","CUDA_VERSION=11.0.3","LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64","NVIDIA_VISIBLE_DEVICES=all","NVIDIA_DRIVER_CAPABILITIES=compute,utility","NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451","NCCL_VERSION=2.7.8","LIBRARY_PATH=/usr/local/cuda/lib64/stubs","CUDNN_VERSION=8.0.4.30","TZ=America/Los_Angeles","HOME=/home/ray","LC_ALL=C.UTF-8","LANG=C.UTF-8"]
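
To see why parsing fails, here is a minimal sketch (the strings are stand-ins for the real captured output, not what Ray literally receives):

    import json

    # A clean image_env is a bare JSON array, which json.loads handles fine.
    clean = '["PATH=/home/ray/anaconda3/bin", "HOME=/home/ray"]'
    print(json.loads(clean))

    # Any text prepended to the captured stdout (here, the MOTD banner) makes
    # the first character invalid JSON and raises exactly the error above:
    # json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    polluted = "Good morning centos\n" + clean
    json.loads(polluted)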

I was running into a similar problem that was resolved by adding lines to the initialization_commands of my YAML to ensure that there were no unexpected prefixes on the JSON string.
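
For example, something along these lines (a sketch, not my exact lines; it assumes the banner comes from the login MOTD and/or an /etc/profile.d script):

    initialization_commands:
        # ~/.hushlogin suppresses the MOTD that login/sshd print for this
        # user on most distros.
        - touch ~/.hushlogin
        # If the banner is printed by a profile script instead, take it out of
        # the *.sh glob that /etc/profile sources. "motd.sh" is a placeholder;
        # use whichever script on your host prints the "Good morning" banner.
        - sudo mv /etc/profile.d/motd.sh /etc/profile.d/motd.sh.disabled || true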


I had the same issue. It turned out that my image_env had some error messages from the bash shell prepended to it, which caused the JSON decoder to fail. I resolved it by following this Ask Ubuntu answer: command line - Error message when opening the terminal "-bash: /usr/bin/tclsh: No such file or directory" - Ask Ubuntu
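
A quick way to check for this kind of pollution is to run the same style of shell Ray uses (a login shell over a forced pseudo-TTY) and see whether anything prints besides the command's own output; <host> below is a placeholder for your node:

    # Anything printed here other than "ok" will also be prepended to the
    # output Ray later tries to pass to json.loads.
    ssh -tt <host> 'bash --login -i -c "echo ok"'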

I also ran into this on Ray 2.54 with a Docker cluster created via SSH. As mentioned earlier in the thread, the immediate cause is unexpected output on login (e.g. MOTD/quota scripts in /etc/profile.d). Since Ray tries to parse JSON produced by commands such as docker inspect, this causes ray up to fail.
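
For reference, the JSON in question comes from an invocation along these lines (a sketch; the exact command and image tag vary by Ray version and cluster config):

    # Ray captures this command's stdout over SSH and feeds it to json.loads.
    # A clean run prints a bare JSON array such as ["PATH=...", "HOME=/home/ray"].
    docker inspect -f '{{json .Config.Env}}' rayproject/ray:latest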

It seems to be possible to work around this with ray up ... --use-normal-shells, but the problem also manifests when the Ray autoscaler tries to bring up worker nodes via SSH. That failure shows up deep in monitor.log and took some effort to diagnose. Even more problematic, there isn't a way to tell the autoscaler itself to avoid login shells.

However, it is possible to patch Ray 2.54 from the cluster YAML's setup_commands, like so:

setup_commands:
    # List of shell commands to run to set up each node.
    # By default Ray uses login shells (bash --login -i) and pseudo-TTYs (ssh -tt)
    # when SSHing into nodes. This causes .bash_profile output and terminal
    # progress bars to pollute stdout, breaking Ray's JSON parsing of command
    # output. These patches switch the autoscaler to plain non-interactive shells.
    # Two flags must be patched together: use_login_shells (controls -tt and
    # bash --login) and _allow_interactive (controls whether stdin is a pipe or
    # inherited from the parent). If only use_login_shells is patched, the Popen
    # path tries to close p.stdin which is None, causing SSH readiness checks to
    # never succeed.
    - |
        python3 -c "
        import pathlib;
        import ray.autoscaler._private.command_runner as cr;
        import ray.autoscaler._private.subprocess_output_util as su;
        p = pathlib.Path(cr.__file__);
        p.write_text(p.read_text().replace('\"use_login_shells\": True', '\"use_login_shells\": False'));
        p = pathlib.Path(su.__file__);
        p.write_text(p.read_text().replace('_allow_interactive = True', '_allow_interactive = False'))"
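
After the patched setup command has run, a quick sanity check (a sketch, assuming a pip-installed Ray whose site-packages are writable) confirms both flags were flipped:

    python3 -c "
    import ray.autoscaler._private.command_runner as cr;
    import ray.autoscaler._private.subprocess_output_util as su;
    print('use_login_shells patched:', '\"use_login_shells\": False' in open(cr.__file__).read());
    print('_allow_interactive patched:', '_allow_interactive = False' in open(su.__file__).read())"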