Exception when running ray up

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It makes completing my task significantly more difficult, but I can work around it.
  • High: It blocks me from completing my task.
    High

I ran "ray up -y default-full.yaml". An exception occurs while it is initializing the command runner [5/7]: Ray is unable to deserialize image_env to a Python object. The image_env is:
Good morning centos

Hostname …: sh-prod-aigame-gpu-1
Release …: CentOS Linux release 7.9.2009 (Core)
Users …: Currently 2 user(s) logged on
===========================================================================
Current user …: centos
CPU usage …: 0.02, 0.05, 0.05 (1, 5, 15 min)
Memory used …: 1809 MB / 32011 MB
Swap in use …: 0 MB
Processes …: 185 running
System uptime …: 4 days 0 hours 38 minutes 16 seconds
Disk space SYS …: remaining
Disk space DATA …: 499G remaining
===========================================================================

["PATH=/home/ray/anaconda3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","CUDA_VERSION=11.0.3","LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64","NVIDIA_VISIBLE_DEVICES=all","NVIDIA_DRIVER_CAPABILITIES=compute,utility","NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451","NCCL_VERSION=2.7.8","LIBRARY_PATH=/usr/local/cuda/lib64/stubs","CUDNN_VERSION=8.0.4.30","TZ=America/Los_Angeles","HOME=/home/ray","LC_ALL=C.UTF-8","LANG=C.UTF-8"]
2023-08-11 09:32:57,103 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.23.1.175', '172.23.0.224']
New status: update-failed
!!!
Expecting value: line 1 column 1 (char 0)
!!!

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/updater.py", line 153, in run
    self.do_update()
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/updater.py", line 445, in do_update
    sync_run_yet=True,
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/command_runner.py", line 781, in run_init
    raise e
  File "/usr/local/lib/python3.6/site-packages/ray/autoscaler/_private/command_runner.py", line 772, in run_init
    for env_var in json.loads(image_env):
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Failed to setup head node.

I am also getting this error, any update on this?

I am also getting this error, any updates?

Also getting this error. Updates would be extremely helpful!

Note that:

Good morning centos

Hostname …: sh-prod-aigame-gpu-1
Release …: CentOS Linux release 7.9.2009 (Core)
Users …: Currently 2 user(s) logged on
===========================================================================
Current user …: centos
CPU usage …: 0.02, 0.05, 0.05 (1, 5, 15 min)
Memory used …: 1809 MB / 32011 MB
Swap in use …: 0 MB
Processes …: 185 running
System uptime …: 4 days 0 hours 38 minutes 16 seconds
Disk space SYS …: remaining
Disk space DATA …: 499G remaining
===========================================================================

is stated to be part of image_env. For some reason, the inspection of the environment variables is picking up the CentOS login banner. You want image_env to consist exclusively of

["PATH=/home/ray/anaconda3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","CUDA_VERSION=11.0.3","LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64","NVIDIA_VISIBLE_DEVICES=all","NVIDIA_DRIVER_CAPABILITIES=compute,utility","NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451","NCCL_VERSION=2.7.8","LIBRARY_PATH=/usr/local/cuda/lib64/stubs","CUDNN_VERSION=8.0.4.30","TZ=America/Los_Angeles","HOME=/home/ray","LC_ALL=C.UTF-8","LANG=C.UTF-8"]
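
To see why parsing fails, here is a minimal sketch (the strings are stand-ins for the real captured output, not what Ray literally receives):

    import json

    # A clean image_env is a bare JSON array, which json.loads handles fine.
    clean = '["PATH=/home/ray/anaconda3/bin", "HOME=/home/ray"]'
    print(json.loads(clean))

    # Any text prepended to the captured stdout (here, the MOTD banner) makes
    # the first character invalid JSON and raises exactly the error above:
    # json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    polluted = "Good morning centos\n" + clean
    json.loads(polluted)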

I was running into a similar problem that was resolved by adding lines to the initialization_commands of my YAML to ensure that there were no unexpected prefixes on the JSON string.
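
For example, something along these lines (a sketch, not my exact lines; it assumes the banner comes from the login MOTD and/or an /etc/profile.d script):

    initialization_commands:
        # ~/.hushlogin suppresses the MOTD that login/sshd print for this
        # user on most distros.
        - touch ~/.hushlogin
        # If the banner is printed by a profile script instead, take it out of
        # the *.sh glob that /etc/profile sources. "motd.sh" is a placeholder;
        # use whichever script on your host prints the "Good morning" banner.
        - sudo mv /etc/profile.d/motd.sh /etc/profile.d/motd.sh.disabled || true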


I had the same issue. It turned out that my image_env had some error messages from the bash shell prepended to it, which caused the JSON decoder to fail. I resolved it by following this Ask Ubuntu answer: command line - Error message when opening the terminal "-bash: /usr/bin/tclsh: No such file or directory" - Ask Ubuntu
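
A quick way to check for this kind of pollution is to run the same style of shell Ray uses (a login shell over a forced pseudo-TTY) and see whether anything prints besides the command's own output; <host> below is a placeholder for your node:

    # Anything printed here other than "ok" will also be prepended to the
    # output Ray later tries to pass to json.loads.
    ssh -tt <host> 'bash --login -i -c "echo ok"'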

I also ran into this on Ray 2.54 with a Docker cluster created via SSH. As mentioned earlier in the thread, the immediate cause is unexpected output on login (e.g. MOTD/quota scripts in /etc/profile.d). Since Ray tries to parse JSON produced by commands such as docker inspect, this causes ray up to fail.
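
For reference, the JSON in question comes from an invocation along these lines (a sketch; the exact command and image tag vary by Ray version and cluster config):

    # Ray captures this command's stdout over SSH and feeds it to json.loads.
    # A clean run prints a bare JSON array such as ["PATH=...", "HOME=/home/ray"].
    docker inspect -f '{{json .Config.Env}}' rayproject/ray:latest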

It seems to be possible to work around this with ray up ... --use-normal-shells, but the problem also manifests when the Ray autoscaler tries to bring up worker nodes via SSH. That failure shows up deep in monitor.log and took some effort to diagnose. Even more problematic, there isn't a way to tell the autoscaler itself to avoid login shells.

However, it is possible to patch Ray 2.54 from the cluster YAML's setup_commands, like so:

setup_commands:
    # List of shell commands to run to set up each node.
    # By default Ray uses login shells (bash --login -i) and pseudo-TTYs (ssh -tt)
    # when SSHing into nodes. This causes .bash_profile output and terminal
    # progress bars to pollute stdout, breaking Ray's JSON parsing of command
    # output. These patches switch the autoscaler to plain non-interactive shells.
    # Two flags must be patched together: use_login_shells (controls -tt and
    # bash --login) and _allow_interactive (controls whether stdin is a pipe or
    # inherited from the parent). If only use_login_shells is patched, the Popen
    # path tries to close p.stdin which is None, causing SSH readiness checks to
    # never succeed.
    - |
        python3 -c "
        import pathlib;
        import ray.autoscaler._private.command_runner as cr;
        import ray.autoscaler._private.subprocess_output_util as su;
        p = pathlib.Path(cr.__file__);
        p.write_text(p.read_text().replace('\"use_login_shells\": True', '\"use_login_shells\": False'));
        p = pathlib.Path(su.__file__);
        p.write_text(p.read_text().replace('_allow_interactive = True', '_allow_interactive = False'))"
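
After the patched setup command has run, a quick sanity check (a sketch, assuming a pip-installed Ray whose site-packages are writable) confirms both flags were flipped:

    python3 -c "
    import ray.autoscaler._private.command_runner as cr;
    import ray.autoscaler._private.subprocess_output_util as su;
    print('use_login_shells patched:', '\"use_login_shells\": False' in open(cr.__file__).read());
    print('_allow_interactive patched:', '_allow_interactive = False' in open(su.__file__).read())"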