Unexpected job status

1. Severity of the issue: (select one)
  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.39.0
  • Python version: 3.9
  • OS: Ubuntu
  • Cloud/Infrastructure: K8s
  • Other libs/tools (if relevant): null

I run a Ray 2.39.0 cluster on K8s and submit a job with ray job submit -- python job.py.

job.py throws an exception, but the Ray job status is SUCCEEDED.
I simplified job.py down to a bare raise xxx, and even tried ray job submit -- exit 1; neither made any difference.

I found that job_supervisor.py runs the command with subprocess, and subprocess reports a return code of 0, so the Ray job is marked as succeeded. But when I run subprocess.run("exit 1", shell=True) in a simple Python script directly on the head node, the return code is 1.
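
For reference, a minimal sketch of that head-node check (plain subprocess, outside of Ray; the non-zero exit code propagates as expected):

    import subprocess

    # Run a command that exits non-zero through a shell, mirroring how the
    # job entrypoint is launched.
    result = subprocess.run(
        "exit 1",
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    print(result.returncode)  # prints 1 when run as a standalone script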

What is happening in Ray, and what should I do to make the job status correct?

I added the following code at the end of JobSupervisor.__init__:

    return_cmd = subprocess.run(
        self._entrypoint,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        executable="/bin/bash",
    )
    self._logger.info(return_cmd)
    if return_cmd.returncode == 0:
        # Marker for the zero-exit-code path.
        self._logger.info("!!!!!!!!")
        ret_val = return_cmd.stdout
    else:
        # Marker for the non-zero-exit-code path.
        self._logger.info("###########")

The log in worker-xxx.err is (note returncode=0):

    2025-04-01 11:21:10,786 INFO job_supervisor.py:139 -- CompletedProcess(args='python image_recognition_single_bak.py', returncode=0, stdout=b'', stderr=b'2025-04-01 11:21:10,123\tINFO worker.py:1494 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS\n2025-04-01 11:21:10,123\tINFO worker.py:1634 -- Connecting to existing Ray cluster at address: 10.7.8.75:6379...\n2025-04-01 11:21:10,130\tINFO worker.py:1810 -- Connected to Ray cluster. View the dashboard at \x1b[1m\x1b[32m10.7.8.75:8265 \x1b[39m\x1b[22m\n[2025-04-01 11:21:10,137 I 334426 334426] logging.cc:293: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1\n') job_id=02000000 worker_id=db1839d26b8be781bfb40444e5937c9c8b18f94c6e17a7490ce0998f node_id=d3bf92cf84d0278ffada4c21fed52e68e7fced5a7e5ecb08b496dcc7 actor_id=f3f56488d615bd2d398ad98b02000000 task_id=fffffffffffffffff3f56488d615bd2d398ad98b02000000
    2025-04-01 11:21:10,787 INFO job_supervisor.py:141 -- !!!!!!!! job_id=02000000 worker_id=db1839d26b8be781bfb40444e5937c9c8b18f94c6e17a7490ce0998f node_id=d3bf92cf84d0278ffada4c21fed52e68e7fced5a7e5ecb08b496dcc7 actor_id=f3f56488d615bd2d398ad98b02000000 task_id=fffffffffffffffff3f56488d615bd2d398ad98b02000000

Then I put the same subprocess.run code into a new standalone Python file and ran it; the result is:

    CompletedProcess(args='python image_recognition_single_bak.py', returncode=1, stdout='', stderr='2025-04-01 11:49:59,368\tINFO worker.py:1494 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS\n2025-04-01 11:49:59,368\tINFO worker.py:1634 -- Connecting to existing Ray cluster at address: 10.7.8.75:6379...\n2025-04-01 11:49:59,377\tINFO worker.py:1810 -- Connected to Ray cluster. View the dashboard at \x1b[1m\x1b[32m10.7.8.75:8265 \x1b[39m\x1b[22m\n[2025-04-01 11:49:59,383 I 342378 342378] logging.cc:293: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1\n')
    ###########
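
For completeness, a sketch of what that standalone file looks like. The entrypoint string is hard-coded here, since self._entrypoint only exists inside the supervisor, and text=True is an assumption based on stdout/stderr showing up as str rather than bytes in the output above:

    import subprocess

    # Same invocation as the JobSupervisor snippet, but run outside of Ray.
    return_cmd = subprocess.run(
        "python image_recognition_single_bak.py",
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        executable="/bin/bash",
        text=True,  # assumed, since stdout/stderr above are str, not bytes
    )
    print(return_cmd)
    if return_cmd.returncode == 0:
        print("!!!!!!!!")
    else:
        print("###########")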

The content of image_recognition_single_bak.py is simple:

    import sys

    sys.exit(1)
    # or: raise xxx
    # or: import not_exist_module
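
For the record, the wrong status can also be observed through the Python job-submission SDK, not just the CLI. A minimal sketch, assuming the dashboard address from the logs above (10.7.8.75:8265); JobSubmissionClient and get_job_status are the standard Ray job SDK calls:

    import time

    from ray.job_submission import JobStatus, JobSubmissionClient

    # Dashboard address taken from the log output above.
    client = JobSubmissionClient("http://10.7.8.75:8265")
    job_id = client.submit_job(entrypoint="python image_recognition_single_bak.py")

    # Poll until the job reaches a terminal state.
    while True:
        status = client.get_job_status(job_id)
        if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
            break
        time.sleep(1)

    # The script exits via sys.exit(1), so FAILED is expected here,
    # but the reported status is SUCCEEDED.
    print(status)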