How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I want to use Ray to submit containerized jobs to a Kubernetes cluster. Scheduling non-containerized jobs works fine. However, as soon as I submit a containerized job, it stays in the PENDING state forever. The command below is accepted successfully, but the job never starts:
ray job submit --address http://localhost:8265 --runtime-env-json='{"container": {"image": "<my-cuda-docker-image>", "worker_path": "/root"}}' -- nvidia-smi
Job submission server address: http://localhost:8265
-------------------------------------------------------
Job 'raysubmit_KKgyZumhXYm1y3ng' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_KKgyZumhXYm1y3ng
Query the status of the job:
ray job status raysubmit_KKgyZumhXYm1y3ng
Request the job to be stopped:
ray job stop raysubmit_KKgyZumhXYm1y3ng
Tailing logs until the job exits (disable with --no-wait)
Checking the job status confirms that the job is still pending:
ray job status raysubmit_KKgyZumhXYm1y3ng --address http://localhost:8265
Status for job 'raysubmit_KKgyZumhXYm1y3ng': PENDING
Status message: Job has not started yet. It may be waiting for the runtime environment to be set up.
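The same status check can be scripted with the Ray Jobs Python SDK so that it gives up after a deadline instead of waiting indefinitely. This is only a minimal sketch, assuming `ray[default]` is installed on the client machine and the dashboard is reachable at the address used above; `wait_until_started`, `check_cluster_job`, and their parameters are hypothetical helper names, not part of Ray.

```python
import time
from typing import Callable


def wait_until_started(get_status: Callable[[], str],
                       timeout_s: float = 300.0,
                       poll_s: float = 5.0) -> str:
    """Poll get_status() until the job leaves PENDING or timeout_s elapses.

    Returns the last status seen as a string, e.g. "PENDING" or "RUNNING".
    """
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == "PENDING" and time.monotonic() < deadline:
        time.sleep(poll_s)
        status = get_status()
    return status


def check_cluster_job(address: str, job_id: str) -> str:
    """Hypothetical wrapper that wires the helper to a real Ray cluster."""
    # Imported lazily so the polling helper can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    # JobStatus is a string-valued enum; .value yields "PENDING", "RUNNING", ...
    last = wait_until_started(lambda: client.get_job_status(job_id).value)
    # The job info carries a human-readable message alongside the status.
    print(f"Status: {last}; message: {client.get_job_info(job_id).message}")
    return last
```

Calling `check_cluster_job("http://localhost:8265", "raysubmit_KKgyZumhXYm1y3ng")` would then report whether the job ever left PENDING within the timeout.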
Stopping the submitted job does not work either, and the stuck job effectively breaks the Ray cluster for me:
ray job stop raysubmit_KKgyZumhXYm1y3ng --address http://localhost:8265
Job submission server address: http://localhost:8265
Attempting to stop job 'raysubmit_KKgyZumhXYm1y3ng'
Waiting for job 'raysubmit_KKgyZumhXYm1y3ng' to exit (disable with --no-wait):
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
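For anyone trying to reproduce this from a script, the stop request can be issued with a bounded wait so that the client does not loop forever as above. This is a rough sketch under the same assumptions (Ray Jobs Python SDK available, dashboard at `http://localhost:8265`); `stop_and_wait` and `stop_cluster_job` are hypothetical helper names, and the sketch only detects the stuck state, it does not fix it.

```python
import time
from typing import Callable

# Job states after which a Ray job will not change state again.
TERMINAL_STATES = {"STOPPED", "SUCCEEDED", "FAILED"}


def stop_and_wait(request_stop: Callable[[], object],
                  get_status: Callable[[], str],
                  timeout_s: float = 60.0,
                  poll_s: float = 2.0) -> bool:
    """Request a stop, then poll until a terminal state or the timeout.

    Returns True if the job reached a terminal state in time, else False.
    """
    request_stop()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() in TERMINAL_STATES:
            return True
        time.sleep(poll_s)
    # One final check after the deadline has passed.
    return get_status() in TERMINAL_STATES


def stop_cluster_job(address: str, job_id: str) -> bool:
    """Hypothetical wrapper that wires the helper to a real Ray cluster."""
    # Imported lazily so the helper above can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return stop_and_wait(
        lambda: client.stop_job(job_id),
        lambda: client.get_job_status(job_id).value,
    )
```

A `False` result from `stop_cluster_job("http://localhost:8265", "raysubmit_KKgyZumhXYm1y3ng")` would correspond to the behavior shown in the log above, where the job stays PENDING after the stop request.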
Has anyone experienced this? Could anyone help me with this issue?