Error while starting first job after cluster creation

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

After creating the cluster, when I start the first job, I cannot find its logs in the dashboard's Logs section under the job_submission_id, and the job itself does not start. I am attaching some screenshots that may help analyze the issue.


For that specific actor:

I am using a cluster.yaml file with a Docker image to create the cluster. I could not reproduce this error again. It happens only for the first job; from the second job onwards, all jobs run perfectly.

When you started the cluster again, were you able to see this? I wonder if it is some really rare race condition. cc @architkulkarni for input

No. When I restarted the cluster, it worked fine. But during testing, I ran into the same issue a second time.

Sorry you’re running into this. I haven’t seen this particular error before. It might be related to a race condition that was fixed recently. You could try the Ray nightly builds, and if the error happens again, it would be great if you could post these details in an issue on the Ray GitHub.

Out of curiosity, are you submitting the jobs via the SDK, CLI, or REST API? And are you submitting them with the optional parameter submission_id, or without it?
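For reference, here is a minimal sketch of what an SDK submission with an explicit submission_id could look like. It relies on `JobSubmissionClient.submit_job`, `get_job_status`, and `get_job_logs` from `ray.job_submission`; the helper names, the ID prefix, the dashboard address `http://127.0.0.1:8265`, and the entrypoint `python my_script.py` are assumptions for illustration, not details from this thread.

```python
import time
import uuid

# Job statuses after which the job will not change state any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def submit_with_explicit_id(client, entrypoint, prefix="first-job"):
    """Submit a job under a caller-chosen submission_id and return that ID.

    `prefix` is a hypothetical naming scheme; any unique string should do.
    """
    submission_id = f"{prefix}-{uuid.uuid4().hex[:8]}"
    return client.submit_job(entrypoint=entrypoint, submission_id=submission_id)


def wait_until_terminal(client, submission_id, timeout_s=300.0, poll_s=2.0):
    """Poll the job status until it is terminal or the timeout expires.

    Returns the last observed status as a plain string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(submission_id)
        # The real client returns an enum; a plain string is accepted too.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES or time.monotonic() >= deadline:
            return name
        time.sleep(poll_s)


def run_against_cluster(address="http://127.0.0.1:8265"):
    """Example wiring against a live cluster (address is an assumption)."""
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    job_id = submit_with_explicit_id(client, "python my_script.py")
    print(job_id, wait_until_terminal(client, job_id))
    print(client.get_job_logs(job_id))
```

Knowing the ID up front makes it easier to search the dashboard and the log files for that exact string when the first job misbehaves.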

Lastly, if you could share the contents of dashboard_agent.log from the head node when this happens, that would be helpful.
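One way to keep that file around is to copy it out of the session log directory before the container is restarted. The sketch below assumes Ray's default log location, `/tmp/ray/session_latest/logs`, on the head node; the function name and the destination directory are hypothetical, and the destination would need to be somewhere that outlives the container, such as a mounted volume.

```python
import shutil
from pathlib import Path


def collect_ray_logs(
    dest_dir,
    log_dir="/tmp/ray/session_latest/logs",  # Ray's default; adjust if changed
    patterns=("dashboard_agent.log*", "dashboard.log*"),
):
    """Copy dashboard-related log files into dest_dir.

    Returns the names of the files that were copied. A missing log_dir
    simply yields an empty list.
    """
    src = Path(log_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            if path.is_file():
                shutil.copy2(path, dest / path.name)
                copied.append(path.name)
    return copied
```

Running this on the head node right after the failed submission, for example `collect_ray_logs("/mnt/persist/ray-logs")` with a path of your choosing, should leave a copy of dashboard_agent.log that can be attached to a GitHub issue.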

We have already deployed our pipeline to production and are now testing new features, so at this stage we can’t change the Ray version, even for testing. We are using Ray v2.3.0.

I am submitting the jobs using the SDK, without the submission_id parameter.

I have since restarted my cluster, and because I am using a Docker container, the logs are gone. Next time, I will make sure to collect the logs from the file you mentioned.