How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
I’d like to access metadata attached to a Ray job from the machine or pod (in k8s, for example) that submitted it. Currently I’m using this example to submit my model training job:
from ray.job_submission import JobSubmissionClient

# If using a remote cluster, replace 127.0.0.1 with the head node's IP address.
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    # Entrypoint shell command to execute
    entrypoint="python driver.py",
    # Path to the local directory that contains the driver.py file
    runtime_env={"working_dir": "./"},
)
print(job_id)
However, all the information related to the job lives inside driver.py, which runs on the Ray head node. My use case is to get the metadata associated with the job from the machine that runs this submit_job call.
I would love it if Ray allowed users to update a job's metadata and exposed that to the Python client.
@Y_C If I understand you correctly, you want some worker-related metadata (as part of job_id). What else do you need returned to you from the machine the job is submitted to?
Does driver.py have tasks that will be scheduled on remote workers, and do you want any information back from them? If so, you can have each task return a dict of the related info. Alternatively, each scheduled task can put its metadata into the object store, and driver.py can then fetch it all as aggregated metadata from each task.
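For illustration, here is a minimal sketch of the first suggestion (each task returns a dict and driver.py aggregates them). The names train_shard, aggregate, and run_driver and the placeholder loss value are hypothetical, not taken from the original driver.py; only ray.init, ray.remote, and ray.get are Ray calls.

```python
from typing import Any


def train_shard(shard_id: int) -> dict[str, Any]:
    """Hypothetical task body: do some work and return its metadata as a dict."""
    loss = 1.0 / (shard_id + 1)  # placeholder for a real training metric
    return {"shard_id": shard_id, "loss": loss}


def aggregate(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Combine the per-task dicts into one summary for the driver to report."""
    return {
        "num_tasks": len(results),
        "mean_loss": sum(r["loss"] for r in results) / len(results),
        "per_task": results,
    }


def run_driver(num_shards: int = 4) -> dict[str, Any]:
    """Would be called at the bottom of driver.py (assumed layout)."""
    import ray  # imported here so the helpers above work without Ray installed

    ray.init()
    # Equivalent to decorating train_shard with @ray.remote.
    remote_train = ray.remote(train_shard)
    refs = [remote_train.remote(i) for i in range(num_shards)]
    # ray.get blocks until every task has returned its dict.
    summary = aggregate(ray.get(refs))
    print(summary)
    return summary
```

Note that this only gathers the metadata inside driver.py on the head node; getting it back to the submitting machine still needs a separate channel.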
If you want to update the metadata from inside the entrypoint script, we don’t currently have an API for that, but if this would be useful, it would be great if you could file an enhancement request on the Ray GitHub!
Our use case is to collect the metadata generated during training, such as user-defined metrics and tracebacks for errors, on the pod/machine where the job is submitted. In our case that is another pod on the same k8s cluster where the Ray cluster is deployed. @Jules_Damji The driver code runs on the Ray head node, not on the machine/pod where I ran the client.submit_job script shown above. Attaching metadata doesn’t work for me because I need to be able to update the metadata from inside the entrypoint script.
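One possible workaround until such an API exists, sketched under assumptions: driver.py prints each metadata record as a JSON line behind a marker, and the submitting pod reads the job logs and parses those lines. The marker JOB_METADATA: and the helpers parse_metadata and wait_and_collect are hypothetical conventions, not part of Ray; get_job_status and get_job_logs are existing JobSubmissionClient methods.

```python
import json
import time
from typing import Any

# Hypothetical marker, not a Ray feature. driver.py would emit lines such as:
#   print(METADATA_PREFIX + json.dumps({"epoch": 3, "val_loss": 0.12}), flush=True)
METADATA_PREFIX = "JOB_METADATA:"


def parse_metadata(logs: str) -> list[dict[str, Any]]:
    """Extract every JSON payload printed after METADATA_PREFIX."""
    records = []
    for line in logs.splitlines():
        # partition tolerates any prefix that precedes the marker on the line.
        _, found, payload = line.partition(METADATA_PREFIX)
        if not found:
            continue
        try:
            records.append(json.loads(payload))
        except json.JSONDecodeError:
            pass  # skip truncated or non-JSON lines
    return records


def wait_and_collect(
    address: str, job_id: str, poll_s: float = 5.0
) -> tuple[str, list[dict[str, Any]]]:
    """Poll until the job reaches a terminal state, then parse its logs."""
    from ray.job_submission import JobStatus, JobSubmissionClient

    client = JobSubmissionClient(address)
    terminal = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
    while client.get_job_status(job_id) not in terminal:
        time.sleep(poll_s)
    status = client.get_job_status(job_id)
    return str(status), parse_metadata(client.get_job_logs(job_id))
```

On the submitting pod this would be called as wait_and_collect("http://127.0.0.1:8265", job_id) after client.submit_job. For errors, driver.py could catch the exception and emit traceback.format_exc() as one field of a record before re-raising.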