There are currently three ways to parallelize work in Ray: creating an Actor, which can be stateful; creating a Task; and using the Job API.
Ray's JobSubmissionClient returns a job_id from submit_job. The job ID can then be used to check which state the job is in (PENDING, RUNNING, SUCCEEDED, FAILED, or STOPPED) via JobSubmissionClient.get_job_status(job_id),
whereas with an Actor or Task we create an object, and calling .remote on it returns an object reference; ray.get(object_refs) only waits for the final output. There is no way to check the current state of a Task or Actor; the only states known are pending and finished.
With JobSubmissionClient I am having trouble setting resource constraints, whereas for an Actor or Task there is the straightforward @ray.remote(num_cpus=C, num_gpus=G) decorator.
What is the correct way to achieve the same with JobSubmissionClient? I tried job_client.submit_job(entrypoint=command, metadata={'num_cpus': '8', 'num_gpus': '1'}),
but it didn’t work, and I couldn’t find anything in the docs or examples.
What is the right way to go from here? I have a script that I run via subprocess in a resource-constrained manner on the remote machines, but I also want to check which state my current process is in.
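For reference, the job-status check described above can be sketched as a small polling loop. This is only an illustration: wait_for_terminal is a hypothetical helper, not part of Ray, and it assumes the terminal states are named SUCCEEDED, FAILED and STOPPED. It takes any zero-argument callable, so it can be tried without a cluster.

```python
import time

# Assumed terminal state names; adjust if your Ray version uses different ones.
TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED", "STOPPED"})


def wait_for_terminal(get_status, poll_seconds=1.0, timeout_seconds=60.0):
    """Poll get_status() until a terminal state is seen; return the state history.

    get_status is any zero-argument callable. With Ray it might be (untested
    here, and the address is an assumption):
        from ray.job_submission import JobSubmissionClient
        client = JobSubmissionClient("http://127.0.0.1:8265")
        job_id = client.submit_job(entrypoint="python my_script.py")
        wait_for_terminal(lambda: client.get_job_status(job_id))
    """
    history = []
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        # Accept plain strings as well as enum members that carry a .value.
        name = str(getattr(status, "value", status))
        # Record a state only when it differs from the previous one.
        if not history or history[-1] != name:
            history.append(name)
        if name in TERMINAL_STATES:
            return history
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {name} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

The returned list records each distinct state in the order it was observed, which is handy for logging how long a job sat in PENDING before it started.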
Hey @rajexp, thanks for joining the community and providing your feedback on Ray!
Before I proceed with a proposal, please bear with me for a few more questions:
Q1: Would you elaborate a bit more on what states you care about here?
Q2: So you have two (or more) machines (A, B, …), and you have a script that tries to kick off multiple routines in a resource-constrained manner on the other machines (B, C, …)? How does the subprocess part come into the picture?
I want to know the status of the process for my Actor/Task, the same way JobSubmissionClient.get_job_status provides it for a job: PENDING, RUNNING, SUCCEEDED, FAILED.
Currently my task consists of multiple subprocess calls, subprocess.run([some command]), which are sent out to different machines in the cluster for parallel processing. When I create a job via JobSubmissionClient.submit_job, I am not able to figure out how to constrain the number of CPUs (num_cpus) and GPUs (num_gpus) for my job.
Jobs are not at the same level as tasks and actors. A job is not a way to implement parallelism; rather, it lets you manage and run your application in a formal way. Inside a job, you still need to use tasks and actors to do the distributed work. Currently we don’t support job-level resource isolation (i.e. you cannot limit how many resources can be used at the job level). You can still specify resource requirements for the tasks and actors inside the job, though.
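To make that concrete, here is a rough sketch of how an entrypoint script could run a command inside a task and track its state itself. TaskStatus, StatusBoard and run_command are hypothetical names, not Ray API. The bookkeeping is plain Python so it can be tried locally; the comments at the bottom show how it might be wrapped with ray.remote so that the resource requirements sit on the task (that part is an untested assumption).

```python
import subprocess
import sys
from enum import Enum


class TaskStatus(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class StatusBoard:
    """Keeps the latest known status per task id (hypothetical helper)."""

    def __init__(self):
        self._status = {}

    def update(self, task_id, status):
        self._status[task_id] = TaskStatus(status)

    def get(self, task_id):
        # A task that has not reported anything yet is treated as PENDING.
        return self._status.get(task_id, TaskStatus.PENDING)


def run_command(task_id, cmd, report):
    """Run cmd in a subprocess, reporting state changes via report(task_id, status)."""
    report(task_id, TaskStatus.RUNNING)
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = TaskStatus.SUCCEEDED if result.returncode == 0 else TaskStatus.FAILED
    report(task_id, status)
    return result.returncode


# Assumed Ray wiring (not executed here; 8 CPUs and 1 GPU are example values):
#   import ray
#   board = ray.remote(StatusBoard).remote()   # the board becomes an actor
#   remote_run = ray.remote(num_cpus=8, num_gpus=1)(run_command)
#   ref = remote_run.remote("t1", cmd, lambda tid, s: board.update.remote(tid, s))
#   ray.get(board.get.remote("t1"))            # PENDING until the task starts

if __name__ == "__main__":
    # Local demonstration without Ray.
    board = StatusBoard()
    run_command("demo", [sys.executable, "-c", "print('hello')"], board.update)
    print(board.get("demo").value)
```

With this layout the submitted job stays a thin wrapper, and both the resource limits and the PENDING/RUNNING/SUCCEEDED/FAILED tracking live at the task level.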