Actor not being created randomly due to missing resource

shiranbi · July 24, 2022, 2:01pm

High: It blocks me from completing my task.

My use case in this test is as follows:
1 head node with no resource definition
1 worker node with one custom resource defined: name 0-0, starting value 9990000
I use this resource to place actors on a node with specific hardware. I know there is a new API for placing an actor on a specific node ID, but I haven’t figured out how to assign a node ID to a node at ray start, so I haven’t moved to it yet. If someone could tell me how to do that, it would solve my issue as well.

I am running 4 actors on the worker node.
Most of the time everything is fine, but every so often between 1 and 4 of the actors are not created.

I am running the two nodes on a manually set up Kubernetes cluster.
I turned on RAY_LOG_TO_STDERR to get the logs.

When the problem occurs, the head node logs the following error:
“cannot be scheduled right now. It requires {0-0: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.”

But I set the resource to a very large number, so I can’t understand why the cluster would be unable to provide it.

I have the head log and worker log from a bad run and from a good run.

Here is a link to the 4 log files: one head/worker pair from a bad run (not all actors were created) and one pair from a good run (my test runs normally).
If there is a better way to upload logs, let me know.

Thanks
Shiran

Chen_Shen · July 24, 2022, 9:32pm

Hi @shiranbi

It sounds like NodeAffinitySchedulingStrategy is the perfect fit for your use case. The Scheduling page of the Ray documentation (Ray 3.0.0.dev0) describes it and includes examples of how to use it. Would this solve your problem?

shiranbi · July 25, 2022, 4:03am

Hi @Chen_Shen
I would gladly move to node affinity scheduling if I could understand how to assign a node ID to a node when I run ray start, or any other way to get a mapping between the Ray node that I start and the node ID it is given.
The example doesn’t show this.

Chen_Shen · July 25, 2022, 5:14am

Hi @shiranbi,
You can start the node with --node-name=$your-preferred-node-name; by default the node name is the IP address.
Then you can use the following script to find the node ID:

import ray

# `name` is the value that was passed to --node-name at `ray start`
for node in ray.nodes():
    if node['NodeName'] == name:
        node_id = node['NodeID']
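
Building on the lookup above, here is a minimal sketch of how the node ID might then be passed to NodeAffinitySchedulingStrategy to pin an actor to that node. The helper names `find_node_id` and `create_pinned_actor` are hypothetical, as is the extra check on the `Alive` field; `actor_cls` is assumed to be a class already decorated with `@ray.remote`, and Ray is assumed to be initialized and connected to the cluster.

```python
def find_node_id(nodes, node_name):
    """Return the NodeID of the alive node named `node_name`, or None.

    `nodes` is a list of dicts shaped like the output of ray.nodes().
    """
    for node in nodes:
        if node.get("NodeName") == node_name and node.get("Alive", True):
            return node["NodeID"]
    return None


def create_pinned_actor(actor_cls, node_name):
    """Create `actor_cls` (a @ray.remote class) on the node named `node_name`."""
    # Imported lazily so find_node_id can be used without Ray installed.
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    node_id = find_node_id(ray.nodes(), node_name)
    if node_id is None:
        raise RuntimeError(f"no alive node named {node_name!r}")
    # soft=False: fail to schedule rather than fall back to another node.
    strategy = NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
    return actor_cls.options(scheduling_strategy=strategy).remote()
```

With something like this, the custom 0-0 resource would no longer be needed for placement, since the actor is tied to the node by ID instead.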

shiranbi · July 25, 2022, 6:44am

Hi @Chen_Shen
Thanks, I have implemented this and will now see whether the issue really came from a resource bug.
I was originally hesitant to use the ray.nodes API since the documentation says it is for debugging only.

Chen_Shen · July 26, 2022, 5:33am

@shiranbi
That’s fair. Let us know how it works. It’s clear there is a gap in Ray’s API for node affinity scheduling. cc @jjyao @sangcho

shiranbi · August 28, 2022, 10:28am

@Chen_Shen
For now it is working fine.
I have an issue where ray.nodes sometimes doesn’t return all nodes immediately (sleeping and trying again works), but my guess is that this comes from how I create my cluster and not from ray.nodes, so I opened another question about that: Setup of kubernetes probes.

Should I open a feature request to get a proper API for mapping a node to its node ID?
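
The sleep-and-retry workaround for ray.nodes mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the function name `wait_for_node_id`, the timeout and poll interval defaults, and the check on the `Alive` field are all hypothetical choices, and `list_nodes` is any callable that returns node dicts in the shape of ray.nodes().

```python
import time


def wait_for_node_id(list_nodes, node_name, timeout_s=30.0, poll_s=1.0):
    """Poll `list_nodes()` until an alive node named `node_name` appears.

    Returns the node's NodeID, or raises TimeoutError after `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for node in list_nodes():
            if node.get("NodeName") == node_name and node.get("Alive", True):
                return node["NodeID"]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"node {node_name!r} did not appear in {timeout_s}s")
        time.sleep(poll_s)


# Example usage, assuming ray.init() has already connected to the cluster:
#   node_id = wait_for_node_id(ray.nodes, "worker-0")
```

Passing the listing function in as an argument keeps the retry logic independent of Ray, so it can be exercised with a stub.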

jjyao · August 30, 2022, 4:48pm

@shiranbi,

Yes, could you file a feature request as a GitHub issue? Thanks!
