Actors randomly fail to be created due to a missing resource

  • High: It blocks me from completing my task.

My setup in this test is as follows:
1 head node with no resource definition
1 worker node with one custom resource defined: name 0-0, starting value 9990000
I use this resource to place actors on a node with specific hardware. I know there is a newer API for placing an actor on a specific node ID, but I haven’t figured out how to assign a node ID to a node at ray start, so I haven’t moved to it yet. If someone could tell me how to do that, it would solve my issue as well.

I am running 4 actors on the worker node.
Most of the time everything is fine, but every so often between 1 and 4 of the actors are not created.
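
For readers following along, the setup described above can be sketched roughly as follows. This is only an assumed reconstruction, not the poster's actual code: it supposes the worker was started with something like `ray start --address=<head-address> --resources='{"0-0": 9990000}'`, and the `Worker` actor class and both helper names are hypothetical.

```python
def custom_resource_options(resource_name: str, amount: float = 1.0) -> dict:
    """Build the keyword arguments for Actor.options() that request a custom resource."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"resources": {resource_name: amount}}


def start_actors(count: int = 4, resource_name: str = "0-0"):
    """Create `count` actors that each require one unit of `resource_name`."""
    import ray  # imported lazily so the helper above works without a cluster

    @ray.remote
    class Worker:  # hypothetical actor, stands in for the real one
        def ping(self) -> str:
            return "ok"

    opts = custom_resource_options(resource_name)
    # Each actor can only be scheduled on a node that advertises the resource.
    return [Worker.options(**opts).remote() for _ in range(count)]
```

Each actor requests `{"0-0": 1}`, which matches the requirement shown in the scheduler error quoted below.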

I am running the two nodes on a manually set up Kubernetes cluster.
I turned on RAY_LOG_TO_STDERR to get the logs.

When the problem occurs, I see the following error on the head node:
“cannot be scheduled right now. It requires {0-0: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.”

But I set the resource to a very large number, so I can’t understand why that would happen.

I have the head log and worker log from a bad run and a good run.

Here is a link to 4 log files,

There is one head log and one worker log per run. In the bad run not all actors are created; in the good run my test runs normally.
If there is a better way to upload logs, let me know.

Thanks
Shiran

hi @shiranbi

Sounds like NodeAffinitySchedulingStrategy is the perfect fit for your use case. The "Scheduling" page of the Ray documentation (Ray 3.0.0.dev0) describes it and has examples of how to use it. Would this solve your problem?

Hi @Chen_Shen
I would gladly move to node affinity scheduling if I could understand how to assign a node ID to a node when I run ray start, or find any other way to map the Ray node that I start to the node ID it is given.
The example doesn’t show this.

hi @shiranbi,
You can start the node with --node-name=$your-preferred-node-name; by default the node name is the node's IP address.
Then you can use the following script to find the node_id:

import ray

# name is the value that was passed to --node-name
for node in ray.nodes():
    if node['NodeName'] == name:
        node_id = node['NodeID']
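
To connect the lookup to the scheduling strategy, here is a slightly fuller sketch. It assumes the entries returned by `ray.nodes()` carry the `NodeName`, `NodeID`, and `Alive` keys; the helper names `find_node_id` and `create_actor_on` are made up for illustration and are not part of Ray.

```python
from typing import Iterable, Mapping, Optional


def find_node_id(nodes: Iterable[Mapping], name: str) -> Optional[str]:
    """Return the NodeID of the live node whose NodeName equals `name`, else None."""
    for node in nodes:
        # Skip dead entries: the node table can still list nodes that have left.
        if node.get("Alive", True) and node.get("NodeName") == name:
            return node["NodeID"]
    return None


def create_actor_on(actor_cls, node_name: str):
    """Create `actor_cls` (a @ray.remote class) on the node named `node_name`."""
    import ray  # imported lazily so find_node_id can be used without a cluster
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    node_id = find_node_id(ray.nodes(), node_name)
    if node_id is None:
        raise RuntimeError(f"no live node named {node_name!r}")
    # soft=False: fail instead of falling back to a different node.
    strategy = NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
    return actor_cls.options(scheduling_strategy=strategy).remote()
```

With `soft=False` the actor is not placed elsewhere if the target node is unavailable, which is probably what you want when the node has specific hardware.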

Hi @Chen_Shen
Thanks, I implemented it and will now see whether the issue was indeed caused by a resource bug.
I was originally hesitant to use the ray.nodes API since the documentation says it is for debugging only.

@shiranbi
That’s fair. Let us know how it works. It’s clear there is a gap in Ray’s API for node affinity scheduling. cc @jjyao @sangcho

@Chen_Shen
For now it is working fine.
I have an issue where ray.nodes sometimes doesn’t return all nodes immediately (sleeping and trying again works), but my guess is that this comes from how I create my cluster rather than from ray.nodes, so I opened another question about that: Setup kubernetes probes

Should I open a feature request to get a proper API for this?
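
The sleep-and-retry workaround mentioned above could look roughly like this. It is a minimal sketch, not an official Ray pattern: `wait_for_nodes` is a hypothetical helper, and it takes the listing function as an argument (in practice `ray.nodes`) so it can be exercised without a cluster.

```python
import time
from typing import Callable, Dict, Iterable, List, Mapping


def wait_for_nodes(
    list_nodes: Callable[[], List[Mapping]],
    expected_names: Iterable[str],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> Dict[str, str]:
    """Poll until every expected node name is alive; return a name -> NodeID map."""
    wanted = set(expected_names)
    deadline = time.monotonic() + timeout_s
    while True:
        found = {
            n["NodeName"]: n["NodeID"]
            for n in list_nodes()
            if n.get("Alive", True) and n.get("NodeName") in wanted
        }
        if wanted <= found.keys():
            return found
        if time.monotonic() >= deadline:
            missing = sorted(wanted - found.keys())
            raise TimeoutError(f"nodes not registered in time: {missing}")
        time.sleep(poll_s)
```

Usage would be something like `wait_for_nodes(ray.nodes, ["my-worker"])` after `ray.init()`, where `my-worker` is whatever was passed to --node-name.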

@shiranbi,

Yes, could you file a feature request as a GitHub issue? Thanks!