So, I am using a cluster.yaml file for cluster creation, and all the servers I am using as nodes are on-premise.
Now, some of the servers have a CUDA GPU, some have an i5 processor, and some have an i7 processor. So I am defining an environment variable for job capacity based on the kind of CPU or GPU a node has.
Now, to define the environment variables at the node level, I have written a shell script on each worker node that starts the worker with the defined variables, and I am calling this script in the worker node setup section of the cluster.yaml file.
Now, one of my requirements is that I want to read, on the head node, those environment variables that are defined on the worker nodes.
So each worker is started with different env vars and you want to know the env vars for each worker from the head node? Like you want to get a map from worker node id to env vars? Is my understanding correct?
@jjyao
Actually, we started out by defining custom resources only, and it was working fine until now. But for some new requirements, custom resources are preventing us from building a generalised solution. With custom resources it's possible, but it increases the complexity of our solution, which we don't want. That's why we came up with the idea of using environment variables.
So, there are two main things which define the number of jobs a node can run: 1) memory (RAM), and 2) the type of CPU/GPU.
Now, based on our async actor implementation, a single detection actor can handle multiple jobs. So, for any node, we will manually test how many instances of the detection actor it can hold in memory and how many jobs per instance it can handle given its computation power. Based on the results of that testing, we will end up with four environment variables, which define #instances & #jobs_per_instance for CPU & GPU.
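For concreteness, a minimal sketch of the capacity calculation, assuming the four variables are named something like the placeholders below (the real names are whatever our worker start script exports):

```python
import os

def node_capacity() -> dict:
    """Read the per-node tuning variables exported by the worker start script.

    The variable names here are placeholders, not the actual names.
    """
    cpu_instances = int(os.environ.get("DET_CPU_INSTANCES", "1"))
    cpu_jobs = int(os.environ.get("DET_CPU_JOBS_PER_INSTANCE", "1"))
    gpu_instances = int(os.environ.get("DET_GPU_INSTANCES", "0"))
    gpu_jobs = int(os.environ.get("DET_GPU_JOBS_PER_INSTANCE", "0"))
    return {
        "cpu_instances": cpu_instances,
        "cpu_jobs_total": cpu_instances * cpu_jobs,
        "gpu_instances": gpu_instances,
        "gpu_jobs_total": gpu_instances * gpu_jobs,
    }
```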
Sorry for the late reply. Let's say you now know how many instances of the detection actor you can run per node; how are you going to launch different numbers of actors on different nodes?
I can find the number of active instances of an actor with the list_actors() state API, where I can group actors by node IP. Then, based on the node's capacity, I can check whether I can run a new instance or not. While creating an actor, I include the info in the actor name itself, e.g. detactor_ip_address_GPU, which helps me count actor instances for CPU and GPU separately. To assign an actor to a specific node, I am using the custom resource "node:ip_address".
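Roughly, the counting and pinning look like this (a sketch of my approach, not final code; the actor name format and the trailing index that keeps named actors unique are my own convention, while `node:<ip>` is the resource Ray creates automatically for each node):

```python
import ray
from collections import Counter
from ray.util.state import list_actors

ray.init(address="auto")

# Count live detection actors per (node ip, device) by parsing the actor name,
# e.g. "detactor_10.0.0.5_GPU_0" -> ("10.0.0.5", "GPU").
counts = Counter()
for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    name = actor.name or ""
    if name.startswith("detactor_"):
        _, ip, device, _idx = name.split("_")
        counts[(ip, device)] += 1

@ray.remote
class DetectionActor:
    async def handle(self, job):
        ...

def launch_on(ip: str, device: str):
    idx = counts[(ip, device)]
    # "node:<ip>" is the custom resource Ray adds automatically for every node,
    # so requesting a tiny amount of it pins the actor to that node.
    return DetectionActor.options(
        name=f"detactor_{ip}_{device}_{idx}",
        resources={f"node:{ip}": 0.001},
    ).remote()
```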
@Jules_Damji I have never used NodeAffinitySchedulingStrategy, so I will have to go through its functionality and see how I can integrate it into our pipeline flow. But even then, I will still need to access the environment variables of the worker nodes.
As @jjyao mentioned, one method is to create a task, assign it to the respective node, and have it return the values of the environment variables. Still, I am looking for an easier way to do this, if possible.
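For anyone else reading: my current understanding from the docs (not yet integrated into our pipeline) is that NodeAffinitySchedulingStrategy would let me pin the detection actor to a node by node ID instead of the `node:<ip>` custom resource, roughly like this (the IP is a placeholder):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

@ray.remote
class DetectionActor:
    async def handle(self, job):
        ...

# Look up the Ray node ID for the target IP; ray.nodes() lists every node
# together with its NodeID and NodeManagerAddress (the node's IP).
target_ip = "10.0.0.5"  # placeholder
node_id = next(
    n["NodeID"] for n in ray.nodes()
    if n["Alive"] and n["NodeManagerAddress"] == target_ip
)

# soft=False means: fail instead of silently falling back to another node.
actor = DetectionActor.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False),
).remote()
```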
Yes, That’s what I want to know. I have started going through the functionalities of NodeAffinitySchedulingStrategy, it will take time for me to update complete flow of my pipeline.
But, still with this also, it’s not solving my complete requirement which I have mentioned in below answer:
Here, I have mentioned the use case of my environment variables:
Ray doesn’t support getting the environment variables of worker nodes, so you need to do your own thing. One possibility is launching a task on each worker node to collect them.
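Something along these lines (a sketch, using placeholder variable names from earlier in the thread) would build the node-to-env-vars map from the head node by running one small task per worker node:

```python
import os
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

# Placeholder names for whatever the worker start scripts actually export.
VARS = [
    "DET_CPU_INSTANCES",
    "DET_CPU_JOBS_PER_INSTANCE",
    "DET_GPU_INSTANCES",
    "DET_GPU_JOBS_PER_INSTANCE",
]

@ray.remote(num_cpus=0)
def read_env_vars():
    # Runs on the target node, so os.environ reflects that node's variables.
    return {v: os.environ.get(v) for v in VARS}

# Launch one task per alive node, pinned with NodeAffinitySchedulingStrategy,
# and gather the results into a map of node IP -> env vars.
refs = {
    node["NodeManagerAddress"]: read_env_vars.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(
            node_id=node["NodeID"], soft=False
        )
    ).remote()
    for node in ray.nodes()
    if node["Alive"]
}
env_by_node = {ip: ray.get(ref) for ip, ref in refs.items()}
print(env_by_node)
```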