I am reading ray 1.0 architecture document and I had few questions.
Question 1:
Under the heading Component
, it mentions
A Ray instance consists of one or more worker nodes
Does that mean at minimum just a head node? or does that mean at minimum one head node and one worker node?
Question 2:
The default number of initial workers is equal to the number of CPUs on the machine
Is this referring to the number of workers
within a worker node or is this referring to the total number of worker nodes
? i.e if I have 4 cores will I have 4 worker nodes or one worker node with 4 workers?
Figure for question 2
Question 3:
Each worker stores: an ownership table and in-process store
Is there one ownership table and one in-process store per worker node
or per worker within the worker node
?
Figure for question 3
Question 4:
A raylet. The raylet is shared among all jobs on the same cluster.
Does that mean the Raylet: Scheduler and Object store
is the same for the head node and all other worker nodes? Or does each head node and worker node have their own individual Raylet: Scheduler and Object store
.
The reason I am asking is under Failure modes
it mentions All worker processes fate-share with the raylet on their node
. This is confusing to me since now I do not know if the raylet is shared amongst the whole cluster (all head nodes and worker nodes) or shared within the node itself being it head node or worker node.
Figure for question 4
or
Figure for question 4
Question 5:
Task execution for actors is similar to that of normal tasks
In the above sentence, normal tasks
mentions “heading no longer exists” can you please fix the reference?
Thanks in advance
Does that mean at minimum just a head node? or does that mean at minimum one head node and one worker node?
Minimum is just a head node!
Is this referring to the number of workers
within a worker node or is this referring to the total number of worker nodes
? i.e if I have 4 cores will I have 4 worker nodes or one worker node with 4 workers?
Yeah this is confusing haha… It means “worker processes”. So there are num_cpus
number of worker processes. Note that when you start Ray using ray start
, it doesn’t start worker processes by default, and they are started when a driver (python script) needs them.
Is there one ownership table and one in-process store per worker node
or per worker within the worker node
?
Per worker process!
Does that mean the Raylet: Scheduler and Object store
is the same for the head node and all other worker nodes? Or does each head node and worker node have their own individual Raylet: Scheduler and Object store
.
It’s the latter! Ray’s scheduler is decentralized. And the object store is distributed.
In the above sentence, normal tasks
mentions “heading no longer exists” can you please fix the reference?
Will fix!
2 Likes
Thanks, @sangcho for the reply. This clears out some thoughts. I had follow-up questions if you don’t mind and have the time.
Question 1:
Minimum is just a head node!
Does that head node contain “1 driver and 1 worker process at a minimum” or only “one driver”?
Question 2:
It means “worker processes”. So there are num_cpus
number of worker processes.
2.a - Theoretically speaking, if I have 8 cores then I will have 8 worker processes. Are these worker processes within the head node or within the worker node?
2.b - What controls the number of worker nodes if the number of CPUs controls the worker processes?
2.c - What decides the distribution of worker processes within the worker nodes? i.e will one worker node have 2 processes while the other worker nodes have 3 processes or all worker nodes always have the same number of worker processes?
2.d - Do worker processes within the worker node depend on each other returns or all worker processes within the work node are in parallel and any dependencies is from another worker node?
Question 3:
It’s the latter! Ray’s scheduler is decentralized. And the object store is distributed.
So does that mean that the following sentence in the document is wrong “A raylet. The raylet is shared among all jobs on the same cluster.”. Should it be " The raylet is shared within a node (being it a worker node or a head node). Or in this case, a cluster refers to a node?
Thanks again! And sorry for the barrage of questions. This is quite exciting to understand the technology behind Ray And hopefully, I can contribute to that document if possible in a later stage.
Does that head node contain “1 driver and 1 worker process at a minimum” or only “one driver”?
Head node doesn’t need to have worker processes. For example, one of pattern that some people use is to use 0 cpu on a head node (so none of ray worker processes will start).
For 2.a~d
I recommend you to look at this doc to understand how to deploy Ray. I think you can answer all questions after you understand this part. Please follow up with me if there are still unknowns!
https://docs.ray.io/en/latest/cluster/index.html
So does that mean that the following sentence in the document is wrong “ A raylet. The raylet is shared among all jobs on the same cluster. ”.
A Ray instance consists of one or more worker nodes, each of which consists of the following physical processes:
Notet this sentence that is above the sentence you are referring to.
Also, I think “job” is ambiguous sentence here in the white paper. I think what it meant is “work”, (e.g., tasks and actors).
Also, no problem about asking many questions :). Happy to answer them!
1 Like