K8s cluster operator vs. the K8s cluster launcher

I’m wondering which of these is the preferred method of launching/running clusters on K8s? Are there any docs that go into more detail about the differences, pros/cons, etc.?

Hey Kyle! The operator will be the recommended way going forward :slight_smile:

We actually just merged some docs here to talk about pros/cons: Deploying on Kubernetes — Ray v2.0.0.dev0

cc @Dmitri for more questions

Thanks @rliaw exactly what I was looking for!

@rliaw What would your suggestion be for users who cannot launch CRDs into their Kubernetes cluster? Is the K8s Operator considered to be more stable than the Cluster Launcher?

Hmm, can you tell me more about the context of that constraint? I would say both will have an equivalent stability level within the next two releases.

cc @Dmitri

Agree with the stability assertion.

Creating a CRD does require some form of cluster-level permissions – can you say more about the permissions context of your K8s usage?

We manage almost all of our infrastructure using Terraform. Normally I would just translate any K8s yaml into Terraform but that isn’t possible since our Terraform provider doesn’t support CRDs.

It’s interesting that you’re both saying the cluster launcher and operator have equal levels of stability when one is recommended for production over the other. Can you explain in more detail why that is the case?

We’ve experienced quite a few ActorLostErrors, in addition to GCS connection leakage, when running large numbers of Ray Tune trials with the cluster launcher. We had hoped these issues would go away by converting to the operator, but as noted, it’s not a great fit for our org’s workflow.

Is it possible to get a stacktrace + GCS connection info for this actor-lost error? I think it might be a separate issue related to Ray core.

The operator is a more natural interface from a K8s perspective.

But currently, the cluster launcher and operator use pretty much the same code path.

The only real architectural difference is that the operator centralizes cluster launching and autoscaling in the operator’s pod,
rather than having clusters launched from a laptop and autoscaled from the Ray head node.
The operator setup is thus better for reproducible deployments that don’t depend on your local environment’s state (e.g. local Ray version).
Having the operator pod run the autoscaler is also a more reliable way to do things, as it disentangles the autoscaler’s work from the computations happening in the Ray cluster.

In the future, we’ll be making more architectural improvements to the Operator that would be more awkward to achieve in the framework of the Ray cluster launcher –
for example, compartmentalizing other Ray processes currently running on the Ray head in separate pods for better fault tolerance
and perhaps having the operator balance resources among several Ray clusters.
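
To illustrate that decoupling: with the operator, the cluster’s lifecycle is owned by the operator pod rather than by your laptop, so a driver typically just connects to the already-running head node. Below is a minimal sketch, assuming a Ray version that includes ray.util.connect and a head-node Service named example-cluster-ray-head (a hypothetical name) exposing the default Ray Client port 10001:

# Minimal sketch: the operator owns the cluster's lifecycle, so the
# driver only connects to the existing head service.
import ray

# "example-cluster-ray-head" is a placeholder for whatever Service the
# operator created for your cluster; 10001 is the default Ray Client port.
ray.util.connect("example-cluster-ray-head:10001")

@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))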

The errors don’t really seem to say much. I don’t know what else to do; we simply cannot get Ray to be stable.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 519, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 497, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.8/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
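
For what it’s worth, while the root cause is being tracked down, one way to make Tune more tolerant of lost actors is to let it retry failed trials. A minimal sketch, assuming a Ray 1.x-style tune.run API; max_failures is the relevant argument, while the trainable and config here are made up for illustration:

# Minimal sketch (Ray 1.x-style Tune API): retry a trial whose actor
# dies instead of failing it on the first RayActorError.
import ray
from ray import tune

ray.init(address="auto")  # attach to the existing Ray cluster

def trainable(config):
    # Placeholder training loop; reports a dummy metric each step.
    for step in range(100):
        tune.report(score=config["x"] * step)

tune.run(
    trainable,
    config={"x": tune.grid_search([1, 2, 3])},
    max_failures=3,  # re-run a trial up to 3 times if its actor is lost
)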