K8s cluster operator vs. the K8s cluster launcher

I’m wondering which of these is the preferred method of launching/running clusters on K8s? Are there any docs that go into more detail about the differences, pros/cons, etc.?

Hey Kyle! The operator will be the recommended way going forward :slight_smile:

We actually just merged some docs here to talk about pros/cons: Deploying on Kubernetes — Ray v2.0.0.dev0

cc @Dmitri for more questions

Thanks @rliaw exactly what I was looking for!

@rliaw What would your suggestion be for users who cannot launch CRDs into their Kubernetes cluster? Is the K8s Operator considered to be more stable than the Cluster Launcher?

Hmm, can you tell me more about the context of that constraint? I would say both will have an equivalent stability level within the next two releases.

cc @Dmitri

Agree with the stability assertion.

Creating a CRD does require some form of cluster-level permissions – can you say more about the permissions context of your K8s usage?

We manage almost all of our infrastructure using Terraform. Normally I would just translate any K8s yaml into Terraform but that isn’t possible since our Terraform provider doesn’t support CRDs.

It’s interesting that you’re both saying the cluster launcher and operator have equal levels of stability when one is recommended for production over the other. Can you explain in more detail why that is the case?

We’ve experienced quite a few ActorLostErrors, in addition to GCS connection leakage, when running large numbers of Ray Tune trials with the cluster launcher. We had hoped these issues would go away by converting to the operator, but as noted, it’s not a great fit for our org’s workflow.

Is it possible to get a stacktrace + GCS connection info for this actor-lost error? I think it might be a separate issue related to Ray core.

The operator is a more natural interface from a K8s perspective.

But currently, the cluster launcher and operator use pretty much the same code path.

The only real architectural difference is that the operator centralizes cluster launching and autoscaling in the operator’s pod,
rather than having clusters launched from a laptop and autoscaled from the Ray head node.
The operator setup is thus better for reproducible deployments that don’t depend on your local environment’s state (e.g. local Ray version).
Having the operator pod run the autoscaler is also a more reliable way to do things, as it disentangles the autoscaler’s work from the computations happening in the Ray cluster.

In the future, we’ll be making more architectural improvements to the Operator that would be more awkward to achieve in the framework of the Ray cluster launcher –
for example, compartmentalizing other Ray processes currently running on the Ray head in separate pods for better fault tolerance
and perhaps having the operator balance resources among several Ray clusters.
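
To illustrate that decoupling: with the operator, the cluster’s lifecycle is owned by the operator pod rather than by your laptop, so a driver typically just connects to the already-running head node. Below is a minimal sketch, assuming a Ray version that includes ray.util.connect and a head-node Service named example-cluster-ray-head (a hypothetical name) exposing the default Ray Client port 10001:

# Minimal sketch: the operator owns the cluster's lifecycle, so the
# driver only connects to the existing head service.
import ray

# "example-cluster-ray-head" is a placeholder for whatever Service the
# operator created for your cluster; 10001 is the default Ray Client port.
ray.util.connect("example-cluster-ray-head:10001")

@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))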

The errors don’t really seem to say much. I don’t know what else to do; we simply cannot get Ray to be stable.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 519, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 497, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.8/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
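
For what it’s worth, while the root cause is being tracked down, one way to make Tune more tolerant of lost actors is to let it retry failed trials. A minimal sketch, assuming a Ray 1.x-style tune.run API; max_failures is the relevant argument, while the trainable and config here are made up for illustration:

# Minimal sketch (Ray 1.x-style Tune API): retry a trial whose actor
# dies instead of failing it on the first RayActorError.
import ray
from ray import tune

ray.init(address="auto")  # attach to the existing Ray cluster

def trainable(config):
    # Placeholder training loop; reports a dummy metric each step.
    for step in range(100):
        tune.report(score=config["x"] * step)

tune.run(
    trainable,
    config={"x": tune.grid_search([1, 2, 3])},
    max_failures=3,  # re-run a trial up to 3 times if its actor is lost
)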