Ray operator + client-server + autoscaling + openshift

Erik_Erlandson · February 5, 2021, 6:54pm

Continuing discussion from slack, @Dmitri

Erik_Erlandson · February 6, 2021, 9:37pm

@Dmitri The latest nightly seems to be working significantly better: the Ray operator automatically spins up a head node and one worker (per the minworkers 1 setting).

Even better: I added --ray-client-server-port=50051 to the head startup command, and I can connect to that head from my jupyter notebook \o/.

What it doesn’t seem to do is actually scale. I gave it a job to burn compute cycles, which ran the cluster nodes for 4 minutes or so, but it never added any more workers. A sample log fragment looks like this:

---------------------------------------------------------------
Healthy:
 1 head-node
 1 worker-nodes
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 2.0/2.0 CPU
 0.00/0.391 GiB object_store_memory
 0.00/1.172 GiB memory
 0.0/1.0 Custom1
 0.0/1.0 is_spot

Demands:
 {'CPU': 1.0}: 4728+ pending tasks/actors
example-cluster2:2021-02-06 13:25:35,245	DEBUG legacy_info_string.py:24 -- Cluster status: 1 nodes
 - MostDelayedHeartbeats: {'10.131.0.51': 0.19257760047912598, '10.131.0.50': 0.19218945503234863}
 - NodeIdleSeconds: Min=0 Mean=5 Max=10
 - ResourceUsage: 1.0/2.0 CPU, 0.0/1.0 Custom1, 0.0/1.0 is_spot, 0.0 GiB/1.26 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - worker-nodes: 1
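
For anyone who wants to spot this condition in their own operator logs, here is a minimal sketch that parses a status fragment like the one above and flags "CPU fully used while tasks are pending". The regexes and the `looks_starved` helper are hypothetical and tuned only to the lines quoted in this post; this is not how the Ray autoscaler itself decides to scale.

```python
import re

# Status fragment copied from the log above.
STATUS = """\
Usage:
 2.0/2.0 CPU
 0.00/0.391 GiB object_store_memory
 0.00/1.172 GiB memory
 0.0/1.0 Custom1
 0.0/1.0 is_spot

Demands:
 {'CPU': 1.0}: 4728+ pending tasks/actors
"""

# Matches lines such as " 2.0/2.0 CPU" or " 0.00/0.391 GiB object_store_memory".
USAGE_RE = re.compile(r"^\s*([\d.]+)/([\d.]+)\s+(?:GiB\s+)?(\S+)\s*$")
# Matches lines such as " {'CPU': 1.0}: 4728+ pending tasks/actors".
DEMAND_RE = re.compile(r"^\s*(\{.*\}):\s*(\d+)\+?\s+pending")


def parse_status(text):
    """Return ({resource: (used, total)}, [(demand_shape, pending_count)])."""
    usage, demands = {}, []
    for line in text.splitlines():
        m = USAGE_RE.match(line)
        if m:
            usage[m.group(3)] = (float(m.group(1)), float(m.group(2)))
            continue
        m = DEMAND_RE.match(line)
        if m:
            demands.append((m.group(1), int(m.group(2))))
    return usage, demands


def looks_starved(usage, demands, resource="CPU"):
    """Hypothetical heuristic: resource is saturated and work is still queued."""
    used, total = usage.get(resource, (0.0, 0.0))
    return used >= total and any(count > 0 for _, count in demands)


usage, demands = parse_status(STATUS)
print(usage["CPU"], demands, looks_starved(usage, demands))
```

With the fragment above, CPU is at 2.0/2.0 and there are 4728+ pending tasks, which is the situation where one would expect the operator to add workers.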

I am still seeing it complain about missing autoscaler service-account configs:

KubernetesNodeProvider: no autoscaler_service_account config provided, must already exist
KubernetesNodeProvider: no autoscaler_role config provided, must already exist
KubernetesNodeProvider: no autoscaler_role_binding config provided, must already exist
KubernetesNodeProvider: no services config provided, must already exist

Those are INFO-level messages in the log output, and the operator clearly is doing something, but I suspect they may be connected to the cluster not actually scaling.

Dmitri · February 6, 2021, 10:02pm

Thanks! We’ll take a look into the scaling problem.

The messages about service_account etc. are vestiges from reuse of code for the Ray cluster launcher – we’ll get rid of these soon.
The permissions the operator needs are set in operator.yaml.

By the way, as of a recent commit, ray start starts the Ray client server automatically, by default on port 10001.
https://docs.ray.io/en/master/ray-client.html?highlight=ray%20client

paravatha · February 7, 2021, 8:58pm

@Dmitri
Just FYI: my use case is similar to @Erik_Erlandson's.
In addition to the autoscaler issue Erik mentioned, a few weeks back I tried a couple of times to connect to the Ray head from a Jupyter pod in the same k8s cluster and couldn’t get it working.
One time I got:
RuntimeError: Version mismatch: The cluster was started with:
Ray: v2.0.0.dev0
Python: 3.8.5
This process on node 192.168.1.159 was started with:
Ray: v2.0.0.dev0
Python: 3.6.9

I will give it another try!

Erik_Erlandson · February 7, 2021, 9:04pm

@paravatha I have jupyter → ray working (see link below)

On the version mismatch: I got that too, and I had to make sure my Ray images and my Jupyter notebook image both use the same Python (in my case that is Python 3.6, but as long as they’re aligned it should be OK).
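
A quick way to check alignment before connecting is to compare the Python version reported for the cluster with the one running the notebook. This is a minimal sketch; the helper names are hypothetical, and comparing only major.minor is an assumption about how strict the match needs to be (Ray's own check may be stricter).

```python
import sys


def python_minor(version: str) -> tuple:
    """Reduce a version string such as '3.8.5' to (3, 8)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_python(cluster_version: str, local_version: str = None) -> bool:
    """True when both sides run the same major.minor Python."""
    if local_version is None:
        # Default to the interpreter running this notebook or script.
        local_version = ".".join(str(p) for p in sys.version_info[:3])
    return python_minor(cluster_version) == python_minor(local_version)


# Versions taken from the error message quoted earlier in the thread.
print(same_python("3.8.5", "3.6.9"))
```

For the 3.8.5 cluster and 3.6.9 notebook from the error above, the check reports a mismatch, which is consistent with the RuntimeError.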

paravatha · February 7, 2021, 9:06pm

Great, will give it a try.
I don’t have admin access on OCP 4.5, so I am trying on GKE, version 1.17.14-gke.1600.

paravatha · February 8, 2021, 1:12am

@Erik_Erlandson @Dmitri
I am now able to connect to the head pod from Jupyter and run notebooks.
I had to create a couple of services:

To expose the dashboard outside GKE:
ray/ray_dashboard_svc.yaml at dev-update-from-upstream · paravatha/ray · GitHub
To expose the head pod to Jupyter and use ray.util.connect('rayhead-service:10001'):
ray/ray_head_svc.yaml at dev-update-from-upstream · paravatha/ray · GitHub

Thanks for your help, I will play around more!
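
For reference, the notebook side of that connection can be sketched as below. The service name `rayhead-service` and port 10001 come from this thread; the helper functions are hypothetical, and the sketch assumes a Ray build where `ray.util.connect` is available (newer Ray releases document `ray.init("ray://host:port")` for the same purpose).

```python
def client_address(host: str, port: int = 10001) -> str:
    """Build the host:port string passed to the Ray client."""
    return f"{host}:{port}"


def connect_to_head(host: str, port: int = 10001):
    """Connect this process to the Ray client server on the head node."""
    # Imported lazily so client_address() works even without Ray installed.
    import ray.util

    return ray.util.connect(client_address(host, port))


# In a notebook running in the same k8s cluster as the head service:
#   connect_to_head("rayhead-service")
# If the head was started with --ray-client-server-port=50051 instead:
#   connect_to_head("rayhead-service", port=50051)
```

Keeping the port as a parameter makes it easy to switch between the default 10001 and a custom port set on the head startup command.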

Erik_Erlandson · February 9, 2021, 3:03pm

Hi @Dmitri
have you been able to reproduce the lack of scale-up on the operator that I was seeing?

Dmitri · February 10, 2021, 2:41am

Found the problem, simple fix, just opened PR for it

Erik_Erlandson · February 10, 2021, 3:27pm

@Dmitri great news

Erik_Erlandson · February 10, 2021, 11:50pm

@Dmitri confirmed, the latest nightly build does auto-scaling correctly, and as before I can connect from jupyter using the client-server connection

Note: I had to update the ray-operator-role to add “services” resources, but the example in the repo has been updated the same way.

Dmitri · February 11, 2021, 12:09am

Super — glad it’s working for you.

The operator now configures a default service for accessing the Ray head node, hence the added permissions. We’re in the process of updating the docs.
