Ray operator + client-server + autoscaling + openshift

Continuing the discussion from Slack, @Dmitri


@Dmitri So the latest nightly seems to be working significantly better - the ray operator automatically spins up a head node and a worker (per minworkers 1 setting).

Even better: I added --ray-client-server-port=50051 to the head startup command, and I can connect to that head from my jupyter notebook \o/.
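For anyone following along, here is a minimal sketch of the notebook-side connection. It assumes the Ray build in use exposes `ray.util.connect` and that the head is reachable under some hostname of your own; the helper names `client_address` and `connect_to_head` are made up for illustration.

```python
def client_address(host: str, port: int = 50051) -> str:
    """Build the "host:port" string passed to the Ray client."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"{host}:{port}"


def connect_to_head(host: str, port: int = 50051) -> None:
    """Connect this process to a Ray head running the client server."""
    # Imported lazily so client_address() works even where Ray is not installed.
    import ray.util

    ray.util.connect(client_address(host, port))
```

In a notebook cell this would be called as `connect_to_head("<your-head-host>")`, with the port matching whatever was passed to `--ray-client-server-port`.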

What it doesn’t seem to do is actually scale. I gave it a job to burn compute cycles, which ran the cluster nodes for 4 minutes or so, but it never added any more workers. A sample log fragment looks like this:

---------------------------------------------------------------
Healthy:
 1 head-node
 1 worker-nodes
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 2.0/2.0 CPU
 0.00/0.391 GiB object_store_memory
 0.00/1.172 GiB memory
 0.0/1.0 Custom1
 0.0/1.0 is_spot

Demands:
 {'CPU': 1.0}: 4728+ pending tasks/actors
example-cluster2:2021-02-06 13:25:35,245	DEBUG legacy_info_string.py:24 -- Cluster status: 1 nodes
 - MostDelayedHeartbeats: {'10.131.0.51': 0.19257760047912598, '10.131.0.50': 0.19218945503234863}
 - NodeIdleSeconds: Min=0 Mean=5 Max=10
 - ResourceUsage: 1.0/2.0 CPU, 0.0/1.0 Custom1, 0.0/1.0 is_spot, 0.0 GiB/1.26 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - worker-nodes: 1
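The fragment above shows the symptom: all CPUs in use and thousands of pending tasks, yet no pending nodes. A rough sketch of checking for that condition from the status text, assuming the log keeps the line formats shown here (the function names are hypothetical and this is not part of Ray):

```python
import re

# Trimmed copy of the status fragment from the log above.
STATUS = """\
Usage:
 2.0/2.0 CPU
 0.00/0.391 GiB object_store_memory
Demands:
 {'CPU': 1.0}: 4728+ pending tasks/actors
"""


def cpu_usage(status: str):
    """Return (used, total) CPUs from the 'Usage' section, or None if absent."""
    m = re.search(r"^\s*([\d.]+)/([\d.]+) CPU\s*$", status, re.MULTILINE)
    return (float(m.group(1)), float(m.group(2))) if m else None


def pending_demand(status: str) -> int:
    """Return the pending task/actor count from the 'Demands' section."""
    m = re.search(r":\s*(\d+)\+? pending tasks/actors", status)
    return int(m.group(1)) if m else 0


def looks_starved(status: str) -> bool:
    """True when every CPU is in use and work is still queued."""
    usage = cpu_usage(status)
    return usage is not None and usage[0] >= usage[1] and pending_demand(status) > 0
```

If `looks_starved` stays true across several status reports while "Pending" shows no nodes, the autoscaler is not reacting to demand, which matches what I am seeing.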

I am still seeing it complain about not finding the autoscaler service account:

KubernetesNodeProvider: no autoscaler_service_account config provided, must already exist
KubernetesNodeProvider: no autoscaler_role config provided, must already exist
KubernetesNodeProvider: no autoscaler_role_binding config provided, must already exist
KubernetesNodeProvider: no services config provided, must already exist

In the log output, that is an INFO level message, and overall the operator clearly is doing something, but I’m assuming the above may be connected to it not actually scaling the cluster.

Thanks! We’ll take a look into the scaling problem.

The messages about service_account etc. are vestiges from reuse of code for the Ray cluster launcher – we’ll get rid of these soon.
The permissions the operator needs are set in operator.yaml

By the way, as of a recent commit, ray start starts the Ray client server automatically, by default on port 10001.
https://docs.ray.io/en/master/ray-client.html?highlight=ray%20client

@Dmitri
Just FYI: my use case is similar to @Erik_Erlandson's.
In addition to the autoscaler issue Erik mentioned, a few weeks back I tried a couple of times to connect to the Ray head from a Jupyter pod in the same k8s cluster and couldn’t get it working.
One time I got:
RuntimeError: Version mismatch: The cluster was started with:
Ray: v2.0.0.dev0
Python: 3.8.5
This process on node 192.168.1.159 was started with:
Ray: v2.0.0.dev0
Python: 3.6.9

I will give it another try!

@paravatha I have jupyter → ray working (see link below)

On the version mismatch: I got that too, and I had to make sure my Ray images and my Jupyter notebook image both use the same Python (in my case that is Python 3.6, but as long as they’re aligned it should be OK).
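A quick way to catch this before connecting is to compare the notebook's Python version with the one reported for the cluster. The sketch below is only an illustration with hypothetical helper names; it compares the full version string to be conservative, and whether the patch level must match may depend on the Ray version.

```python
import sys


def local_python() -> str:
    """Version of the interpreter running this code, e.g. '3.6.9'."""
    return ".".join(str(part) for part in sys.version_info[:3])


def versions_match(local: str, cluster: str) -> bool:
    """Conservative check: the two version strings must be identical."""
    return local.strip() == cluster.strip()


def describe_mismatch(local: str, cluster: str) -> str:
    """Human-readable summary of the comparison."""
    if versions_match(local, cluster):
        return f"OK: both sides run Python {local}"
    return f"Mismatch: cluster runs Python {cluster}, this process runs {local}"
```

For the error quoted above, `describe_mismatch("3.6.9", "3.8.5")` reports the mismatch; running `local_python()` inside both images shows what each one actually ships.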


Great, will give it a try.
I don’t have admin access on OCP 4.5, so I am trying on GKE (version 1.17.14-gke.1600).

@Erik_Erlandson @Dmitri
I am now able to connect to the head pod from Jupyter and run notebooks.
I had to create a couple of services:

  1. To expose the dashboard outside GKE
    ray/ray_dashboard_svc.yaml at dev-update-from-upstream · paravatha/ray · GitHub
  2. To expose the head pod to Jupyter and use ray.util.connect('rayhead-service:10001')
    ray/ray_head_svc.yaml at dev-update-from-upstream · paravatha/ray · GitHub

Thanks for your help, I will play around more!


Hi @Dmitri,
Have you been able to reproduce the lack of scale-up with the operator that I was seeing?

Found the problem. It’s a simple fix, and I just opened a PR for it.

@Dmitri great news :tada:

@Dmitri confirmed, the latest nightly build does auto-scaling correctly, and as before I can connect from jupyter using the client-server connection :tada:

Note: I had to update the ray-operator-role to add the “services” resource, but the example in the repo has been updated the same way.
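If you want to check an older copy of the role, something like the sketch below can tell you whether it already covers services. It assumes the Role has been loaded into a Python dict (for example from YAML); the rule contents in `example_role` are placeholders rather than the real operator.yaml, and `role_grants` is a made-up helper. Services belong to the core API group, which Kubernetes writes as an empty string.

```python
def role_grants(role: dict, resource: str, api_group: str = "") -> bool:
    """True if any rule in the Role names the resource in the given API group."""
    for rule in role.get("rules", []):
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        group_ok = api_group in groups or "*" in groups
        resource_ok = resource in resources or "*" in resources
        if group_ok and resource_ok:
            return True
    return False


# Placeholder rules for illustration only; the real ones live in operator.yaml.
example_role = {
    "kind": "Role",
    "metadata": {"name": "ray-operator-role"},
    "rules": [
        {
            "apiGroups": [""],
            "resources": ["pods", "services"],
            "verbs": ["get", "list", "create", "delete"],
        },
    ],
}
```

Here `role_grants(example_role, "services")` is true; a role whose rules only list `pods` would return false and need the extra resource added.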

Super — glad it’s working for you.

The operator now configures a default service for accessing the Ray head node, hence the added permissions. We’re in the process of updating the docs.