Hi there!
I’m using the FT (fault tolerance) variant with an external Redis:
```yaml
apiVersion: v1
items:
- apiVersion: ray.io/v1alpha1
  kind: RayCluster
  metadata:
    annotations:
      meta.helm.sh/release-name: kuberay-cluster
      meta.helm.sh/release-namespace: kuberay
      ray.io/external-storage-namespace: ray-a
      ray.io/ft-enabled: "true"
    labels:
      app.kubernetes.io/instance: kuberay-cluster
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: kuberay
      helm.sh/chart: ray-cluster-0.4.0
    name: kuberay-cluster
    namespace: kuberay
  spec:
    headGroupSpec:
      rayStartParams:
        block: "true"
        dashboard-agent-listen-port: "52365"
        dashboard-host: 0.0.0.0
        metrics-export-port: "9001"
        num-cpus: "0"
        system-config: '''{"object_spilling_threshold": 0.99}'''
      serviceType: ClusterIP
      template:
        metadata:
          annotations: {}
          labels:
            app.kubernetes.io/instance: kuberay-cluster
            app.kubernetes.io/managed-by: Helm
            app.kubernetes.io/name: kuberay
            helm.sh/chart: ray-cluster-0.4.0
        spec:
          affinity: {}
          containers:
          - env:
            - name: RAY_BACKEND_LOG_LEVEL
              value: "error"
            - name: RAY_REDIS_ADDRESS
              value: redis.local:6385
            image: rayproject/ray:2.4.0-py39-cu113
```
The problem is that Ray writes far too much to the ray-a key in Redis. When I launch a number of actors under Serve during online training, I see entries like these accumulating in the key store:
.... The actor is dead because it was killed by `ray.kill`....
.... Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time. ....
and so on.
If a Serve actor spawns many regular actor processes, the number of entries under a single key grows by more than 10K per hour.
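For scale, this is roughly how I measure the growth. It is a minimal sketch using redis-py: the host and port come from RAY_REDIS_ADDRESS in the spec above, but the `*ray-a*` match pattern and the assumption that the GCS tables are Redis hashes are only my guesses about how the external-storage-namespace data is laid out, so treat it as illustrative.

```python
# Minimal sketch of how I track the growth with redis-py.
# Assumptions: host/port taken from RAY_REDIS_ADDRESS in the spec above;
# the "*ray-a*" pattern and the hash layout are guesses about how the GCS
# stores the external-storage-namespace data, so adjust to your setup.
import time
import redis

r = redis.Redis(host="redis.local", port=6385)

def count_namespace_entries(pattern: str = "*ray-a*") -> int:
    """Sum the field counts of every hash key matching the pattern."""
    total = 0
    for key in r.scan_iter(match=pattern):
        if r.type(key) == b"hash":
            total += r.hlen(key)
    return total

before = count_namespace_entries()
time.sleep(3600)  # one hour of Serve traffic during online training
after = count_namespace_entries()
print(f"entries grew by {after - before} in one hour")
```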
The question is: how can I minimize what is written to Redis, down to only what is strictly needed for FT and head recovery?