Hi there!
I’m using the FT (fault tolerance) variant with an external Redis:
```yaml
apiVersion: v1
items:
- apiVersion: ray.io/v1alpha1
  kind: RayCluster
  metadata:
    annotations:
      meta.helm.sh/release-name: kuberay-cluster
      meta.helm.sh/release-namespace: kuberay
      ray.io/external-storage-namespace: ray-a
      ray.io/ft-enabled: "true"
    labels:
      app.kubernetes.io/instance: kuberay-cluster
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: kuberay
      helm.sh/chart: ray-cluster-0.4.0
    name: kuberay-cluster
    namespace: kuberay
  spec:
    headGroupSpec:
      rayStartParams:
        block: "true"
        dashboard-agent-listen-port: "52365"
        dashboard-host: 0.0.0.0
        metrics-export-port: "9001"
        num-cpus: "0"
        system-config: '''{"object_spilling_threshold": 0.99}'''
      serviceType: ClusterIP
      template:
        metadata:
          annotations: {}
          labels:
            app.kubernetes.io/instance: kuberay-cluster
            app.kubernetes.io/managed-by: Helm
            app.kubernetes.io/name: kuberay
            helm.sh/chart: ray-cluster-0.4.0
        spec:
          affinity: {}
          containers:
          - env:
            - name: RAY_BACKEND_LOG_LEVEL
              value: "error"
            - name: RAY_REDIS_ADDRESS
              value: redis.local:6385
            image: rayproject/ray:2.4.0-py39-cu113
```
The problem is that Ray writes far too much to the ray-a key in Redis. When I launch a number of actors under Serve during online training, I see entries like these accumulating in the key store:
.... The actor is dead because it was killed by `ray.kill`....
.... Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time. ....
and so on.
If a Serve actor spawns many regular actor processes, the number of entries under a single key grows by more than 10K per hour.
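For scale, this is roughly how I measure the growth. It is a minimal sketch using redis-py: the host and port come from RAY_REDIS_ADDRESS in the spec above, but the `*ray-a*` match pattern and the assumption that the GCS tables are Redis hashes are only my guesses about how the external-storage-namespace data is laid out, so treat it as illustrative.

```python
# Minimal sketch of how I track the growth with redis-py.
# Assumptions: host/port taken from RAY_REDIS_ADDRESS in the spec above;
# the "*ray-a*" pattern and the hash layout are guesses about how the GCS
# stores the external-storage-namespace data, so adjust to your setup.
import time
import redis

r = redis.Redis(host="redis.local", port=6385)

def count_namespace_entries(pattern: str = "*ray-a*") -> int:
    """Sum the field counts of every hash key matching the pattern."""
    total = 0
    for key in r.scan_iter(match=pattern):
        if r.type(key) == b"hash":
            total += r.hlen(key)
    return total

before = count_namespace_entries()
time.sleep(3600)  # one hour of Serve traffic during online training
after = count_namespace_entries()
print(f"entries grew by {after - before} in one hour")
```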
The question is: how can I minimize what is written to Redis, down to only what is strictly needed for FT and head recovery?