Massive network I/O when a Serve replica is unhealthy or autoscaling

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi folks, we are using Ray Serve (on a RayCluster CRD with Ray 2.7.0) and are seeing massive network I/O when the service is marked “unhealthy” or is autoscaling.
We are trying to understand what this traffic is.
Our models for the Serve application are stored inside our VPC, but we are seeing this data flow from outside our NAT gateway.
To help debug, we want to understand what kind of traffic we should expect when this happens, especially outbound traffic, because I don’t see why there should be any outbound network traffic.
(green is outbound, yellow is inbound)

This is high priority for us because it’s costing us a ton of money in network charges.

When a Ray Serve application is marked as “unhealthy” or is autoscaling, there are several operations that could potentially generate network traffic:

  1. Auto-scaling: Ray Serve’s auto-scaling capabilities dynamically adjust the number of replicas and nodes in response to user demand. This could result in an increase in network traffic as new nodes are added and data is transferred between them.

  2. Health checks and failure recovery: If a replica or an entire Ray node crashes for any reason, the controller will detect the failure and ensure that the actors are recovered and can continue serving traffic. This process could also generate network traffic.

  3. RayCluster preparation: When the application is deployed through a KubeRay RayService and KubeRay considers the RayCluster unhealthy, it prepares a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it and then deletes the old RayCluster.

  4. Ray Serve Autoscaler vs Ray Autoscaler: The Ray Serve autoscaler asks Ray to start a number of replica actors based on the request demand. If the Ray Autoscaler determines there aren’t enough available resources to place these actors, it responds by requesting more Ray nodes. The underlying cloud provider then responds by adding more nodes.
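To make the two autoscaling layers in item 4 more concrete, here is a minimal toy model in plain Python. It is not Ray's actual algorithm: the function names, the "in-flight requests per replica" target, and the assumption that CPU is the only placement constraint are all simplifications made up for illustration.

```python
import math


def desired_replicas(ongoing_requests: int, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Hypothetical Serve-level step: pick enough replicas to keep each one
    near `target_per_replica` in-flight requests, clamped to the bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def extra_nodes_needed(replicas: int, cpus_per_replica: float,
                       cpus_per_node: float, current_nodes: int) -> int:
    """Hypothetical cluster-level step: how many additional nodes are needed
    to place all replicas, assuming CPU is the only constraint."""
    replicas_per_node = math.floor(cpus_per_node / cpus_per_replica)
    if replicas_per_node < 1:
        raise ValueError("a single replica does not fit on one node")
    nodes_required = math.ceil(replicas / replicas_per_node)
    return max(0, nodes_required - current_nodes)
```

For example, with 45 in-flight requests and a target of 5 per replica, the first function asks for 9 replicas; at 2 CPUs per replica on 8-CPU nodes, that needs 3 nodes, so a one-node cluster would request 2 more. Each newly added node has to be bootstrapped, which is one place extra traffic can come from.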

However, without more specific information about your setup and the nature of the network traffic, it’s difficult to say definitively what is causing the high network IO you’re observing.
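One way to gather that information is to sample per-interface byte counters on the affected pods while the application is unhealthy or scaling. The sketch below assumes a Linux container where `/proc/net/dev` is readable; the helper names (`parse_net_dev`, `byte_deltas`, `sample_deltas`) are made up for this example and are not part of Ray.

```python
import time
from typing import Dict, Tuple

# interface name -> (received bytes, transmitted bytes)
Counters = Dict[str, Tuple[int, int]]


def parse_net_dev(text: str) -> Counters:
    """Parse the contents of /proc/net/dev into per-interface byte counters."""
    counters: Counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue  # not a complete counter row
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_deltas(before: Counters, after: Counters) -> Counters:
    """Bytes received and sent per interface between two samples."""
    return {
        iface: (rx - before[iface][0], tx - before[iface][1])
        for iface, (rx, tx) in after.items()
        if iface in before
    }


def sample_deltas(interval_s: float, path: str = "/proc/net/dev") -> Counters:
    """Read the counters twice, `interval_s` seconds apart, and diff them."""
    with open(path) as f:
        before = parse_net_dev(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_net_dev(f.read())
    return byte_deltas(before, after)
```

Running `sample_deltas(10)` on a head or worker pod during an unhealthy period shows which interface carries the outbound bytes. This only measures volume; identifying the destination still needs flow logs or a packet capture.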

You may want to monitor your Serve application using the Ray Dashboard to get more insights into the network traffic. The Ray Dashboard provides information about Serve application deployment events and health checks.

If you’re still having trouble identifying the source of the network traffic, you might find it helpful to check the RayService troubleshooting guide.