Running a Ray head node as an ECS service and expecting it to autoscale EC2 worker nodes is not a documented Ray deployment pattern. The Ray autoscaler expects to manage both head and worker nodes itself, typically via ray up or by running the head node on an EC2 instance with the correct IAM permissions and network access. When the head node runs in ECS, the autoscaler may lack the permissions or network visibility needed to launch and manage EC2 worker nodes, which can leave the cluster unable to come up, as you've observed. According to the Ray community, this approach is not supported or recommended, and similar issues have been reported when deviating from the documented cluster launcher workflows (discuss.ray.io, github.com/ray-project/ray/issues/39565).
For debugging, inspect the autoscaler logs on the head node (e.g., /tmp/ray/session_latest/logs/monitor*, cluster_monitor.log, and cluster_monitor.err) to see what the autoscaler is attempting. The ray cluster-dump command can also collect logs and debug state, though if the autoscaler cannot launch or reach workers at all, these logs may not be very informative. The recommended approach is to run the standard Ray cluster launcher (ray up) against AWS EC2 directly, not ECS, so the autoscaler can manage the full lifecycle of both head and worker nodes (discuss.ray.io). Would you like more detail on how to set up Ray clusters on AWS using the supported methods?
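As a starting point for the supported workflow, here is a sketch of a minimal AWS cluster config for ray up. The cluster name, region, and instance types are placeholders; adjust them (and add an AMI, IAM profile, etc. as needed) for your account before launching.

```yaml
# cluster.yaml -- hypothetical minimal Ray cluster launcher config for AWS.
# All values below are example placeholders, not recommendations.
cluster_name: my-cluster

provider:
  type: aws
  region: us-west-2          # placeholder region

auth:
  ssh_user: ubuntu           # user must match the AMI you choose

max_workers: 4               # autoscaler upper bound on worker nodes

available_node_types:
  head:
    node_config:
      InstanceType: m5.large
  worker:
    min_workers: 0           # autoscaler scales workers up from zero
    node_config:
      InstanceType: m5.large

head_node_type: head
```

Launching this with ray up cluster.yaml puts the head node on EC2 with the autoscaler in control of worker instances, avoiding the permission and network-visibility issues of an ECS-hosted head node.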