Running the head node as an ECS service

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.49
  • Python version: 3.12
  • OS: AWS Linux
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant): ECS on EC2

3. What we are trying to do:

We are investigating whether it’s possible to run a Ray cluster where the head node is spun up as an ECS service, which then auto-scales workers as usual by creating EC2 instances.

Instead of using ray up, we are trying to start the head-node task with:

ray start --head --dashboard-host=0.0.0.0 --port=6379 --dashboard-port=80 --disable-usage-stats --autoscaling-config ray_bootstrap_config.yaml

We pulled the ray_bootstrap_config.yaml from one of our other clusters and modified it slightly to match the environment we are targeting.
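
For reference, the autoscaler expects that file to follow the standard cluster-launcher schema. A minimal sketch of the shape ours takes (instance type, SSH user, and worker counts here are illustrative placeholders, not our actual values):

cluster_name: ecs-head-test
max_workers: 4
provider:
  type: aws
  region: eu-west-1
auth:
  ssh_user: ubuntu        # placeholder; used by the autoscaler to set up workers over SSH
available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.large   # placeholder
  ray.worker.default:
    min_workers: 0
    max_workers: 4
    node_config:
      InstanceType: m5.large   # placeholder
head_node_type: ray.head.default

As far as we can tell, the head node entry is still required by the schema even though, in our setup, the head is the ECS task rather than an EC2 instance the autoscaler launched.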

4. What is happening:
Not much… Everything starts without errors, but the cluster never comes up. I can see the following:

$ ray status:

No cluster status. It may take a few seconds for the Ray internal services to start up.

$ ray cluster-dump ray_bootstrap_config.yaml:

2025-10-30 08:21:47,624 WARN commands.py:1569 -- You are about to create a cluster dump. This will collect data from cluster nodes.

The dump will contain this information:

  - The logfiles of your Ray session
    This usually includes Python outputs (stdout/stderr)
  - Debug state information on your Ray cluster
    e.g. number of workers, drivers, objects, etc.
  - Your installed Python packages (`pip freeze`)
  - Information on your running Ray processes
    This includes command line arguments

If you are concerned about leaking private information, extract the archive and inspect its contents before sharing it with anyone.
2025-10-30 08:21:47,624 INFO cluster_dump.py:563 -- Retrieving cluster information from ray cluster file: ray_bootstrap_config.yaml
2025-10-30 08:21:47,912 INFO commands.py:389 -- Checking AWS environment settings
2025-10-30 08:21:47,913 VINFO utils.py:149 -- Creating AWS resource `ec2` in `eu-west-1`

And that’s basically the state it stays in until I kill it.

5. Questions:

  1. How can I get more insight into what the autoscaler is trying to do? Are there other logs I can inspect, or documentation on something similar I can look into? Any tips?
  2. Is this even an approach worth pursuing? I can spend more time trying to make this work, but if the hive-mind tells me it’s a stupid idea I will go back to the drawing board :slight_smile:

Thx!

Running a Ray head node as an ECS service and expecting it to autoscale EC2 worker nodes is not a standard or documented Ray deployment pattern. The Ray autoscaler expects to manage both head and worker nodes, typically via ray up or by running the head node on an EC2 instance with the correct IAM permissions and network access. When running the head node in ECS, the autoscaler may not have the necessary permissions or network visibility to launch and manage EC2 worker nodes, which can result in the cluster not coming up as you’ve observed. According to the Ray community, this approach is not commonly supported or recommended, and similar issues have been reported when deviating from the documented cluster launcher workflows (discuss.ray.io, github.com/ray-project/ray/issues/39565).
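
For comparison, the documented workflow drives the whole node lifecycle from outside the cluster with the launcher CLI (file name illustrative):

$ ray up ray_bootstrap_config.yaml      # provisions the head EC2 instance and starts the autoscaler on it
$ ray attach ray_bootstrap_config.yaml  # opens an SSH session on the head node
$ ray down ray_bootstrap_config.yaml    # terminates head and worker nodes

In that workflow the head node is itself an EC2 instance described in the config, so the autoscaler’s assumptions about the node provider, IAM role, and networking hold by construction.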

For debugging, you can inspect logs on the head node (e.g., /tmp/ray/session_latest/logs/monitor*, cluster_monitor.log, and cluster_monitor.err) to see what the autoscaler is attempting. The ray cluster-dump command can also collect logs and debug state, but if the autoscaler is not able to launch or connect to workers, these logs may not be very informative. The recommended approach is to use the standard Ray cluster launcher (ray up) on an EC2 instance, not ECS, to ensure the autoscaler can manage the full lifecycle of both head and worker nodes (discuss.ray.io). Would you like more detail on how to set up Ray clusters on AWS using the supported methods?
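
If you want to watch the autoscaler directly from inside the head container, tailing the monitor logs is usually the quickest check (paths assume the default /tmp/ray session location):

$ tail -f /tmp/ray/session_latest/logs/monitor.log  # autoscaler decisions and node launch attempts
$ tail -f /tmp/ray/session_latest/logs/monitor.err  # tracebacks, e.g. from failed EC2 API calls

If monitor.log stays empty, the autoscaler process likely never started, which would point at the --autoscaling-config flag or the config file itself rather than at AWS permissions.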

Hint: Mention @RunLLM in the post for followups.