AWS bills can escalate quickly as applications scale. Reserved Instances and Savings Plans offer discounts, but they require long-term commitments. For services that scale dynamically, Spot capacity is a powerful alternative: AWS advertises discounts of up to 90% for EC2 Spot Instances and up to 70% for Fargate Spot, compared to On-Demand pricing.

However, running Spot in production comes with a major catch: AWS can reclaim the capacity at any time, with only a two-minute warning. In this guide, we'll design a hybrid ECS capacity provider strategy that blends On-Demand and Spot capacity to maintain uptime while reducing costs.

Cost-Efficiency Goal: Run baseline critical services on On-Demand, and direct most dynamic horizontal scaling capacity (three out of every four tasks above the baseline) to Spot.

The Strategy: Capacity Providers

ECS Capacity Providers allow us to define rules for how tasks are placed. Each entry in a capacity provider strategy takes two settings:

  • Base: The minimum number of tasks that must run on a capacity provider. Only one provider in a strategy can have a base. We'll set a base of 2 on On-Demand so that the first two tasks always land on On-Demand capacity and keep running even if Spot capacity is fully reclaimed by AWS.
  • Weight: The relative proportion of tasks launched on each provider once the base is satisfied. We will use a 1:3 ratio (1 On-Demand for every 3 Spot tasks).
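
To make the base and weight arithmetic concrete, here is a small sketch using Terraform locals. The local names and the task count of 10 are illustrative assumptions, and the calculation only covers a count that divides evenly by the weights; how ECS rounds in other cases is not modeled here.

```hcl
# Illustrative arithmetic only: these locals are hypothetical and are not
# read by ECS. They mirror how base and weight divide a task count.
locals {
  desired_count    = 10 # assumed task count for the example
  on_demand_base   = 2
  on_demand_weight = 1
  spot_weight      = 3
  total_weight     = local.on_demand_weight + local.spot_weight

  # Tasks left to distribute once the base is placed: 10 - 2 = 8
  above_base = local.desired_count - local.on_demand_base

  # Split the remainder by weight: 8 * 1/4 = 2 and 8 * 3/4 = 6
  on_demand_tasks = local.on_demand_base + local.above_base * local.on_demand_weight / local.total_weight
  spot_tasks      = local.above_base * local.spot_weight / local.total_weight
}

output "task_split" {
  value = {
    on_demand = local.on_demand_tasks # 4 (2 base + 2 weighted)
    spot      = local.spot_tasks      # 6
  }
}
```

At 10 tasks, the split is therefore 4 On-Demand and 6 Spot, so a full Spot reclaim would still leave 4 tasks serving traffic.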

Step 1: Terraform Capacity Provider Configuration

Let's codify this strategy using Terraform. We'll configure an ECS cluster that utilizes both FARGATE and FARGATE_SPOT capacity providers.

ecs.tf (Cluster Configuration)
# Create the main ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "production-cluster"
}

# Associate Capacity Providers with the Cluster
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2
    weight            = 1
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    base              = 0
    weight            = 3
  }
}

Step 2: Assigning Strategy to ECS Service

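When launching our application service, we can either inherit the cluster's default capacity provider strategy or set one on the service itself. Here we declare it explicitly on the service, which overrides the cluster default and keeps the placement rules visible next to the service. Note that launch_type must not be set when a capacity provider strategy is used. As the service scales out (e.g. from 2 to 10 tasks), new tasks are placed according to these rules.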

service.tf (Service Definition)
resource "aws_ecs_service" "api" {
  name            = "production-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 8

  network_configuration {
    subnets         = ["subnet-xxxx", "subnet-yyyy"]
    security_groups = ["sg-xxxx"]
  }

  # Service-level strategy; overrides the cluster's default strategy
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2
    weight            = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    base              = 0
    weight            = 3
  }

  # Create the service only after the capacity providers are associated
  # with the cluster; otherwise the strategy can be rejected on first apply
  depends_on = [aws_ecs_cluster_capacity_providers.main]
}
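
The scale-out from 2 to 10 tasks mentioned above has to be driven by something. One common option is Application Auto Scaling with a target tracking policy; the sketch below assumes that approach. The resource names, the 2 to 10 bounds, and the 60% CPU target are illustrative assumptions, not values from our production setup.

```hcl
# Hypothetical scaling bounds; tune min/max and the CPU target for your workload.
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "production-api-cpu" # hypothetical name
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  resource_id        = aws_appautoscaling_target.api.resource_id

  target_tracking_scaling_policy_configuration {
    target_value = 60 # assumed average CPU utilization target, in percent

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

If autoscaling manages the task count, consider adding lifecycle { ignore_changes = [desired_count] } to the service so that Terraform does not reset the count on every apply.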

Handling Interruption: Graceful Shutdowns

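Because Spot capacity can be reclaimed at any time, your applications must be stateless and handle shutdowns gracefully. When AWS reclaims the capacity behind a Fargate Spot task, the two-minute warning arrives in two forms: a task state change event sent to EventBridge and a SIGTERM signal sent to the running task.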
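
Spot interruptions can also be observed from outside the task, which helps when measuring how often they occur. The sketch below assumes an EventBridge rule that matches ECS task state change events carrying the SpotInterruption stop code and forwards them to an SNS topic. The rule and topic names are hypothetical.

```hcl
# Match ECS task state change events whose stop code indicates a Spot interruption.
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "ecs-fargate-spot-interruption" # hypothetical name
  description = "Fargate Spot tasks stopped by a Spot interruption"

  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["ECS Task State Change"]
    detail = {
      clusterArn = [aws_ecs_cluster.main.arn]
      stopCode   = ["SpotInterruption"]
    }
  })
}

# Hypothetical SNS topic used as the notification target.
resource "aws_sns_topic" "spot_alerts" {
  name = "ecs-spot-interruptions"
}

resource "aws_cloudwatch_event_target" "spot_to_sns" {
  rule = aws_cloudwatch_event_rule.spot_interruption.name
  arn  = aws_sns_topic.spot_alerts.arn
}
```

The topic also needs a resource policy that allows events.amazonaws.com to publish to it; that policy is omitted here for brevity.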

To ensure active connections aren't dropped, configure these settings on your load balancer target group and in your container definition:

  1. Lower the deregistration delay: Set your target group's deregistration_delay.timeout_seconds to 30-60 seconds (the default is 300). Once a task starts deregistering, the load balancer stops sending it new requests and waits up to this long for in-flight requests to complete, so the value has to fit inside the two-minute interruption window.
  2. Configure container stop timeout: Set stopTimeout in your ECS container definition explicitly, for example to 30 seconds (this is also the default; Fargate allows up to 120 seconds). This is how long ECS waits after sending SIGTERM before it sends SIGKILL, so it bounds the time your application has to complete open requests, close database connections, and exit cleanly.
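
In Terraform, the two settings above could look roughly like the following sketch. The variables, the port 8080, and the CPU and memory sizes are assumptions made for illustration; only the deregistration_delay and stopTimeout values come from the steps above.

```hcl
variable "vpc_id" {
  type = string # hypothetical: the VPC that hosts the service
}

variable "api_image" {
  type = string # hypothetical: container image URI for the API
}

resource "aws_lb_target_group" "api" {
  name        = "production-api"
  port        = 8080 # assumed container port
  protocol    = "HTTP"
  target_type = "ip" # Fargate tasks use awsvpc networking, so targets are IPs
  vpc_id      = var.vpc_id

  # Wait up to 30 seconds for in-flight requests after deregistration
  deregistration_delay = 30
}

resource "aws_ecs_task_definition" "api" {
  family                   = "production-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256 # assumed task size
  memory                   = 512
  # Add execution_role_arn if the image is in ECR or logs go to CloudWatch

  container_definitions = jsonencode([
    {
      name         = "api"
      image        = var.api_image
      essential    = true
      stopTimeout  = 30 # seconds between SIGTERM and SIGKILL
      portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    }
  ])
}
```

Keeping the deregistration delay at or below the stop timeout means the load balancer finishes draining before the container can be killed.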

Result & Savings Metrics

By implementing this strategy on our API tier services, we saw the following outcome:

cost-audit-report.json
{
  "cluster": "production-cluster",
  "metrics": {
    "pre_migration_cost_monthly": "$4,200.00",
    "post_migration_cost_monthly": "$2,730.00",
    "savings_percentage": "35%",
    "interruption_replacement_avg_seconds": "42s",
    "interruption_dropped_requests": "0"
  }
}

Conclusion

Combining ECS Capacity Providers with Spot capacity allows you to scale cost-effectively. By automating the placement ratio with Terraform and structuring your services to exit gracefully on SIGTERM, you can cut compute costs substantially (35% in our case) without compromising production availability.

Review your scaling limits and apply this strategy to your dev, staging, and stateless production tiers!