Prometheus is the de facto open-source monitoring standard for cloud-native environments. If you work with AWS ECS, Kubernetes, or any distributed system, you will encounter it. In this blog, we tear down every component of the Prometheus architecture, visually and with real-world ECS context, so you walk away truly understanding how the pieces fit together.

In this blog, you will learn the following:

  • What is Prometheus?
  • Prometheus Server
  • Time-Series Database (TSDB)
  • Prometheus Targets
  • Prometheus Exporters
  • Prometheus Service Discovery
  • Prometheus Pushgateway
  • Prometheus Client Libraries
  • Prometheus AlertManager
  • PromQL

Prometheus Architecture: ECS Flow

The diagram below shows how all Prometheus components interact in a typical AWS ECS environment. Watch the data flows between components.

[Figure: Prometheus architecture in an ECS environment. Batch workloads (cron jobs and other short-lived jobs) push metrics to the Pushgateway. Service discovery (Consul, ECS SD, file SD) tells the Prometheus server which targets exist. The Prometheus server (metrics retrieval, TSDB on local storage, HTTP server) scrapes /metrics from ECS tasks (apps and exporters) and from the Pushgateway, and pushes alerts to AlertManager, which notifies email, Slack, and Teams. Grafana (web UI and API) queries the server with PromQL. Legend: ECS tasks; batch and exporters; service discovery / Grafana; Prometheus core; AlertManager.]

1. What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud, now a graduated CNCF project. It is written in Go and follows a pull-based metrics collection model: instead of agents pushing data to a central server, Prometheus actively scrapes (HTTP GET) the /metrics endpoint of each target at a regular interval.

Key idea: Prometheus stores everything as time-series data, that is, streams of timestamped values identified by a metric name and key-value labels, e.g. http_requests_total{method="GET", status="200"}.
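To make the data model concrete, here is a minimal Python sketch of how a metric name plus a label set identifies one series. The `store` dictionary and the `series_key` and `append` helpers are illustrative names invented for this example; they are not how Prometheus itself is implemented.

```python
from collections import defaultdict

def series_key(name, labels):
    """Build an order-independent identity from a metric name and its labels."""
    return (name, tuple(sorted(labels.items())))

# Toy in-memory store: each series maps to a list of (timestamp, value) samples.
store = defaultdict(list)

def append(name, labels, timestamp, value):
    store[series_key(name, labels)].append((timestamp, value))

# Same name and labels (in any order) -> same series.
append("http_requests_total", {"method": "GET", "status": "200"}, 1700000000, 10.0)
append("http_requests_total", {"status": "200", "method": "GET"}, 1700000015, 12.0)
# A different label value -> a different series.
append("http_requests_total", {"method": "GET", "status": "500"}, 1700000015, 1.0)

print(len(store), "distinct series")
```

The practical consequence is cardinality: every new label value creates a new series, so labels with unbounded values (user IDs, request IDs) are usually avoided.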

2. Prometheus Server

The Prometheus server is the heart of the system. It has three internal sub-components that work together:

  • Retrieval Engine: Performs HTTP scrapes against all configured targets at a defined interval (scrape_interval, which defaults to 1 minute; 15 seconds is a common setting). Handles service discovery, relabelling, and authentication.
  • TSDB Storage: Stores scraped metrics in a custom time-series database on local disk. Uses a WAL (Write-Ahead Log) for durability and 2-hour blocks for efficient range reads.
  • HTTP API Server: Exposes the PromQL query interface on port 9090. Grafana and other tools use this endpoint to run queries and render dashboards.
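As a rough illustration of what a client such as Grafana sends to the HTTP API, the sketch below builds the URL for an instant query against the /api/v1/query endpoint. The host name `prometheus:9090` and the helper name `instant_query_url` are assumptions made for this example, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode

def instant_query_url(base_url, promql, at=None):
    """Build the URL for an instant query against /api/v1/query."""
    params = {"query": promql}
    if at is not None:
        params["time"] = str(at)  # optional evaluation timestamp (Unix seconds)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"

# Hypothetical server address; the PromQL expression is URL-encoded.
url = instant_query_url("http://prometheus:9090", 'up{job="ecs-tasks"}')
print(url)
```

Fetching that URL (for example with urllib.request.urlopen) returns a JSON document whose data field holds the matching series.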

3. Time-Series Database (TSDB)

Prometheus's built-in TSDB is purpose-built for time-series workloads. Unlike a relational database, it groups samples into 2-hour blocks, which are later compacted into larger blocks to save disk space.

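  1. Ingestion: Scraped samples are appended to the in-memory head block and, for durability, to the on-disk WAL. Completed head chunks are memory-mapped from disk.
  2. Compaction: Roughly every 2 hours, the head is persisted to disk as an immutable block directory with an index for fast lookups; older blocks are later merged into larger ones.
  3. Retention: Default 15 days (configurable with --storage.tsdb.retention.time). For long-term storage, use the remote-write API to push to Thanos or Cortex.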
prometheus data directory layout
# /prometheus/data/
.
├── 01H8G3...  # 2-hour block (immutable)
│   ├── chunks/
│   │   └── 000001
│   ├── index
│   └── meta.json
├── 01H8G4...  # another block
├── wal/        # Write-Ahead Log (current)
│   ├── 00000001
│   └── checkpoint.000005/
└── lock
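The block and retention arithmetic can be sketched in a few lines of Python. This is a simplification under two stated assumptions: blocks are aligned to multiples of two hours since the Unix epoch, and a block is dropped once its whole time range is older than the retention window. The function names are illustrative, not Prometheus internals.

```python
BLOCK_SECONDS = 2 * 60 * 60             # block range: 2 hours
RETENTION_SECONDS = 15 * 24 * 60 * 60   # default retention: 15 days

def block_range(timestamp):
    """Return the [start, end) range of the 2-hour block holding a sample."""
    start = timestamp - timestamp % BLOCK_SECONDS
    return start, start + BLOCK_SECONDS

def is_expired(block_end, now):
    """True once the block lies entirely outside the retention window."""
    return block_end <= now - RETENTION_SECONDS

sample_ts = 1_700_000_000
start, end = block_range(sample_ts)
print(start, end, is_expired(end, now=sample_ts + RETENTION_SECONDS))
```

Because whole blocks are deleted rather than individual samples, data can linger slightly longer than the nominal retention period.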

4. Prometheus Targets

A target is any endpoint that exposes metrics in the Prometheus text format at a /metrics HTTP path. In an ECS environment, targets can be:

  • ECS Tasks: your microservices instrumented with a client library
  • Node Exporter: running as a sidecar or daemon on the EC2 host
  • cAdvisor: container-level CPU, memory, and network metrics
  • Custom exporters: e.g., the ECS task metadata endpoint wrapped as a Prometheus target
  • AWS services: via yet-another-cloudwatch-exporter (YACE)
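To show what a scrape actually returns, here is a small Python sketch that parses a /metrics payload in the text format into (name, labels, value) tuples. It is deliberately simplified: it ignores optional timestamps, does not unescape label values, and skips HELP/TYPE comments, so treat it as a reading aid rather than a full parser.

```python
import re

# metric_name{optional labels} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_metrics(text):
    """Parse simple text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

SAMPLE_PAYLOAD = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
process_open_fds 12
"""

for sample in parse_metrics(SAMPLE_PAYLOAD):
    print(sample)
```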

5. Prometheus Exporters

Exporters are adapter processes that translate metrics from systems that don't natively speak the Prometheus format into something Prometheus can scrape. They run alongside the target and expose a /metrics endpoint.

  • node_exporter: Hardware and OS metrics from Linux hosts. CPU, memory, disk I/O, network. The standard choice for monitoring the EC2 hosts behind ECS.
  • cAdvisor: Container resource usage per task in ECS. Runs as a daemon or sidecar. Reports per-container CPU/memory/network.
  • mysqld_exporter: MySQL / RDS performance metrics: queries/sec, connections, InnoDB buffer pool hit rate, replication lag.
  • YACE (CloudWatch): Bridges AWS CloudWatch metrics (ALB, RDS, SQS, DynamoDB) into Prometheus. Ideal for full ECS stack visibility.
  • redis_exporter: Redis / ElastiCache metrics: connected clients, used memory, hit/miss ratio, replication offset.
  • blackbox_exporter: Probes external endpoints over HTTP, HTTPS, DNS, TCP. Great for uptime monitoring and SSL expiry alerting.
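The adapter idea behind every exporter can be sketched with the Python standard library alone. In this toy example, `read_backend_stats` stands in for whatever native stats interface the monitored system offers (here a made-up "key:value" text dump), and the `demo` metric prefix and port 9200 are arbitrary choices for illustration. Real exporters do considerably more, such as typing counters correctly and attaching labels.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_backend_stats():
    """Hypothetical stand-in for querying the monitored system."""
    return "connected_clients:4\nused_memory:1048576\nrole:master\n"

def to_exposition(raw_stats, prefix="demo"):
    """Translate 'key:value' lines into Prometheus text format gauges."""
    lines = []
    for row in raw_stats.splitlines():
        key, sep, value = row.partition(":")
        if not sep:
            continue
        try:
            number = float(value)
        except ValueError:
            continue  # skip non-numeric fields such as role:master
        name = f"{prefix}_{key.strip()}"
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {number}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = to_exposition(read_backend_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=9200):
    """Start the toy exporter; blocks until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()

print(to_exposition(read_backend_stats()))
```

Calling serve() would make the translated metrics scrapeable at /metrics on the chosen port; it is left uncalled here so the snippet finishes immediately.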

6. Prometheus Service Discovery

Hard-coding scrape targets is not scalable. Service Discovery (SD) lets Prometheus automatically find and track targets as they come and go, which is critical in dynamic ECS environments where tasks start and stop frequently.

prometheus.yml: ECS SD configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:

  # File-based SD: ECS task IPs written by deployment script
  - job_name: 'ecs-tasks'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/ecs-*.json']
        refresh_interval: 30s

  # DNS SD for ECS Service Connect / Cloud Map
  - job_name: 'ecs-service-connect'
    dns_sd_configs:
      - names: ['_metrics._tcp.my-svc.local']
        type: SRV

  # Static targets: node_exporter on EC2 hosts
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100']
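The file-based job above expects JSON files containing a list of target groups, each with a "targets" list and optional "labels". The Python sketch below shows one way a deployment script might produce that content. The shape of the `tasks` input, the `ecs_service` label name, and port 8000 are assumptions made for this example; in practice the task list would come from the ECS API.

```python
import json

def file_sd_entries(tasks, port=8000):
    """Group task IPs by service into file_sd target groups."""
    groups = {}
    for task in tasks:
        groups.setdefault(task["service"], []).append(f'{task["ip"]}:{port}')
    return [
        {"targets": sorted(targets), "labels": {"ecs_service": service}}
        for service, targets in sorted(groups.items())
    ]

# Hypothetical running tasks, as a deployment script might have collected them.
tasks = [
    {"service": "orders", "ip": "10.0.1.21"},
    {"service": "orders", "ip": "10.0.1.22"},
    {"service": "billing", "ip": "10.0.2.5"},
]

payload = json.dumps(file_sd_entries(tasks), indent=2)
print(payload)  # write this to e.g. /etc/prometheus/targets/ecs-orders.json
```

Writing to a temporary file and renaming it into place is a common precaution, so that Prometheus never reads a half-written file.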

7. Prometheus Pushgateway

The Pushgateway solves a specific problem: short-lived jobs (ECS batch tasks, cron jobs, one-off tasks) that finish before Prometheus gets a chance to scrape them. These jobs push their metrics to the Pushgateway, which Prometheus then scrapes like any other target.

When to use: ECS Fargate batch jobs, nightly ETL pipelines, data export tasks. Do NOT use it as a general-purpose proxy: pushed metrics never go stale on their own, and you lose the per-target up health signal that the pull model provides.

push-metrics.sh: push from ECS batch task
#!/bin/bash
PUSHGATEWAY="http://pushgateway:9091"
JOB="nightly_etl"

cat <<EOF | curl --data-binary @- "${PUSHGATEWAY}/metrics/job/${JOB}"
# HELP etl_duration_seconds Time taken for the ETL job
# TYPE etl_duration_seconds gauge
etl_duration_seconds 142
# HELP etl_records_processed Total records processed
# TYPE etl_records_processed counter
etl_records_processed 48503
EOF
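For batch tasks written in Python, the same push can be sketched with the standard library. The example below only builds the request, mirroring the POST that curl --data-binary performs in the script above. The helper name `build_push_request` is invented for illustration, and every metric is declared as a gauge to keep the sketch short.

```python
from urllib.parse import quote
from urllib.request import Request

def build_push_request(gateway, job, metrics):
    """Build a POST to the Pushgateway's /metrics/job/<job> endpoint."""
    body = "".join(
        f"# TYPE {name} gauge\n{name} {value}\n" for name, value in metrics.items()
    )
    return Request(
        url=f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )

request = build_push_request(
    "http://pushgateway:9091", "nightly_etl", {"etl_duration_seconds": 142}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen at the end of the job would perform the push.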

8. Prometheus Client Libraries

Client libraries let you instrument your application code directly, exposing custom business metrics alongside default runtime metrics.

  • Go: github.com/prometheus/client_golang, the reference implementation. A natural fit for Go microservices in ECS.
  • Python: prometheus_client, well suited to ML model servers, Flask APIs, and Django apps running in ECS Fargate.
  • Java / JVM: Micrometer with the Prometheus registry. Spring Boot Actuator serves /actuator/prometheus once the registry dependency is on the classpath and the endpoint is exposed.
  • Node.js: prom-client, widely used with Express.js. Its collectDefaultMetrics() helper exposes default Node.js metrics (event loop lag, GC, heap).
app.py: Python client instrumentation
from prometheus_client import Counter, Histogram, start_http_server
import time

# Counter: total requests, partitioned by method, endpoint, and status
REQUEST_COUNT = Counter(
    'http_requests_total', 'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
# Histogram: request latency in seconds, bucketed per endpoint
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds', 'Request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)

def handle_request(method, endpoint):
    start = time.perf_counter()
    # ... process request ...
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status="200").inc()
    REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on :8000 from a background thread
    while True:  # keep the process alive so Prometheus can scrape it
        handle_request("GET", "/")
        time.sleep(1)

9. Prometheus AlertManager

AlertManager handles alerts sent by the Prometheus server. It is responsible for deduplication, grouping, routing, silencing, and inhibition before dispatching notifications.

  1. Alerting Rules: Defined in a rule file (alerting-rules.yml in the example below). Prometheus evaluates these every evaluation_interval. When an expression stays true for at least the rule's for duration, the alert fires.
  2. Routing Tree: Routes each alert to the right receiver (Slack, PagerDuty, OpsGenie, email) based on label matchers.
  3. Grouping: Batches related alerts to avoid alert storms. If 20 ECS tasks fail at once, you get 1 grouped notification, not 20.
  4. Silences & Inhibition: Silence alerts during maintenance windows. Inhibit low-priority alerts when a critical one is already firing.
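The grouping step can be illustrated with a short Python sketch: alerts that share the same values for the configured group-by labels land in one notification. The alert dictionaries and the `group_alerts` helper are simplified stand-ins invented for this example, and the label names are assumptions.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((label, alert["labels"].get(label, "")) for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Twenty hypothetical task failures in the same cluster.
alerts = [
    {"labels": {"alertname": "ECSTaskDown", "cluster": "prod",
                "instance": f"10.0.1.{i}:8000"}}
    for i in range(20)
]

groups = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in groups.items():
    print(dict(key), "->", len(members), "alerts in one notification")
```

Grouping by instance instead would produce twenty separate notifications, which is why the choice of group-by labels matters.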
alerting-rules.yml: ECS production rules
groups:
  - name: ecs-alerts
    rules:

      # Alert when an ECS container uses more than 90% of one CPU core for 5 minutes
      - alert: ECSHighCPU
        expr: rate(container_cpu_usage_seconds_total[5m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ECS task {{ $labels.container_name }} high CPU"
          description: "CPU at {{ $value }}%"

      # Alert when error rate > 5%
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "HTTP error rate above 5%: {{ $value | printf \"%.1f\" }}%"
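The effect of the for clause in these rules can be sketched as a small state machine: an alert is pending while its expression has been true for less than the hold duration, and firing once that duration has elapsed. The function below is an illustrative simplification written for this post, not the Prometheus rule evaluator, and it assumes one evaluation result per timestamp.

```python
def alert_states(evaluations, threshold, for_seconds):
    """Map (timestamp, value) evaluations to inactive/pending/firing states."""
    states = []
    active_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = timestamp          # condition just became true
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None                   # condition cleared: reset timer
            states.append("inactive")
    return states

# CPU % evaluated once a minute, with a rule of "> 90 for 2 minutes".
evaluations = [(0, 95), (60, 96), (120, 97), (180, 50), (240, 99)]
print(alert_states(evaluations, threshold=90, for_seconds=120))
```

A single dip below the threshold resets the timer, which is how the for clause filters out short spikes.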

10. PromQL: Prometheus Query Language

PromQL is a functional query language for selecting and aggregating time-series data. Every Grafana panel and alert expression uses PromQL under the hood.

The Four Metric Types

  • Counter: Monotonically increasing (it resets to zero only when the process restarts), e.g. http_requests_total. Query it through rate() or increase() rather than reading the raw value.
  • Gauge: Can go up and down, e.g. memory_usage_bytes, active_connections. Read the raw value directly.
  • Histogram: Samples observations into configurable buckets. Used for latency. Use with histogram_quantile().
  • Summary: Calculates quantiles on the client side. Those quantiles cannot be meaningfully aggregated across instances, so prefer Histogram.
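To see why histograms aggregate well, it helps to look at what they store: cumulative counts per upper bound (the le label), from which a quantile is estimated by linear interpolation inside the matching bucket. The Python sketch below mimics that idea in simplified form. It assumes the lowest bucket starts at 0 and returns the highest finite bound when the quantile falls in the +Inf bucket; the function names are reused for readability and this is not the server's implementation.

```python
import math

def cumulative_buckets(observations, bounds):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    edges = list(bounds) + [math.inf]
    return [(edge, sum(1 for o in observations if o <= edge)) for edge in edges]

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into +Inf
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

latencies = [0.005, 0.02, 0.2, 0.7, 3.0]
buckets = cumulative_buckets(latencies, [0.01, 0.05, 0.1, 0.5, 1.0])
print(buckets)
print(histogram_quantile(0.5, buckets))
```

Because bucket counts are plain counters, they can be summed across instances before the quantile is estimated, which is not possible with the precomputed quantiles of a Summary.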

Essential PromQL Cheat Sheet

promql-cheatsheet: ECS context
Use Case | PromQL Expression
HTTP request rate | rate(http_requests_total[5m])
5xx error rate (%) | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
p99 request latency | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
ECS container CPU (% of one core) | rate(container_cpu_usage_seconds_total[5m]) * 100
ECS container memory % | container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
Avg latency by service | avg by (job) (rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
Active ECS targets | count(up{job="ecs-tasks"} == 1) by (job)
Targets currently down | up == 0

PromQL Operators Quick Reference

  • rate(metric[5m]): per-second average rate of a counter over 5 minutes
  • irate(metric[5m]): instantaneous rate (last 2 samples), more responsive but spiky
  • increase(metric[1h]): total increase of a counter over 1 hour
  • sum by (label): aggregate, grouped by a label
  • avg without (instance): average, dropping the instance label
  • topk(5, metric): top 5 time-series by value
  • predict_linear(metric[1h], 3600): predict the value in 1 hour via linear regression
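The core of rate() can be sketched as follows: add up the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. This simplified Python version deliberately leaves out the extrapolation to the window boundaries that Prometheus applies, so its results will differ slightly from the real function. The name `simple_rate` is made up for this example.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            increase += value             # counter reset: it restarted from zero
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Scraped every 15s; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples))
```

Reading the raw counter here would show a misleading drop from 160 to 10, whereas the reset-aware rate stays steady across the restart.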

Architecture Summary: How it all flows

  1. ECS Tasks use client libraries to expose /metrics, or you run exporters (node_exporter, cAdvisor) alongside them.
  2. Service Discovery (file_sd, dns_sd, ECS SD) tells Prometheus the list of active targets. As tasks scale or restart, the list updates automatically.
  3. Prometheus Retrieval scrapes all targets at the configured scrape_interval (15s in the example config), relabels samples, and writes them to the TSDB.
  4. Short-lived ECS batch jobs push metrics to the Pushgateway, which Prometheus then scrapes like any other target.
  5. Alerting rules are evaluated by Prometheus. Fired alerts are sent to AlertManager, which deduplicates, groups, and routes them to Slack / PagerDuty.
  6. Grafana queries the Prometheus HTTP API using PromQL to power dashboards and panels in near real time.

Prometheus + Grafana + AlertManager is the golden observability stack for ECS environments. Master these three tools and you have visibility into every layer of your infrastructure, from EC2 host metrics all the way up to business-level request rates and SLOs.