Prometheus Concepts 🚀 — Dinesh Sandilyan

Prometheus is the de facto open-source monitoring standard for cloud-native environments. If you work with AWS ECS, Kubernetes, or any distributed system, you will encounter it. In this blog, we break down every component of the Prometheus architecture, visually and with real-world ECS context, so you walk away truly understanding how the pieces fit together.

In this blog, you will learn the following:

What is Prometheus?
Prometheus Server
Time-Series Database (TSDB)
Prometheus Targets
Prometheus Exporters
Prometheus Service Discovery
Prometheus Pushgateway
Prometheus Client Libraries
Prometheus AlertManager
PromQL

Prometheus Architecture — ECS Flow

The diagram shows how the Prometheus components interact in a typical AWS ECS environment and how data flows between them.

[Diagram: ECS Tasks, Batch & Exporters, Service Discovery / Grafana, Prometheus Core, AlertManager. Solid lines mark data flow; dashed lines mark logical relationships.]

1. What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud, now a graduated CNCF project. It is written in Go and follows a pull-based metrics collection model — instead of agents pushing data to a central server, Prometheus actively scrapes (HTTP GET) the /metrics endpoint of targets at a regular interval.

Key idea: Prometheus stores everything as time-series data — a stream of timestamped values identified by a metric name and key-value labels. e.g. http_requests_total{method="GET", status="200"}
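To make the data model concrete, here is a minimal Python sketch of the idea that a series is a metric name plus a label set with timestamped samples attached. It is an illustration only, not how Prometheus stores data internally; `SeriesStore`, `series_key`, and the sample values are hypothetical names and numbers chosen for this example.

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


class SeriesStore:
    """Toy in-memory store: one list of (timestamp, value) per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self._series[series_key(name, labels)].append((timestamp, value))

    def samples(self, name, labels):
        return list(self._series.get(series_key(name, labels), []))

    def series_count(self):
        return len(self._series)


store = SeriesStore()
store.append("http_requests_total", {"method": "GET", "status": "200"}, 1000, 41.0)
store.append("http_requests_total", {"status": "200", "method": "GET"}, 1015, 42.0)
store.append("http_requests_total", {"method": "GET", "status": "500"}, 1015, 1.0)

# Label order does not matter, so the first two appends land in the same series.
print(store.series_count())
```

Changing any label value (here, status) creates a separate series, which is why high-cardinality labels grow storage quickly.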

2. Prometheus Server

The Prometheus server is the heart of the system. It has three internal sub-components that work together:

🔍

Retrieval Engine

Performs HTTP scrapes against all configured targets at the configured scrape_interval (the built-in default is 1 minute; 15 seconds, as used in the config later in this post, is a common choice). Handles service discovery, relabelling, and authentication.

🗄️

TSDB Storage

Stores scraped metrics in a custom time-series database on local disk. Uses a WAL (Write-Ahead Log) for durability and organizes data into 2-hour blocks for efficient range reads.

🌐

HTTP API Server

Exposes a PromQL query interface on port :9090. Grafana and other tools use this endpoint to run queries and render dashboards.
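As a rough illustration of what a client such as Grafana sends to this API, the sketch below builds a URL for the instant-query endpoint, /api/v1/query. The helper name `instant_query_url` and the host `prometheus` are assumptions for this example; nothing is sent over the network.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, at=None):
    """Build a URL for the instant-query endpoint, /api/v1/query."""
    params = {"query": promql}
    if at is not None:
        params["time"] = str(at)  # optional Unix timestamp to evaluate at
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


url = instant_query_url("http://prometheus:9090", 'up{job="ecs-tasks"}')
print(url)
```

The PromQL expression has to be URL-encoded because braces, quotes, and equals signs are not safe in a query string.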

3. Time-Series Database (TSDB)

Prometheus's built-in TSDB is purpose-built for time-series workloads. Unlike a relational database, it groups samples into 2-hour blocks, which are later compacted into larger blocks to save disk space.

Ingestion: Scraped samples are appended to the in-memory head block and, for crash recovery, to the on-disk WAL.

Compaction: Roughly every 2 hours, the head is cut into an immutable on-disk block directory with its own index for fast lookups. Older blocks are then merged into larger ones in the background.

Retention: Default 15 days. For long-term storage, use the remote-write API to push to Thanos or Cortex.

prometheus data directory layout

# /prometheus/data/
.
├── 01H8G3...  # 2-hour block (immutable)
│   ├── chunks/
│   │   └── 000001
│   ├── index
│   └── meta.json
├── 01H8G4...  # another block
├── wal/        # Write-Ahead Log (current)
│   ├── 00000001
│   └── checkpoint.000005/
└── lock
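The 2-hour block windows and the 15-day retention described above come down to simple timestamp arithmetic. The sketch below is a simplification under stated assumptions (millisecond timestamps, time-based retention only, no size limit); the function names are hypothetical and it is not the actual TSDB code.

```python
BLOCK_RANGE_MS = 2 * 60 * 60 * 1000        # one 2-hour block, in milliseconds
RETENTION_MS = 15 * 24 * 60 * 60 * 1000    # default 15-day retention


def block_window(timestamp_ms):
    """Return the [start, end) 2-hour window that a sample falls into."""
    start = timestamp_ms - (timestamp_ms % BLOCK_RANGE_MS)
    return start, start + BLOCK_RANGE_MS


def is_expired(block_end_ms, now_ms):
    """Treat a block as deletable once it lies entirely beyond retention."""
    return block_end_ms < now_ms - RETENTION_MS


print(block_window(3 * BLOCK_RANGE_MS + 5))
```

Because whole blocks are dropped rather than individual samples, some data slightly older than the retention period can remain on disk until its block ages out.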

4. Prometheus Targets

A target is any endpoint that exposes metrics in the Prometheus text format at a /metrics HTTP path. In an ECS environment, targets can be:

ECS Tasks — your microservices instrumented with a client library
Node Exporter — running as a sidecar or daemon on the EC2 host
cAdvisor — container-level CPU, memory, and network metrics
Custom exporters — e.g., ECS task metadata endpoint wrapped as a Prometheus target
AWS services — via yet-another-cloudwatch-exporter (YACE)
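Whatever the target is, what Prometheus reads from it is plain text, one sample per line. The following sketch parses a small payload in that format to show its shape. It is deliberately incomplete (escape sequences in label values are not decoded), and `parse_exposition` and the sample payload are made up for this example.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse simple text-format lines into (name, labels, value) tuples.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


PAYLOAD = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
process_open_fds 12
"""

for sample in parse_exposition(PAYLOAD):
    print(sample)
```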

5. Prometheus Exporters

Exporters are adapter processes that translate metrics from systems that don't natively speak the Prometheus format into something Prometheus can scrape. They run alongside the target and expose a /metrics endpoint.

🖥️

node_exporter

Hardware and OS metrics from Linux hosts. CPU, memory, disk I/O, network — the gold standard for ECS EC2 host monitoring.

🐳

cAdvisor

Container resource usage per task in ECS. Runs as a daemon or sidecar. Reports per-container CPU/memory/network.

🗃️

mysqld_exporter

MySQL / RDS performance metrics: queries/sec, connections, InnoDB buffer pool hit rate, replication lag.

☁️

YACE (CloudWatch)

Bridges AWS CloudWatch metrics (ALB, RDS, SQS, DynamoDB) into Prometheus. Ideal for full ECS stack visibility.

🔴

redis_exporter

Redis / ElastiCache metrics: connected clients, used memory, hit/miss ratio, replication offset.

🌍

blackbox_exporter

Probes external endpoints over HTTP, HTTPS, DNS, TCP. Great for uptime monitoring and SSL expiry alerting.
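At its core, every exporter above does the same translation step: read numbers from a system in that system's own format and render them as Prometheus text. Here is a minimal sketch of that step, assuming the source system hands back a flat dictionary of numeric stats and that every value is a gauge. `render_metrics`, the `myapp` prefix, and the stat names are hypothetical.

```python
def render_metrics(stats, prefix="myapp"):
    """Translate a plain {name: number} dict into Prometheus text format."""
    lines = []
    for key in sorted(stats):
        metric = f"{prefix}_{key}"
        lines.append(f"# TYPE {metric} gauge")  # assumption: every stat is a gauge
        lines.append(f"{metric} {float(stats[key])}")
    return "\n".join(lines) + "\n"


print(render_metrics({"queue_depth": 12, "workers_busy": 3}), end="")
```

A real exporter would serve this text over HTTP at /metrics and would usually fetch fresh stats on every scrape rather than caching them.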

6. Prometheus Service Discovery

Hard-coding scrape targets is not scalable. Service Discovery (SD) lets Prometheus automatically find and track targets as they come and go — critical in dynamic ECS environments where tasks start and stop frequently.

prometheus.yml — ECS SD configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:

  # File-based SD — ECS task IPs written by deployment script
  - job_name: 'ecs-tasks'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/ecs-*.json']
        refresh_interval: 30s

  # DNS SD for ECS Service Connect / Cloud Map
  - job_name: 'ecs-service-connect'
    dns_sd_configs:
      - names: ['_metrics._tcp.my-svc.local']
        type: SRV

  # Static targets — node_exporter on EC2 hosts
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100']
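The file-based job above expects something to keep the JSON files under /etc/prometheus/targets/ up to date. A deployment script could do that along the following lines. This is a sketch: `write_file_sd` is a hypothetical helper, and looking up the task IPs (for example through the ECS API) is left out.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, labels):
    """Atomically write one target group in the file_sd JSON format."""
    groups = [{"targets": sorted(targets), "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    # Rename into place so Prometheus never reads a half-written file.
    os.replace(tmp_path, path)
    return groups
```

A call such as write_file_sd("/etc/prometheus/targets/ecs-web.json", ["10.0.2.14:8000"], {"service": "web"}) produces a file that matches the ecs-*.json pattern in the config. The temporary file ends in .tmp, so the pattern never picks it up.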

7. Prometheus Pushgateway

The Pushgateway solves a specific problem: short-lived jobs (ECS batch tasks, cron jobs, one-off tasks) that finish before Prometheus gets a chance to scrape them. These jobs push their metrics to the Pushgateway, which Prometheus then scrapes like any other target.

When to use: ECS Fargate batch jobs, nightly ETL pipelines, data export tasks. Do NOT use as a general-purpose proxy — it breaks stale data detection and the pull model.

push-metrics.sh — push from ECS batch task

#!/bin/bash
PUSHGATEWAY="http://pushgateway:9091"
JOB="nightly_etl"

cat <<EOF | curl --data-binary @- "${PUSHGATEWAY}/metrics/job/${JOB}"
# HELP etl_duration_seconds Time taken for the ETL job
# TYPE etl_duration_seconds gauge
etl_duration_seconds 142
# HELP etl_records_processed Total records processed
# TYPE etl_records_processed counter
etl_records_processed 48503
EOF
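If the batch task is written in Python, the same push can be made without shelling out to curl. The sketch below only builds the request, so it runs offline; `build_push_request` is a hypothetical helper, and it assumes a job name that needs no URL escaping.

```python
import urllib.request


def build_push_request(base_url, job, samples):
    """Build (but do not send) a POST request for the Pushgateway.

    `samples` maps metric name -> (metric type, value).
    """
    lines = []
    for name, (metric_type, value) in samples.items():
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    body = ("\n".join(lines) + "\n").encode("utf-8")  # body must end with a newline
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(url, data=body, method="POST")


request = build_push_request(
    "http://pushgateway:9091",
    "nightly_etl",
    {"etl_duration_seconds": ("gauge", 142), "etl_records_processed": ("counter", 48503)},
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.full_url)
```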

8. Prometheus Client Libraries

Client libraries let you instrument your application code directly — exposing custom business metrics alongside default runtime metrics.

🐹

Go

github.com/prometheus/client_golang: the official Go client library, a natural fit for Go microservices in ECS.

🐍

Python

prometheus_client — great for ML model servers, Flask APIs, and Django apps running in ECS Fargate.

☕

Java / JVM

Micrometer with the Prometheus registry. Spring Boot Actuator serves /actuator/prometheus once the micrometer-registry-prometheus dependency is on the classpath and the endpoint is exposed.

🟢

Node.js

prom-client, widely used with Express.js. Calling collectDefaultMetrics() adds default Node.js metrics (event loop lag, GC, heap).

app.py — python client instrumentation

from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_COUNT = Counter(
    'http_requests_total', 'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds', 'Request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)

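def handle_request(method, endpoint):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    # ... process request ...
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status="200").inc()
    REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on :8000 from a background thread
    while True:
        # Stand-in for real traffic; it also keeps the process alive, because
        # the metrics server runs in a daemon thread.
        handle_request("GET", "/health")
        time.sleep(1)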

9. Prometheus AlertManager

AlertManager handles alerts sent by the Prometheus server. It is responsible for deduplication, grouping, routing, silencing, and inhibition before dispatching notifications.

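Alerting Rules: Defined in a rules file (such as alerting-rules.yml below) that prometheus.yml lists under rule_files. Prometheus evaluates them every evaluation_interval. An alert is pending while its expression is true, and it starts firing once the expression has stayed true for the duration set in its for field.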

Routing Tree: Routes each alert to the right receiver (Slack, PagerDuty, OpsGenie, email) based on label matchers.

Grouping: Batches related alerts to avoid alert storms. If 20 ECS tasks fail at once, you get 1 grouped notification, not 20.

Silences & Inhibition: Silence alerts during maintenance windows. Inhibit low-priority alerts when a critical one is already firing.
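Grouping is the easiest of these behaviours to picture in code. The sketch below buckets alerts by a chosen set of labels, in the spirit of the group_by setting in an AlertManager route. The alert name `ECSTaskDown` and the label values are invented for this example, and real AlertManager adds timing controls that are not modelled here.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "ECSTaskDown", "cluster": "prod", "task": f"task-{i}"}}
    for i in range(20)
]
groups = group_alerts(alerts, group_by=["alertname", "cluster"])

# Twenty failing tasks collapse into one group, hence one notification.
print(len(groups))
```

Adding task to group_by would produce twenty groups again, so the choice of grouping labels decides how noisy the notifications are.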

alerting-rules.yml — ECS production rules

groups:
  - name: ecs-alerts
    rules:

      # Alert when a container uses more than 90% of one CPU core for 5 minutes
      - alert: ECSHighCPU
        expr: rate(container_cpu_usage_seconds_total[5m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ECS task {{ $labels.container_name }} high CPU"
          description: "CPU at {{ $value }}%"

      # Alert when error rate > 5%
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "HTTP error rate above 5%: {{ $value | printf \"%.1f\" }}%"
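The for field in the rules above is what separates a brief spike from a real alert. A small sketch of the resulting state machine follows, assuming one boolean result per rule evaluation at a fixed interval. `alert_state` is a hypothetical helper that simplifies what Prometheus does internally.

```python
def alert_state(condition_history, interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' after a run of evaluations.

    `condition_history` holds one boolean per evaluation, oldest first.
    """
    active_for = None  # seconds since the condition first became true
    for is_true in condition_history:
        if not is_true:
            active_for = None        # any false evaluation resets the timer
        elif active_for is None:
            active_for = 0           # first true evaluation: the alert goes pending
        else:
            active_for += interval_s
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= for_s else "pending"


# HighErrorRate uses for: 2m; with a 15s evaluation interval that is 120 seconds.
print(alert_state([True] * 8, interval_s=15, for_s=120))
print(alert_state([True] * 9, interval_s=15, for_s=120))
```

With these numbers the ninth consecutive true evaluation is the first one at which 120 seconds have elapsed, so the alert moves from pending to firing there.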

10. PromQL — Prometheus Query Language

PromQL is a functional query language for selecting and aggregating time-series data. Every Grafana panel and alert expression uses PromQL under the hood.

The Four Metric Types

📈

Counter

Monotonically increasing, resetting to zero only on restart. e.g. http_requests_total. Query it with rate() or increase() rather than reading the raw value.

📊

Gauge

Can go up and down. e.g. memory_usage_bytes, active_connections. Read the raw value directly.

📉

Histogram

Samples observations into configurable buckets. Used for latency. Use with histogram_quantile().

🎯

Summary

Calculates quantiles on the client side. Less flexible for cross-instance aggregation — prefer Histogram.
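To see why histograms aggregate well, it helps to look at how a quantile is estimated from bucket counts alone. The sketch below follows the general approach of histogram_quantile() for classic histograms, which is linear interpolation inside the bucket that contains the target rank. It is a simplified illustration with made-up bucket counts, not the PromQL implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                return prev_bound  # fall back to the highest known finite bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts: 50 requests took <= 0.1s, 90 took <= 0.5s, all 100 took <= 1s.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.75, BUCKETS))
```

Because the inputs are plain counts, buckets from many instances can be summed before the quantile is estimated, which is not possible with quantiles precomputed by a Summary.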

Essential PromQL Cheat Sheet

promql-cheatsheet — ECS context

Use Case	PromQL Expression
HTTP request rate	`rate(http_requests_total[5m])`
5xx error rate (%)	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100`
p99 request latency	`histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`
ECS container CPU %	`rate(container_cpu_usage_seconds_total[5m]) * 100`
ECS container memory %	`container_memory_usage_bytes / container_spec_memory_limit_bytes * 100`
Avg latency by service	`avg by (job) (rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))`
Active ECS targets	`count(up{job="ecs-tasks"} == 1) by (job)`
Targets currently down	`up == 0`

PromQL Functions and Operators Quick Reference

rate(metric[5m]) — per-second rate of a counter over 5 minutes
irate(metric[5m]) — instantaneous rate (last 2 samples) — more responsive but spiky
increase(metric[1h]) — total increase of a counter over 1 hour
sum by (label) — aggregate, grouped by a label
avg without (instance) — average, dropping the instance label
topk(5, metric) — top 5 time-series by value
predict_linear(metric[1h], 3600) — predict value in 1 hour via linear regression
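The reason rate() is preferred over subtracting two raw counter values is that it copes with counter resets. The following sketch shows that idea on a handful of samples. It is a simplification: unlike PromQL's rate(), it does not extrapolate to the edges of the range window, and `simple_rate` and the sample values are invented for this example.

```python
def simple_rate(samples):
    """Per-second rate of a counter from (timestamp_s, value) samples.

    Assumes at least two samples with strictly increasing timestamps.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count `curr` in full.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# The counter resets between t=15 and t=30 (for example, the ECS task restarted).
SAMPLES = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(SAMPLES))
```

A naive last-minus-first calculation on these samples would give a negative rate, while the reset-aware version counts 30 + 10 + 30 = 70 requests over 45 seconds.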

Architecture Summary — How it all flows

ECS Tasks use client libraries to expose /metrics, or you run exporters (node_exporter, cAdvisor) alongside them.

Service Discovery (file_sd, dns_sd, ECS SD) tells Prometheus the list of active targets. As tasks scale or restart, the list updates automatically.

Prometheus Retrieval scrapes all targets every 15s, relabels samples, and writes them to the TSDB.

Short-lived ECS batch jobs push metrics to the Pushgateway, which Prometheus then scrapes like any other target.

Alerting rules are evaluated by Prometheus. Firing alerts are sent to AlertManager, which deduplicates, groups, and routes them to Slack / PagerDuty.

Grafana queries the Prometheus HTTP API using PromQL to power dashboards and panels in real-time.

Prometheus + Grafana + AlertManager is the golden observability stack for ECS environments. Master these three tools and you have complete visibility into every layer of your infrastructure — from EC2 host metrics all the way up to business-level request rates and SLOs.