Build Your Own Docker-Based Observability Stack

article-01.11.2025

Hosted observability platforms are convenient until cost, data residency, or vendor lock-in become blockers. This article walks through a fully self-hosted monitoring stack deployed with Docker Compose on a single VPS. Every configuration file is reproduced below, along with architecture diagrams, operating tips, and lessons learned while migrating away from Grafana Cloud Agent.

Expect deep dives into each component - from Docker Compose wiring to Grafana provisioning - alongside commentary on why specific decisions serve a secure, host-terminated HTTPS setup. Bring a terminal and the willingness to inspect configs line by line. The narrative ties the moving pieces together so you can apply the pattern to your own workloads.

This article was produced roughly 95% with Codex, guided and reviewed by the author, who steered the narrative and the decisions. The full working setup took about 8 hours, combining the author's Linux and observability experience with the Codex CLI, which generated every artifact.


1. Architecture Overview

1.1 Topology

Level 1 – System Context

flowchart LR
  %% actors and systems
  Admin["Person<br/>Browser / Admin"]
  Host["System<br/>Host / VM / Node"]
  Obs["System<br/>Observability Platform"]
  App["System<br/>Monitoring Demo App"]
  %% flows
  Admin -->|"HTTPS 443"| Host
  Host -->|"HTTP 127.0.0.1:3000"| Obs
  App -->|"metrics<br/>logs<br/>traces"| Obs
  Host -->|"log files"| Obs

Level 2 – Container View (inside Observability)

%%{init: { "theme": "base", "flowchart": { "htmlLabels": true, "curve": "linear" }, "themeVariables": { "fontFamily": "\"JetBrains Mono\",\"Fira Code\",\"Segoe UI\",sans-serif", "fontSize": "18px", "primaryColor": "#ffffff", "primaryTextColor": "#111827", "primaryBorderColor": "#4b5563", "lineColor": "#4b5563", "clusterBkg": "#f3f4f6", "clusterBorder": "#d1d5db", "background": "#f9fafb" } }}%%
flowchart LR
  %% COLUMN 1: ENTRY / HOST
  subgraph Host["Host"]
    direction TB
    HLogs["Host logs<br/>/var/log/*.log"]
    HDocker["Docker Engine<br/>/var/run/docker.sock"]
    HNginx["Nginx<br/>TLS termination"]
  end
  Browser["Browser"]
  Browser -->|"HTTPS 443"| HNginx

  %% COLUMN 2: APPS & AGENTS (produce telemetry)
  subgraph Apps["Workload / Exporters"]
    direction TB
    Demo["monitoring-demo-app<br/>127.0.0.1:7005"]
    NodeExp["Node Exporter<br/>host metrics"]
    CAdv["cAdvisor<br/>container metrics"]
  end

  %% COLUMN 3: INGESTION & UI
  subgraph Ingest["Ingestion / UI"]
    direction TB
    Grafana["Grafana<br/>127.0.0.1:3000"]
    OTEL["OTel Collector<br/>ingestion gateway"]
    Promtail["Promtail<br/>log shipper"]
  end

  %% COLUMN 4: BACKENDS / STORAGE
  subgraph Backends["Backends"]
    direction TB
    Prom["Prometheus<br/>metrics TSDB"]
    Loki["Loki<br/>log store"]
    Tempo["Tempo<br/>traces"]
    TMem["Tempo Memcached<br/>trace index cache"]
  end

  %% Host → Ingestion (logs)
  HLogs -->|"static_configs"| Promtail
  HDocker -->|"docker_sd_configs"| Promtail
  %% Host → UI
  HNginx -->|"HTTP 127.0.0.1:3000"| Grafana
  %% Apps → Ingestion / Backends
  Demo -->|"OTLP traces"| OTEL
  Demo -->|"/metrics"| Prom
  Demo -->|"stdout / stderr"| Promtail
  NodeExp -->|"metrics"| Prom
  CAdv -->|"metrics"| Prom
  %% Ingestion → Backends
  Promtail -->|"push logs"| Loki
  OTEL --> Tempo
  Tempo --> TMem
  %% Grafana → Backends
  Grafana -->|"Dashboards"| Prom
  Grafana -->|"Explore logs"| Loki
  Grafana -->|"Explore traces"| Tempo

1.2 Telemetry Flow

%%{init: { "theme": "base", "sequence": { "mirrorActors": false, "rightAngles": true, "showSequenceNumbers": true } }}%%
sequenceDiagram
  participant Script as generate_demo_traffic.sh
  participant App as monitoring-demo-app
  participant Prom as Prometheus
  participant PTail as Promtail
  participant Loki as Loki
  participant Collector as OTel Collector
  participant Tempo as Tempo
  participant Grafana as Grafana
  Script->>App: HTTP calls<br/>/hello<br/>/work
  Note right of Script: Load generator
  Prom->>App: GET /metrics
  Note over Prom,App: Pull-based scrape every 15s
  App->>Collector: OTLP /v1/traces
  Collector->>Tempo: Export spans (gRPC)
  App->>PTail: stdout / stderr to container logs
  PTail->>Loki: Push enriched logs
  Grafana->>Prom: PromQL queries
  Grafana->>Loki: LogQL queries<br/>(container_name="monitoring-demo-app")
  Grafana->>Tempo: TraceQL queries

Key principles:

  • Grafana is the only service bound to localhost (127.0.0.1:3000); the host Nginx terminates TLS for https://monitoring.services.org.pl.
  • Everything else communicates over the private Docker network obs.
  • Persistent data lives under /opt/docker-volumes/<service>/... on the host.


2. Prerequisites

Before following the walkthrough, you should be comfortable with:

  • Administering Docker and Docker Compose on a Linux server (SSH, system packages, file permissions).
  • Core observability concepts—metrics, logs, traces—and how Grafana, Prometheus, Loki, and Tempo expose them.
  • Basic networking and TLS offload patterns so the host Nginx scenario feels familiar.

2.1 Install Docker & Compose (Ubuntu 22.04 example)

sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker "$USER"   # log out/in afterwards
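A quick sanity check after installation (open a new shell first so the docker group membership takes effect):

docker --version
docker compose version
docker run --rm hello-world   # pulls a tiny image and confirms the daemon works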

2.2 Prepare Persistent Directories

sudo mkdir -p \
  /opt/docker-volumes/grafana/data \
  /opt/docker-volumes/prometheus/data \
  /opt/docker-volumes/loki/data \
  /opt/docker-volumes/tempo/data \
  /opt/docker-volumes/promtail/positions

sudo chown -R 472:472 /opt/docker-volumes/grafana   # 472 is the grafana user inside the official image
sudo chmod 750 /opt/docker-volumes/grafana /opt/docker-volumes/grafana/data

2.3 Host Nginx & TLS

TLS termination happens on the host. A reference vhost:

server {
  listen 443 ssl http2;
  server_name monitoring.services.org.pl;

  ssl_certificate /etc/letsencrypt/live/monitoring.services.org.pl/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/monitoring.services.org.pl/privkey.pem;

  location / {
    proxy_pass http://127.0.0.1:3000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
  }
}

Certbot (--nginx) keeps certificates valid; no Nginx container is used.
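If the certificate does not exist yet, issuing it and testing renewal looks roughly like this (domain taken from the vhost above; assumes certbot with the Nginx plugin is installed):

sudo certbot --nginx -d monitoring.services.org.pl
sudo certbot renew --dry-run   # verifies the renewal path without touching live certificates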


3. Repository Layout

.
├── docker-compose.yml
├── grafana/
│   └── provisioning/
│       ├── dashboards/demo_app_overview.json
│       └── datasources/datasources.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/alerts.yml
├── loki/config.yaml
├── promtail/config.yml
├── tempo/tempo.yaml
├── otel-collector/config.yaml
├── app/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── app.py
│   └── generate_demo_traffic.sh
├── storage_usage.sh
└── docker_memory_usage.sh

Copy each snippet below into the matching path if you are recreating the stack from scratch.


4. Docker Compose Stack

docker-compose.yml

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31

services:
  grafana:
    image: grafana/grafana:12.2.1
    container_name: grafana
    restart: unless-stopped
    environment:
      GF_SERVER_DOMAIN: monitoring.services.org.pl
      GF_SERVER_ROOT_URL: https://monitoring.services.org.pl/
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - /opt/docker-volumes/grafana/data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    depends_on:
      - prometheus
      - loki
      - tempo
    networks:
      - obs

  prometheus:
    image: prom/prometheus:v3.7.3
    container_name: prometheus
    restart: unless-stopped
    user: "0"
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.enable-lifecycle
      - --web.enable-remote-write-receiver   # required for remote_write from the OTel Collector and Tempo
    expose:
      - "9090"
    volumes:
      - /opt/docker-volumes/prometheus/data:/prometheus
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
    networks:
      - obs

  loki:
    image: grafana/loki:3.5.7
    container_name: loki
    restart: unless-stopped
    user: "0"
    command:
      - -config.file=/etc/loki/config.yaml
    expose:
      - "3100"
    volumes:
      - /opt/docker-volumes/loki/data:/loki
      - ./loki/config.yaml:/etc/loki/config.yaml:ro
    networks:
      - obs

  tempo:
    image: grafana/tempo:2.9.0
    container_name: tempo
    restart: unless-stopped
    user: "0"
    command:
      - -config.file=/etc/tempo/tempo.yaml
    depends_on:
      - memcached
    expose:
      - "3200"
      - "4317"
      - "4318"
    volumes:
      - /opt/docker-volumes/tempo/data:/var/tempo
      - ./tempo/tempo.yaml:/etc/tempo/tempo.yaml:ro
    networks:
      - obs
      - tempo-cache

  memcached:
    image: memcached:1.6.33-alpine
    container_name: tempo-memcached
    restart: unless-stopped
    command:
      - -m
      - "256"
      - -p
      - "11211"
    expose:
      - "11211"
    networks:
      - tempo-cache

  otelcol:
    image: otel/opentelemetry-collector-contrib:0.138.0
    container_name: otelcol
    restart: unless-stopped
    command:
      - --config=/etc/otelcol/config.yaml
    expose:
      - "4317"
      - "4318"
      - "8888"
      - "55679"
    volumes:
      - ./otel-collector/config.yaml:/etc/otelcol/config.yaml:ro
    depends_on:
      - tempo
      - loki
      - prometheus
    networks:
      - obs

  promtail:
    image: grafana/promtail:3.5.7
    container_name: promtail
    restart: unless-stopped
    command:
      - -config.file=/etc/promtail/config.yml
      - -config.expand-env=true   # resolves ${HOSTNAME} in config.yml
    volumes:
      - /opt/docker-volumes/promtail/positions:/var/lib/promtail
      - ./promtail/config.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - loki
    networks:
      - obs

  node-exporter:
    image: prom/node-exporter:v1.10.2
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
      - --collector.filesystem.ignored-mount-points=^/(proc|sys|dev|host|etc)($|/)
      - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
      - --no-collector.ipvs
      - --no-collector.btrfs
      - --no-collector.infiniband
      - --no-collector.xfs
      - --no-collector.zfs
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    networks:
      - obs

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    expose:
      - "8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    networks:
      - obs

  monitoring-demo-app:
    build:
      context: ./app
    container_name: monitoring-demo-app
    restart: unless-stopped
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otelcol:4318
      OTEL_SERVICE_NAME: monitoring-demo-app
      OTEL_RESOURCE_ATTRIBUTES: deployment.environment=demo
    depends_on:
      - otelcol
    ports:
      - "127.0.0.1:7005:8000"
    networks:
      - obs

networks:
  obs:
    driver: bridge
  tempo-cache:
    driver: bridge

Highlights:

  • Grafana alone publishes a port to the host (localhost only); everything else uses expose.
  • memcached accelerates Tempo search.
  • monitoring-demo-app is a local test service exporting metrics, logs, and traces.
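A quick way to confirm that only Grafana (3000) and the demo app (7005) are reachable from the host, and only on loopback:

sudo ss -tlnp | grep -E ':3000|:7005'   # both sockets should show 127.0.0.1, never 0.0.0.0
docker compose ps                       # every other service should list no published ports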


5. Prometheus Metrics

prometheus/prometheus.yml

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - prometheus:9090

  - job_name: cadvisor
    static_configs:
      - targets:
          - cadvisor:8080

  - job_name: node-exporter
    static_configs:
      - targets:
          - node-exporter:9100

  - job_name: monitoring-demo-app
    metrics_path: /metrics
    static_configs:
      - targets:
          - monitoring-demo-app:8000

Alerting starter pack (prometheus/rules/alerts.yml):

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
groups:
  - name: infrastructure-health
    rules:
      - alert: PrometheusTargetMissing
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Target {{ $labels.job }} on {{ $labels.instance }} is down
          description: Prometheus has not scraped {{ $labels.job }} on {{ $labels.instance }} for over 5 minutes.

Retention: Prometheus keeps 15 days by default (no explicit --storage.tsdb.retention.time). Adjust the Compose command arguments if you need longer storage.
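For example, a 30-day window with a disk cap would mean appending two flags to the existing prometheus command list in docker-compose.yml (values are illustrative):

      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=10GB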


6. Logs with Promtail & Loki

promtail/config.yml

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system-logs
    pipeline_stages:
      - drop:
          older_than: 24h
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: ${HOSTNAME}
          __path__: /var/log/*.log

  - job_name: docker-containers
    pipeline_stages:
      - docker: {}
      - drop:
          older_than: 24h
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_id']
        target_label: '__path__'
        replacement: /var/lib/docker/containers/$1/$1-json.log
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container_name'
        regex: '/(.*)'
        replacement: '$1'
      - source_labels: ['__meta_docker_container_id']
        target_label: 'container_id'
      - source_labels: ['__meta_docker_container_image']
        target_label: 'container_image'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'service_name'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'compose_project'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'stream'
      - target_label: 'job'
        replacement: 'containers'
      - target_label: 'host'
        replacement: ${HOSTNAME}

Notes:

  • Promtail adds container_name and service_name labels, making it easy to filter by Compose service (e.g. {job="containers", container_name="monitoring-demo-app"}).
  • Journald ingestion was removed because this host keeps logs in memory-only /run/systemd/journal; tailing /var/log/*.log covers the important services without additional setup.
  • Loki stores data in /opt/docker-volumes/loki/data; retention is managed inside Loki's config (loki/config.yaml) and defaults to compactor-managed chunk pruning.
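A few LogQL starting points for Grafana Explore, assuming the labels produced by the relabeling above:

{job="containers", container_name="monitoring-demo-app"}   # everything the demo app logs
{job="containers"} |= "error"                               # error lines across all containers
{job="varlogs"} |~ "sshd|sudo"                              # auth activity from /var/log/*.log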


7. Traces with Tempo & OpenTelemetry Collector

tempo/tempo.yaml

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
server:
  http_listen_port: 3200
  log_level: info

cache:
  background:
    writeback_goroutines: 5
  caches:
    - roles:
        - frontend-search
      memcached:
        addresses: memcached:11211

query_frontend:
  metrics:
    max_duration: 200h
    query_backend_after: 5m
  search:
    duration_slo: 5s
    throughput_bytes_slo: 1073741824
  trace_by_id:
    duration_slo: 100ms

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

compactor:
  compaction:
    block_retention: 720h

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: docker-compose
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    local_blocks:
      filter_server_spans: false
      flush_to_storage: true

overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics
        - local-blocks

Tempo keeps 30 days (720h) of trace blocks locally. Memcached speeds up search queries; if you run on a tiny VPS you can shrink -m 256 in Compose.
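Once traces are flowing, a TraceQL query in Grafana Explore quickly isolates slow demo-app spans:

{ resource.service.name = "monitoring-demo-app" && duration > 100ms }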

OpenTelemetry Collector (otel-collector/config.yaml) receives OTLP traffic and fans out to Tempo plus Prometheus remote write:

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  batch:
    timeout: 5s
    send_batch_size: 8192
  resource:
    attributes:
      - action: upsert
        key: deployment.environment
        value: prod

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  debug:
    verbosity: basic

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions:
    - health_check
  telemetry:
    metrics:
      level: basic
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - resource
        - batch
      exporters:
        - otlp/tempo
    metrics:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - resource
        - batch
      exporters:
        - prometheusremotewrite
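
To confirm the collector is healthy without publishing any ports, you can run a throwaway curl container on the Compose network (the network name below assumes the project directory is named observability; check docker network ls):

docker run --rm --network observability_obs curlimages/curl -s http://otelcol:13133/
# the health_check extension replies with a small JSON status payload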

8. Grafana Provisioning

grafana/provisioning/datasources/datasources.yml

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
apiVersion: 1

datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: GET

  - uid: loki
    name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 5000

  - uid: tempo
    name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      httpMethod: GET
      serviceMapDatasourceUid: prometheus
      tracesToLogs:
        datasourceUid: loki
        mapTagNamesEnabled: true
        tags:
          - job
          - host
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - service.name

grafana/provisioning/dashboards/demo_app_overview.json

{
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 0.5
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "legend": {
          "displayMode": "list",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(demo_app_request_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "p95 latency",
          "refId": "A"
        }
      ],
      "title": "Demo App Request Duration (p95)",
      "type": "timeseries"
    }
  ],
  "schemaVersion": 38,
  "style": "dark",
  "tags": [
    "demo"
  ],
  "title": "Demo App Overview",
  "uid": "demo-app-overview"
}
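Grafana only loads JSON dashboards when a dashboard provider points at them. Since Compose mounts ./grafana/dashboards to /var/lib/grafana/dashboards, a provider file (path and name assumed here, e.g. grafana/provisioning/dashboards/provider.yml) would look roughly like this; point path at whichever directory your JSON files are actually mounted into:

apiVersion: 1

providers:
  - name: local-dashboards
    folder: Demo
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards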

9. Demo Application (monitoring-demo-app)

9.1 Dockerfile

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
FROM python:3.12-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --upgrade pip && pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

FROM python:3.12-slim

ENV PYTHONUNBUFFERED=1
WORKDIR /app

COPY --from=builder /wheels /wheels
COPY --from=builder /app/requirements.txt .
RUN pip install --no-cache-dir --find-links=/wheels -r requirements.txt

COPY app.py .

EXPOSE 8000
CMD ["python", "app.py"]

9.2 Dependencies

app/requirements.txt

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
flask==3.0.3
prometheus-client==0.20.0
opentelemetry-api==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-exporter-otlp-proto-http==1.27.0
opentelemetry-instrumentation-flask==0.48b0

9.3 Application Code

app/app.py

# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
import logging
import os
import random
import time
from datetime import datetime

from flask import Flask, jsonify, request
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Attributes, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from prometheus_client import Counter, Gauge, Histogram, generate_latest


def _setup_logging() -> None:
  logging.basicConfig(
      level=logging.INFO,
      format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
  )


def _setup_tracing() -> None:
  endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otelcol:4318")
  service_name = os.getenv("OTEL_SERVICE_NAME", "monitoring-demo-app")
  resource = Resource.create({"service.name": service_name})

  provider = TracerProvider(resource=resource)
  processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=f"{endpoint.rstrip('/')}/v1/traces"))
  provider.add_span_processor(processor)
  trace.set_tracer_provider(provider)


_setup_logging()
_setup_tracing()

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
logging.getLogger("werkzeug").setLevel(logging.WARNING)

LOGGER = logging.getLogger("monitoring-demo-app")
TRACER = trace.get_tracer(__name__)

REQUEST_COUNTER = Counter(
    "demo_app_requests_total",
    "Total number of requests handled by the demo application",
    ["endpoint", "method"],
)
TEMPERATURE_GAUGE = Gauge(
    "demo_app_temperature_celsius",
    "Simulated temperature value",
)
RESPONSE_HISTOGRAM = Histogram(
    "demo_app_request_duration_seconds",
    "Histogram of request durations",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5),
)


def _observe_temperature() -> None:
  TEMPERATURE_GAUGE.set(18 + random.random() * 10)


@app.before_request
def before_request():
  request.start_time = time.perf_counter()


@app.after_request
def after_request(response):
  elapsed = time.perf_counter() - getattr(request, "start_time", time.perf_counter())
  RESPONSE_HISTOGRAM.observe(elapsed)
  REQUEST_COUNTER.labels(request.path, request.method).inc()
  return response


@app.route("/")
def index():
  LOGGER.info("Root endpoint hit", extra={"client_ip": request.remote_addr})
  return jsonify(
      message="Demo Python service for metrics, logs, and traces.",
      timestamp=datetime.utcnow().isoformat() + "Z",
  )


@app.route("/hello")
def hello():
  name = request.args.get("name", "world")
  LOGGER.info("Saying hello", extra={"hello_name": name})
  with TRACER.start_as_current_span("say-hello") as span:
    span.set_attribute("demo.greeting.name", name)
    time.sleep(random.uniform(0.01, 0.2))
  return jsonify(greeting=f"Hello, {name}!")


@app.route("/work")
def work():
  iterations = int(request.args.get("iterations", 3))
  with TRACER.start_as_current_span("simulate-work") as span:
    span.set_attribute("demo.work.iterations", iterations)
    total = 0
    for i in range(iterations):
      with TRACER.start_as_current_span("work-loop") as loop_span:
        loop_span.set_attribute("demo.work.loop_index", i)
        value = random.randint(1, 100)
        total += value
        LOGGER.debug("Loop iteration", extra={"index": i, "value": value})
        time.sleep(0.05)
  LOGGER.info("Work completed", extra={"work_iterations": iterations, "work_result": total})
  return jsonify(result=total, iterations=iterations)


@app.route("/metrics")
def metrics():
  _observe_temperature()
  return generate_latest(), 200, {"Content-Type": "text/plain; version=0.0.4"}


if __name__ == "__main__":
  host = os.getenv("APP_HOST", "0.0.0.0")
  port = int(os.getenv("APP_PORT", "8000"))
  LOGGER.info("Starting demo application", extra={"host": host, "port": port})
  app.run(host=host, port=port)

9.4 Traffic Generator

app/generate_demo_traffic.sh

#!/usr/bin/env bash
# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31

set -euo pipefail

BASE_URL=${BASE_URL:-http://127.0.0.1:7005}
ITERATIONS=${ITERATIONS:-5}

if ! command -v curl >/dev/null 2>&1; then
  echo "curl is required to run this script" >&2
  exit 1
fi

echo "Generating demo traffic against $BASE_URL"
for i in $(seq 1 "$ITERATIONS"); do
  name="Grafana-$i"
  echo "[$(date +'%H:%M:%S')] GET /hello?name=$name"
  curl -sf -G "$BASE_URL/hello" --data-urlencode "name=$name" >/dev/null

  loops=$(( (RANDOM % 5) + 1 ))
  echo "[$(date +'%H:%M:%S')] GET /work?iterations=$loops"
  curl -sf -G "$BASE_URL/work" --data-urlencode "iterations=$loops" >/dev/null

  sleep 1
done

echo "Demo traffic completed"

The app uses OTLP/HTTP for traces, Prometheus client for metrics, and standard logging for Loki ingestion. Werkzeug access logs are silenced, leaving the custom monitoring-demo-app logger in Grafana Explore.
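Before building dashboards, it is worth hitting the published loopback port by hand and confirming that the custom metrics show up:

curl -s "http://127.0.0.1:7005/hello?name=test"
curl -s "http://127.0.0.1:7005/work?iterations=2"
curl -s http://127.0.0.1:7005/metrics | grep demo_app_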


10. Operational Utilities

10.1 Storage Footprint

storage_usage.sh

#!/usr/bin/env bash
# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
set -euo pipefail

declare -A VOLUME_PATHS=(
  [grafana]="/opt/docker-volumes/grafana/data"
  [prometheus]="/opt/docker-volumes/prometheus/data"
  [loki]="/opt/docker-volumes/loki/data"
  [tempo]="/opt/docker-volumes/tempo/data"
  [promtail]="/opt/docker-volumes/promtail/positions"
)

printf "%-12s %-45s %12s\n" "Component" "Path" "Usage"
printf "%-12s %-45s %12s\n" "---------" "----" "-----"

total_bytes=0

for component in "${!VOLUME_PATHS[@]}"; do
  path="${VOLUME_PATHS[$component]}"
  if sudo test -d "$path"; then
    bytes=$(sudo du -sb -- "${path}" | cut -f1)
    human=$(numfmt --to=iec --suffix=B "$bytes")
  else
    bytes=0
    human="(missing)"
  fi
  total_bytes=$((total_bytes + bytes))
  printf "%-12s %-45s %12s\n" "$component" "$path" "$human"
done

printf "%-12s %-45s %12s\n" "---------" "----" "-----"
printf "%-12s %-45s %12s\n" "TOTAL" "-" "$(numfmt --to=iec --suffix=B "$total_bytes")"

10.2 Container Memory Reporter

docker_memory_usage.sh

#!/usr/bin/env bash
# Generated by Codex Agent — do not edit manually
# Date: 2025-10-31
set -euo pipefail

if ! command -v docker >/dev/null 2>&1; then
  echo "docker command not found."
  exit 1
fi

printf "%-25s %-40s %12s %12s\n" "CONTAINER ID" "NAME" "MEM USAGE" "LIMIT"
printf "%-25s %-40s %12s %12s\n" "------------" "----" "--------" "-----"

# MemUsage looks like "12.3MiB / 1.944GiB"; use an explicit separator so the spaces inside it survive
docker stats --no-stream --format "{{.Container}}|{{.Name}}|{{.MemUsage}}" | while IFS='|' read -r id name mem; do
  usage=$(echo "$mem" | awk -F' / ' '{print $1}')
  limit=$(echo "$mem" | awk -F' / ' '{print $2}')
  printf "%-25s %-40s %12s %12s\n" "$id" "$name" "$usage" "$limit"
done

11. Bringing Everything Online

docker compose up -d          # build+start all services
docker compose ps             # confirm containers are healthy
./app/generate_demo_traffic.sh

Verification checklist:

  • Prometheus → Status > Targets shows monitoring-demo-app, node-exporter, and cadvisor as UP.
  • Grafana → Explore → Loki → {job="containers", container_name="monitoring-demo-app"} reveals enriched logs (with service_name and container_id labels).
  • Grafana → Explore → Tempo → service.name="monitoring-demo-app" fetches recent traces.
  • Grafana → Dashboards → Demo App Overview renders the p95 histogram panel using histogram_quantile(0.95, sum(rate(demo_app_request_duration_seconds_bucket[5m])) by (le)).

If you add more services, point them at http://otelcol:4318 for OTLP and add Prometheus scrape jobs where needed (follow the pattern in prometheus.yml).
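As a sketch, a hypothetical billing-api service exposing /metrics on port 8080 would need one extra scrape job in prometheus/prometheus.yml plus the OTLP environment variables in its Compose definition:

  # prometheus/prometheus.yml
  - job_name: billing-api
    metrics_path: /metrics
    static_configs:
      - targets:
          - billing-api:8080

  # docker-compose.yml (service fragment)
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otelcol:4318
      OTEL_SERVICE_NAME: billing-api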


12. Data Retention & Storage

Component  | Path                                    | Default Retention                                  | How to Adjust
-----------|-----------------------------------------|----------------------------------------------------|-----------------------------------------------------------
Grafana    | /opt/docker-volumes/grafana/data        | Until manually pruned                              | Clean via the Grafana UI or remove old dashboards/plugins
Prometheus | /opt/docker-volumes/prometheus/data     | ~15 days                                           | Add --storage.tsdb.retention.time=30d in Compose
Loki       | /opt/docker-volumes/loki/data           | Depends on Loki config (chunks & compaction)       | Tune in loki/config.yaml
Tempo      | /opt/docker-volumes/tempo/data          | 720h (30 days)                                     | Change block_retention in tempo.yaml
Promtail   | /opt/docker-volumes/promtail/positions  | Position offsets only (logs kept in source files)  | Adjust log retention at the source

Run ./storage_usage.sh periodically to check disk consumption; it uses sudo internally to access protected paths.


13. Tips & Discoveries

  • Container labels in Loki: Docker SD + relabeling exposes container_name, service_name, and compose_project, making Grafana Explore queries much friendlier than raw container IDs.
  • Journald vs classic logs: Because /run/log/journal is ephemeral on this host, Promtail sticks to /var/log/*.log. If you persist journald, add a dedicated scrape job.
  • Tempo “empty ring” errors: They vanish once Memcached is healthy and the collector sends batches regularly. The included OTEL collector config handles this.
  • Grafana Cloud agent migration: If you previously relied on Grafana Cloud Agent, stop the service, remove the package, clear the repo keys, and revoke its API token so traffic flows exclusively through this self-hosted stack (a command sketch follows this list).
  • Werkzeug noise reduction: Setting the access logger to WARNING keeps Loki focused on application logs while still showing Flask request traces.
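On a Debian/Ubuntu host where the agent came from Grafana's APT repository, the teardown is roughly as follows (package, unit, and repo file names assumed; adapt to how the agent was installed):

sudo systemctl disable --now grafana-agent
sudo apt-get purge -y grafana-agent
sudo rm -f /etc/apt/sources.list.d/grafana.list /etc/apt/keyrings/grafana.gpg
sudo apt-get update
# then revoke the agent's API token in the Grafana Cloud UI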

14. Where to Go Next

  1. Add alerting channels (Slack, Email) once you connect Grafana Alerting or Prometheus Alertmanager.
  2. Onboard real services: mount /opt/otel/opentelemetry-javaagent.jar, set OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:4318, and add Prometheus scrape jobs (see the Compose sketch after this list).
  3. Automate smoke tests (e.g. curl endpoints + Grafana API checks) in CI before deploying Compose changes.
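A minimal Compose fragment for a Java service instrumented with the OTel Java agent might look like this (image and service names are placeholders):

  orders-service:
    image: registry.example.com/orders-service:latest   # placeholder
    environment:
      JAVA_TOOL_OPTIONS: -javaagent:/otel/opentelemetry-javaagent.jar
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otelcol:4318
      OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
      OTEL_SERVICE_NAME: orders-service
    volumes:
      - /opt/otel/opentelemetry-javaagent.jar:/otel/opentelemetry-javaagent.jar:ro
    networks:
      - obs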

Happy monitoring!
