ahmet gedik

Posted on Jun 16

Monitoring Video Aggregator Health with a Go Prometheus Exporter

#go #prometheus #monitoring #observability

Last quarter we shipped a bug that nobody noticed for eleven hours. A YouTube Data API key tied to our Korea region hit its daily quota a little after midnight UTC. The PHP cron that refreshes trending videos kept running, kept getting 403 quotaExceeded, and — because the previous rows were still in SQLite — kept serving a frozen list. No 500s. No error spike. Cloudflare happily cached the pages. Our uptime monitor, which pings the homepage, stayed green the entire time. The only signal was a slow trickle of "this video isn't available" emails from users in Seoul.

That incident is the reason this article exists. At TopVideoHub we aggregate trending video across roughly nine Asia-Pacific regions, and the failure mode that actually hurts us is never a clean crash — it's silent rot. A crash pages someone. Rot just sits there serving 200 OK over stale or dead content. This post walks through the Go Prometheus exporter we built to turn the question "is the video layer actually healthy" into numbers we can scrape, graph, and page on, sitting next to a stack of PHP 8.4, SQLite with an FTS5 CJK index, LiteSpeed, and Cloudflare.

The failure modes a video aggregator actually has

Before writing a line of exporter code I made a list of the ways our video layer fails in production. Monitoring you design without that list ends up measuring CPU and disk while the actual product is on fire.

Per-region quota exhaustion. Each region's refresh job uses an API key (or rotates a small pool). One key can be throttled while every other region is fine, so a global "the API is up" check tells you nothing.
Stale data. The refresh cron silently no-ops on error and the last good rows keep serving. Freshness is a first-class health signal, not an afterthought.
Dead playback URLs. A video that was embeddable yesterday gets pulled, region-blocked, or made private. The metadata row survives; the player 404s.
Broken HLS manifests. For the streams we proxy, the manifest returns 200 OK but contains zero segments, or points at segment URLs that themselves 404. A status-code check passes; the stream is dead.
Search index regressions. Our SQLite FTS5 index uses a CJK-aware tokenizer. A bad migration can leave the index present but returning nothing for Japanese or Chinese queries.

Every one of these returns HTTP 200 to a naive checker. That is the whole problem, and it is why a simple HTTP ping is worse than useless here — it gives false confidence.

Why a sidecar probe and not in-request checks

The instinct is to add health checks into the request path: verify the video before rendering the page. Don't. Active probing is slow and flaky by nature — you are hitting third-party CDNs — and putting that latency in front of a LiteSpeed-served, Cloudflare-cached page would wreck the thing we optimize hardest for. It also doesn't compose: you would re-probe the same hot videos thousands of times and never touch the long tail.

Instead we run a separate Go binary — a Prometheus exporter — alongside the app. It does two things on its own schedule:

Reads cheap health facts straight out of the SQLite database (row counts, last-success timestamps per region).
Actively probes a sample of playback URLs and HLS manifests on a slow ticker, so we cover the catalog over time without hammering anyone.

Prometheus scrapes the exporter every fifteen seconds. The exporter never blocks a scrape on a live network probe — that is the single most important design rule here, and I'll come back to it.

Modeling the metrics

Good metrics are boring and stable. I want gauges for current state, counters for things that only ever increase, and a histogram for probe latency. Naming follows the Prometheus convention of namespace_subsystem_name in base units.

package probe

import "github.com/prometheus/client_golang/prometheus"

const (
    namespace = "tvh"
    subsystem = "video"
)

var (
    // ProbeUp is 1 when the last sampled probe for a region passed.
    ProbeUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Namespace: namespace, Subsystem: subsystem, Name: "probe_up",
        Help: "1 if the most recent playback probe for the region succeeded, else 0.",
    }, []string{"region"})

    // DataAgeSeconds is how long ago the refresh job last succeeded per region.
    DataAgeSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Namespace: namespace, Subsystem: subsystem, Name: "data_age_seconds",
        Help: "Seconds since the trending refresh job last succeeded for the region.",
    }, []string{"region"})

    // PlayableRatio is the fraction of sampled videos that passed the probe.
    PlayableRatio = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Namespace: namespace, Subsystem: subsystem, Name: "playable_ratio",
        Help: "Fraction (0-1) of sampled videos playable in the last cycle.",
    }, []string{"region"})

    // ProbesTotal counts every probe attempt by result.
    ProbesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: namespace, Subsystem: subsystem, Name: "probes_total",
        Help: "Total playback probes performed, partitioned by result.",
    }, []string{"region", "result"}) // result: ok|dead|timeout|empty_manifest

    // ProbeDuration tracks probe latency so we can alert on slow CDNs.
    ProbeDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: namespace, Subsystem: subsystem, Name: "probe_duration_seconds",
        Help:    "Latency of an individual playback probe.",
        Buckets: prometheus.ExponentialBuckets(0.05, 2, 8), // 50ms .. ~6.4s
    }, []string{"region"})
)

// MustRegister wires every metric into the given registry.
func MustRegister(r prometheus.Registerer) {
    r.MustRegister(ProbeUp, DataAgeSeconds, PlayableRatio, ProbesTotal, ProbeDuration)
}

Note the result label on probes_total has bounded cardinality — four fixed values. That matters: a label like video_id would explode your time series and melt Prometheus. Keep labels to things with a small, stable set of values (region, result), never unbounded identifiers.
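To put rough numbers on the cardinality point, here is a minimal sketch of the arithmetic. The helper `seriesCount` is hypothetical, the region count of nine comes from the "roughly nine" figure earlier in the post, and the 50,000-video catalog is an invented number purely for illustration. The product is an upper bound, since not every label combination necessarily occurs.

```go
package main

import "fmt"

// seriesCount returns an upper bound on the number of time series a single
// metric can produce: the product of the distinct-value counts of its labels.
// (Hypothetical helper for illustration, not part of the exporter.)
func seriesCount(valuesPerLabel ...int) int {
	n := 1
	for _, v := range valuesPerLabel {
		n *= v
	}
	return n
}

func main() {
	// probes_total as defined above: about 9 regions x 4 result values.
	fmt.Println("region x result:", seriesCount(9, 4))

	// The same metric with a video_id label added. The 50,000-video
	// catalog size is an assumption made up for this example.
	fmt.Println("region x result x video_id:", seriesCount(9, 4, 50000))
}
```

The bounded version stays at a few dozen series no matter how large the catalog grows, while the video_id version scales with the catalog itself.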

The prober: update on a ticker, never on scrape

Here is the rule I promised. The /metrics handler must return instantly. If you implement prometheus.Collector and do live HTTP probing inside Collect(), every scrape blocks on third-party CDNs, scrapes time out, and Prometheus marks the target down — ironically making your monitoring the least reliable thing in the stack.

So the prober runs in its own goroutine on a ticker and writes into pre-registered metric vectors. Collect() just reads whatever the last cycle wrote. Scrapes are always fast; probe freshness is bounded by the ticker interval.

package probe

import (
    "context"
    "database/sql"
    "log"
    "time"
)

type Prober struct {
    db       *sql.DB
    interval time.Duration
    sample   int // videos to probe per region per cycle
}

func NewProber(db *sql.DB, interval time.Duration, sample int) *Prober {
    return &Prober{db: db, interval: interval, sample: sample}
}

// Run blocks until ctx is cancelled, probing on each tick.
func (p *Prober) Run(ctx context.Context) {
    p.cycle(ctx) // probe once at startup so /metrics is meaningful immediately
    t := time.NewTicker(p.interval)
    defer t.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-t.C:
            p.cycle(ctx)
        }
    }
}

func (p *Prober) cycle(ctx context.Context) {
    regions, err := p.regions(ctx)
    if err != nil {
        log.Printf("probe: list regions: %v", err)
        return
    }
    for _, region := range regions {
        p.publishFreshness(ctx, region)
        p.probeRegion(ctx, region)
    }
}

// publishFreshness reads the last successful refresh time the PHP cron wrote.
func (p *Prober) publishFreshness(ctx context.Context, region string) {
    var lastEpoch sql.NullInt64
    err := p.db.QueryRowContext(ctx,
        `SELECT last_success_at FROM region_health WHERE region = ?`, region,
    ).Scan(&lastEpoch)
    if err != nil || !lastEpoch.Valid {
        DataAgeSeconds.WithLabelValues(region).Set(-1) // unknown
        return
    }
    age := time.Since(time.Unix(lastEpoch.Int64, 0)).Seconds()
    DataAgeSeconds.WithLabelValues(region).Set(age)
}

func (p *Prober) probeRegion(ctx context.Context, region string) {
    rows, err := p.db.QueryContext(ctx,
        `SELECT stream_url FROM videos
           WHERE region = ? AND stream_url IS NOT NULL
           ORDER BY RANDOM() LIMIT ?`, region, p.sample)
    if err != nil {
        log.Printf("probe: sample %s: %v", region, err)
        return
    }
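    // Drain the cursor before probing: keeping rows open across slow network
    // calls would hold a SQLite read lock for the whole cycle.
    var urls []string
    for rows.Next() {
        var u string
        if err := rows.Scan(&u); err == nil {
            urls = append(urls, u)
        }
    }
    if err := rows.Err(); err != nil {
        log.Printf("probe: sample %s: %v", region, err)
    }
    rows.Close()

    var ok, total int
    for _, u := range urls {
        total++
        start := time.Now()
        result := checkHLS(ctx, u)
        ProbeDuration.WithLabelValues(region).Observe(time.Since(start).Seconds())
        ProbesTotal.WithLabelValues(region, result).Inc()
        if result == "ok" {
            ok++
        }
    }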
    if total == 0 {
        ProbeUp.WithLabelValues(region).Set(0)
        PlayableRatio.WithLabelValues(region).Set(0)
        return
    }
    ratio := float64(ok) / float64(total)
    PlayableRatio.WithLabelValues(region).Set(ratio)
    if ratio >= 0.8 { // a region is "up" if >=80% of the sample is playable
        ProbeUp.WithLabelValues(region).Set(1)
    } else {
        ProbeUp.WithLabelValues(region).Set(0)
    }
}

func (p *Prober) regions(ctx context.Context) ([]string, error) {
    rows, err := p.db.QueryContext(ctx, `SELECT region FROM region_health`)
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    var out []string
    for rows.Next() {
        var r string
        if err := rows.Scan(&r); err == nil {
            out = append(out, r)
        }
    }
    return out, rows.Err()
}

A few things in there earn their keep. Sampling with ORDER BY RANDOM() and a limit of 25 means we rotate through the catalog instead of re-probing the same hot videos forever. At one cycle a minute that is 1,500 probes per region per hour (somewhat fewer distinct URLs, because successive random samples can overlap) without ever spiking a single CDN. probeRegion never returns an error to its caller; a failed probe is data, recorded as a counter increment, not an exception to bubble up. And the 80% threshold for probe_up is deliberate: two dead videos in a sample of 25 is normal catalog churn, not an outage, and it takes six dead videos (76% playable) to flip the gauge. You tune that number to your own churn rate.
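How much of the catalog an hour of sampling actually touches depends on catalog size, because each cycle draws a fresh random sample and samples can overlap. The sketch below estimates the expected number of distinct URLs probed, assuming each cycle picks `sample` distinct rows uniformly at random and cycles are independent. The function name and the catalog sizes are hypothetical; only the sample of 25 and the one-minute cycle come from this post.

```go
package main

import (
	"fmt"
	"math"
)

// expectedDistinct estimates how many distinct URLs get probed after `cycles`
// independent cycles, each drawing `sample` distinct rows uniformly at random
// from a catalog of `catalog` rows. A given URL is missed by one cycle with
// probability 1 - sample/catalog, so it is missed by all of them with that
// probability raised to the number of cycles.
func expectedDistinct(catalog, sample, cycles int) float64 {
	if catalog <= 0 || sample <= 0 || cycles <= 0 {
		return 0
	}
	if sample >= catalog {
		return float64(catalog) // every cycle already covers everything
	}
	missOnce := 1 - float64(sample)/float64(catalog)
	return float64(catalog) * (1 - math.Pow(missOnce, float64(cycles)))
}

func main() {
	// Catalog sizes are invented for illustration; 25 per cycle and
	// 60 cycles per hour match the exporter's configuration.
	for _, catalog := range []int{500, 1500, 20000} {
		fmt.Printf("catalog=%d: about %.0f distinct URLs per hour\n",
			catalog, expectedDistinct(catalog, 25, 60))
	}
}
```

For a small per-region catalog an hour of sampling covers most rows; for a large one, nearly every probe lands on a URL not yet seen that hour, and full coverage simply takes longer.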

Probing an HLS manifest correctly

A GET that returns 200 tells you almost nothing about an HLS stream. You have to read the manifest. The function below distinguishes four outcomes — ok, dead, timeout, and empty_manifest — because they want different alerts. A timeout means a slow CDN; a dead status means the URL is gone; an empty manifest means the origin is broken in a way a status code hides.

package probe

import (
    "bufio"
    "context"
    "io"
    "net/http"
    "strings"
    "time"
)

var probeClient = &http.Client{Timeout: 6 * time.Second}

// checkHLS fetches an .m3u8 and returns ok|dead|timeout|empty_manifest.
// It accepts either a master playlist (has variant streams) or a media
// playlist (has segments). A 200 with neither, or a body that is not a
// playlist at all (no #EXTM3U tag), is empty_manifest.
func checkHLS(ctx context.Context, url string) string {
    body, status := fetchPlaylist(ctx, url)
    if status != "ok" {
        return status // dead or timeout
    }

    var isPlaylist, hasSegment, hasVariant bool
    sc := bufio.NewScanner(strings.NewReader(body))
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        if line == "" {
            continue
        }
        if strings.HasPrefix(line, "#") {
            switch {
            case strings.HasPrefix(line, "#EXTM3U"):
                isPlaylist = true // required first tag of every HLS playlist
            case strings.HasPrefix(line, "#EXT-X-STREAM-INF"):
                hasVariant = true // master playlist points at variants
            }
            continue
        }
        hasSegment = true // a non-comment line is a segment or variant URI
    }

    // Without the #EXTM3U tag a 200 body is something else (an HTML error
    // page, say), so its stray lines must not count as segments.
    if isPlaylist && (hasSegment || hasVariant) {
        return "ok"
    }
    return "empty_manifest"
}

func fetchPlaylist(ctx context.Context, url string) (body, status string) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return "", "dead"
    }
    req.Header.Set("User-Agent", "tvh-video-probe/1.0")
    resp, err := probeClient.Do(req)
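    if err != nil {
        // http.Client.Do returns a *url.Error, which reports timeouts through
        // its Timeout method; that is sturdier than matching the error text.
        te, isTimeouter := err.(interface{ Timeout() bool })
        if ctx.Err() != nil || (isTimeouter && te.Timeout()) {
            return "", "timeout"
        }
        return "", "dead"
    }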
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", "dead"
    }
    var sb strings.Builder
    // Cap the read; manifests are small. io.EOF here is expected, not an error.
    _, _ = io.CopyN(&sb, resp.Body, 1<<20)
    return sb.String(), "ok"
}

The capped read (io.CopyN with a 1 MiB ceiling) is a small but real defense: a misbehaving origin streaming an endless body would otherwise pin a goroutine and memory. We don't need the whole thing — just enough to see a segment line.
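The failure-mode list also mentions manifests whose segment URLs themselves 404, which a manifest-only check cannot see. A follow-up probe would first need the playlist's first URI resolved against the playlist's own URL, since HLS playlists commonly use relative paths. The sketch below shows only that resolution step with the standard library. `firstURI` is a hypothetical helper, not part of the exporter above, and the example URLs are placeholders.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

var errNoURI = errors.New("playlist lists no URIs")

// firstURI returns the first non-comment line of a playlist body, resolved
// against the playlist's own URL. In a media playlist that is the first
// segment; in a master playlist it is the first variant playlist.
func firstURI(manifestURL, body string) (string, error) {
	base, err := url.Parse(manifestURL)
	if err != nil {
		return "", err
	}
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank lines, tags, and comments carry no URI
		}
		ref, err := url.Parse(line)
		if err != nil {
			return "", err
		}
		// Relative references resolve against the manifest location;
		// absolute URLs pass through unchanged.
		return base.ResolveReference(ref).String(), nil
	}
	return "", errNoURI
}

func main() {
	// Placeholder manifest and URL, invented for this example.
	body := "#EXTM3U\n#EXT-X-TARGETDURATION:6\n#EXTINF:6.0,\nseg_001.ts\n"
	u, err := firstURI("https://cdn.example.com/live/ch1/index.m3u8", body)
	fmt.Println(u, err)
}
```

Fetching the resolved URI with the same capped, time-limited client would then tell a stream with a valid manifest but missing segments apart from a healthy one. Whether that extra request per probe is worth it depends on how often the upstream breaks that way.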

Letting the PHP side publish freshness

The cheapest, most valuable signal — am I serving stale data — needs no network probe at all. Our PHP refresh job already knows when it last succeeded per region; it just has to write that down where the Go exporter can read it. We use a tiny region_health table in the same SQLite file.

<?php
// app/Health/RegionHealth.php — called by the refresh cron on success/failure.

final class RegionHealth
{
    public function __construct(private \PDO $db)
    {
        $this->db->exec(<<<SQL
            CREATE TABLE IF NOT EXISTS region_health (
                region          TEXT PRIMARY KEY,
                last_success_at INTEGER,   -- unix epoch, NULL until first success
                last_error      TEXT,
                updated_at      INTEGER NOT NULL
            )
        SQL);
    }

    public function recordSuccess(string $region): void
    {
        $now  = time();
        $stmt = $this->db->prepare(<<<SQL
            INSERT INTO region_health (region, last_success_at, last_error, updated_at)
            VALUES (:r, :now, NULL, :now)
            ON CONFLICT(region) DO UPDATE SET
                last_success_at = excluded.last_success_at,
                last_error      = NULL,
                updated_at      = excluded.updated_at
        SQL);
        $stmt->execute([':r' => $region, ':now' => $now]);
    }

    public function recordFailure(string $region, string $error): void
    {
        // We deliberately do NOT touch last_success_at. That is exactly what
        // lets the Go exporter watch data_age_seconds climb while the cron
        // keeps failing — the silent-rot signal that the 11-hour outage lacked.
        $stmt = $this->db->prepare(<<<SQL
            INSERT INTO region_health (region, last_error, updated_at)
            VALUES (:r, :err, :now)
            ON CONFLICT(region) DO UPDATE SET
                last_error = excluded.last_error,
                updated_at = excluded.updated_at
        SQL);
        $stmt->execute([':r' => $region, ':err' => $error, ':now' => time()]);
    }
}

That comment in recordFailure is the entire lesson of the opening incident in three lines. On failure we update last_error and updated_at but leave last_success_at frozen, so data_age_seconds keeps growing while the cron is broken, no matter how many times it runs and fails. The Korea-quota scenario would now trip the stale-data alert as soon as the age crossed its threshold, instead of going unnoticed for eleven hours.

Wiring the exporter

The main package opens the database read-only, registers the metrics on a private registry, starts the prober goroutine, and serves /metrics. Read-only matters: PHP and LiteSpeed own all writes to this database; the exporter is a guest and should never be able to lock or corrupt it.

package main

import (
    "context"
    "database/sql"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    _ "modernc.org/sqlite" // pure-Go SQLite driver, no cgo

    "example.com/tvh/probe"
)

func main() {
    dsn := os.Getenv("TVH_SQLITE_DSN") // e.g. file:/var/www/tvh/data/app.db?mode=ro
    if dsn == "" {
        log.Fatal("TVH_SQLITE_DSN is required")
    }
    db, err := sql.Open("sqlite", dsn)
    if err != nil {
        log.Fatalf("open db: %v", err)
    }
    db.SetMaxOpenConns(2) // read-only, keep it tiny
    defer db.Close()

    reg := prometheus.NewRegistry()
    probe.MustRegister(reg)

    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    // Probe every 60s, sampling 25 videos per region per cycle.
    go probe.NewProber(db, 60*time.Second, 25).Run(ctx)

    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        _, _ = w.Write([]byte("ok"))
    })

    srv := &http.Server{Addr: ":9320", Handler: mux, ReadHeaderTimeout: 5 * time.Second}
    go func() {
        <-ctx.Done()
        shutCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        _ = srv.Shutdown(shutCtx)
    }()

    log.Println("video probe exporter listening on :9320")
    if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
        log.Fatalf("server: %v", err)
    }
}

Using modernc.org/sqlite (pure Go, no cgo) means the binary builds static and drops into a scratch container with almost nothing else: no shared libraries to chase, only a CA certificate bundle so the HTTPS probes can verify upstream certificates. Then point Prometheus at it with a job block: job_name: tvh-video-probe, static_configs targeting the host on :9320, scrape_interval: 15s. Because every scrape is a memory read, that interval costs essentially nothing.

Alerts that mean something

Metrics without alerts are a dashboard nobody watches. These are the rules that map directly to the failure modes above:

Stale data (the original incident): tvh_video_data_age_seconds > 3 * 3600 held for 10m. Fires even while every page returns 200, because PHP never bumps last_success_at on failure.
Region down: tvh_video_probe_up == 0 for 5m.
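Degraded playability: tvh_video_playable_ratio < 0.9 as a soft warning before a region flips fully down. The warning threshold has to sit above the 0.8 cutoff that sets probe_up to 0; otherwise the region is already marked down by the time the warning fires.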
Rising deadness: rate(tvh_video_probes_total{result="dead"}[15m]) > 0.1 catches a CDN or upstream slowly shedding URLs.
Slow CDN: histogram_quantile(0.95, sum by (le, region) (rate(tvh_video_probe_duration_seconds_bucket[10m]))) > 3 warns when p95 probe latency crosses three seconds.

We route stale-data and region-down to a pager; the ratio and latency warnings go to a Slack channel.

Extending the pattern to search

The same shape covers the FTS5 regression I listed earlier. We added one more probe that runs a known CJK query — a fixed Japanese phrase we expect to always match content — through the FTS5 index and asserts a non-zero result count, exposing tvh_search_results. A migration that breaks the CJK tokenizer drops that gauge to zero, and we know before a user in Tokyo types into the search box. The probe is generic; the assertions are domain knowledge.

What it caught

In the first month this exporter caught a quota exhaustion in our Taiwan region (paged in four minutes, not eleven hours), a batch of region-blocked videos in Vietnam that had quietly grown to 14% of the sample, and one genuinely scary empty_manifest spike when an upstream packager misconfigured a segment path. None of those would have tripped an HTTP uptime check.

The broader takeaway is small and stubborn: monitor the product, not the process. CPU, memory, and a homepage ping describe a machine that is running. They say nothing about whether the thing your users came for actually works. For a video aggregator that means probing playability, freshness per region, and search — the three places 200 OK lies to you — and turning each into a metric a robot will page you about at 3 a.m. so a human in Seoul doesn't have to email you instead.

DEV Community