Observability

Hookaido provides structured logging, Prometheus metrics, and OpenTelemetry tracing, all configurable via the observability block.

Quick Start

observability {
  access_log {
    enabled on
    output stderr
    format json
  }

  runtime_log {
    level info
    output stderr
    format json
  }

  metrics {
    listen ":9900"
    prefix "/metrics"
  }

  tracing {
    enabled on
    collector "https://otel.example.com/v1/traces"
  }
}

Logging

Hookaido produces two log streams, both structured JSON:

Access Log

Per-request logs for ingress, Pull API, and Admin API.

Shorthand:

observability {
  access_log on    # enables the access log to stderr in JSON format
}

Block form:

observability {
  access_log {
    enabled on
    output stderr       # stdout, stderr, or file
    path /var/log/hookaido/access.log   # required when output=file
    format json
  }
}

Runtime Log

Application-level structured logs (startup, reload, errors, queue events).

Shorthand:

observability {
  runtime_log info    # level as shorthand: debug, info, warn, error, off
}

Block form:

observability {
  runtime_log {
    level info         # debug, info, warn, error, off
    output stderr      # stdout, stderr, or file
    path /var/log/hookaido/runtime.log
    format json
  }
}

Log Sinks

Sink Description
stdout Standard output
stderr Standard error (default)
file File output (requires path)

The --log-level CLI flag overrides the runtime log level from config.

Metrics

Prometheus-compatible metrics endpoint.

observability {
  metrics {
    listen ":9900"           # default: 127.0.0.1:9900
    prefix "/metrics"        # default: /metrics
    enabled on               # explicitly enable/disable
  }
}

Set enabled off to disable the metrics listener while keeping config in place.

Available Metrics

Queue metrics:

Metric Type Description
hookaido_queue_depth gauge Current items by state (queued, leased, dead)
hookaido_queue_total gauge Current total items across all queue states
hookaido_queue_oldest_queued_age_seconds gauge Age of the oldest queued item in seconds
hookaido_queue_ready_lag_seconds gauge Ready lag of the earliest runnable queued item in seconds
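
As an illustrative sketch, the age gauge above can drive a simple backlog alert; the 300-second threshold is an assumption for illustration, not a shipped default:

hookaido_queue_oldest_queued_age_seconds > 300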

Ingress metrics:

Metric Type Description
hookaido_ingress_accepted_total counter Ingress requests accepted and enqueued
hookaido_ingress_rejected_total counter Ingress requests rejected (auth, rate limiting, etc.)
hookaido_ingress_rejected_by_reason_total{reason,status} counter Ingress rejects by normalized reason + status (includes memory_pressure with status 503)
hookaido_ingress_enqueued_total counter Items enqueued via ingress (exceeds accepted when fan-out enqueues multiple items per request)
hookaido_ingress_adaptive_backpressure_total{reason} counter Ingress requests rejected by adaptive backpressure (by trigger reason)
hookaido_ingress_adaptive_backpressure_applied_total counter Total ingress requests rejected by adaptive backpressure
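
The accepted/rejected counters compose into an overall ingress rejection ratio; a sketch (the 5-minute window is illustrative):

  sum(rate(hookaido_ingress_rejected_total[5m]))
/
  (sum(rate(hookaido_ingress_accepted_total[5m])) + sum(rate(hookaido_ingress_rejected_total[5m])))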

Delivery metrics:

Metric Type Description
hookaido_delivery_attempts_total counter Total push delivery attempts
hookaido_delivery_acked_total counter Deliveries acknowledged (2xx)
hookaido_delivery_retry_total counter Deliveries scheduled for retry
hookaido_delivery_dead_total counter Deliveries moved to DLQ
hookaido_delivery_dead_by_reason_total{reason} counter DLQ transitions by normalized reason (max_retries, no_retry, policy_denied, unspecified, other)
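
A delivery success ratio can be sketched from these counters (the 5-minute window is illustrative):

  sum(rate(hookaido_delivery_acked_total[5m]))
/
  sum(rate(hookaido_delivery_attempts_total[5m]))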

Pull metrics:

Metric Type Description
hookaido_pull_dequeue_total counter Pull dequeue requests by route and status label (200, 204, 4xx, 5xx)
hookaido_pull_acked_total counter Successful Pull ack operations by route
hookaido_pull_nacked_total counter Successful Pull nack/mark-dead operations by route
hookaido_pull_ack_conflict_total counter Pull ack lease conflicts (409) by route
hookaido_pull_nack_conflict_total counter Pull nack lease conflicts (409) by route
hookaido_pull_lease_active gauge Active Pull leases currently tracked by route
hookaido_pull_lease_expired_total counter Lease expirations observed during Pull ack/nack/extend by route
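
For example, a sketch of the per-route ack lease-conflict rate, which can help spot competing consumers on the same route (window is illustrative):

sum by (route) (
  rate(hookaido_pull_ack_conflict_total[5m])
)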

Store common metrics:

Metric Type Description
hookaido_store_operation_seconds{backend,operation} histogram Store operation duration by backend and operation
hookaido_store_operation_total{backend,operation} counter Store operation totals by backend and operation
hookaido_store_errors_total{backend,operation,kind} counter Store operation errors by backend, operation, and normalized kind

The common families are emitted by all first-party queue backends (sqlite, memory, postgres).

Store/SQLite compatibility metrics (sqlite backend):

Metric Type Description
hookaido_store_sqlite_write_seconds histogram SQLite write transaction duration (queue mutation paths)
hookaido_store_sqlite_dequeue_seconds histogram SQLite dequeue transaction duration
hookaido_store_sqlite_checkpoint_seconds histogram SQLite WAL checkpoint duration (periodic passive checkpoints)
hookaido_store_sqlite_busy_total counter SQLite busy/locked errors observed in instrumented paths
hookaido_store_sqlite_retry_total counter SQLite begin-transaction retry attempts after busy/locked errors
hookaido_store_sqlite_tx_commit_total counter Committed SQLite transactions in instrumented queue paths
hookaido_store_sqlite_tx_rollback_total counter Rolled-back SQLite transactions in instrumented queue paths
hookaido_store_sqlite_checkpoint_total counter Successful periodic SQLite WAL checkpoints
hookaido_store_sqlite_checkpoint_errors_total counter Failed periodic SQLite WAL checkpoints

Store/Memory metrics (memory backend):

Metric Type Description
hookaido_store_memory_items{state} gauge Current in-memory item count by state (queued, leased, delivered, dead)
hookaido_store_memory_retained_bytes{state} gauge Estimated retained bytes by state (queued, leased, delivered, dead)
hookaido_store_memory_retained_bytes_total gauge Estimated total retained bytes in memory store
hookaido_store_memory_evictions_total{reason} counter Memory-store evictions by reason (drop_oldest, retention evictions, etc.)
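
Eviction pressure on the memory backend can be watched per reason; a sketch (window is illustrative):

sum by (reason) (
  rate(hookaido_store_memory_evictions_total[5m])
)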

Backend Metric Expectations

Use backend-agnostic metric families as the default dashboard and alert base:

  • hookaido_store_operation_seconds{backend,operation}
  • hookaido_store_operation_total{backend,operation}
  • hookaido_store_errors_total{backend,operation,kind}
  • hookaido_delivery_dead_by_reason_total{reason}

Backend-specific coverage:

Backend Required/common store metrics Backend-specific metrics Notes
sqlite hookaido_store_operation_*, hookaido_store_errors_total hookaido_store_sqlite_* hookaido_store_sqlite_* is compatibility/debug surface for SQLite internals.
memory hookaido_store_operation_*, hookaido_store_errors_total hookaido_store_memory_* hookaido_store_sqlite_* is intentionally absent.
postgres hookaido_store_operation_*, hookaido_store_errors_total none (store internals exposed via common families) hookaido_store_sqlite_* and hookaido_store_memory_* are intentionally absent.

Migration guidance:

  • Prefer common store metric families for SLOs, saturation alerts, and cross-backend dashboards.
  • Keep hookaido_store_sqlite_* for SQLite-only deep diagnostics (for example WAL/checkpoint lock analysis).
  • Treat missing backend-specific series on other backends as "not emitted", not as zero or failure.

PromQL Examples (Backend-Aware)

Store p95 by backend and operation:

histogram_quantile(
  0.95,
  sum by (backend, operation, le) (
    rate(hookaido_store_operation_seconds_bucket[5m])
  )
)

Store error rate by backend, operation, and kind:

sum by (backend, operation, kind) (
  rate(hookaido_store_errors_total[5m])
)

Backend-specific store throughput comparison:

sum by (backend, operation) (
  rate(hookaido_store_operation_total[5m])
)

DLQ growth by dead reason:

sum by (reason) (
  increase(hookaido_delivery_dead_by_reason_total[15m])
)

Alert example (backend-aware store error burst):

sum by (backend) (
  rate(hookaido_store_errors_total[5m])
) > 0

Publish metrics:

Metric Type Description
hookaido_publish_accepted_total counter Accepted publish mutations
hookaido_publish_rejected_total counter Rejected publish mutations
hookaido_publish_rejected_validation_total counter Rejections: validation errors
hookaido_publish_rejected_policy_total counter Rejections: policy violations
hookaido_publish_rejected_conflict_total counter Rejections: duplicate IDs
hookaido_publish_rejected_queue_full_total counter Rejections: queue at capacity
hookaido_publish_rejected_store_total counter Rejections: store errors
hookaido_publish_scoped_accepted_total counter Accepted scoped (managed) publish
hookaido_publish_scoped_rejected_total counter Rejected scoped (managed) publish

Tracing diagnostics:

Metric Type Description
hookaido_tracing_enabled gauge Whether tracing is configured
hookaido_tracing_init_failures_total counter Tracing initialization failures
hookaido_tracing_export_errors_total counter Tracing export errors

Compatibility/version metrics:

Metric Type Description
hookaido_build_info{version=...} gauge Process version label for dashboard/version gating
hookaido_metrics_schema_info{schema=...} gauge Metrics schema version label for compatibility guards

Tracing

OpenTelemetry OTLP/HTTP traces for request-level observability. HTTP servers (ingress, Pull API, Admin API) and the outbound push dispatcher client are instrumented.

Minimal Config

observability {
  tracing {
    enabled on
    collector "https://otel.example.com/v1/traces"
  }
}

Full Config

observability {
  tracing {
    enabled on
    collector "https://otel.example.com/v1/traces"
    url_path "/v1/traces"
    timeout "10s"
    compression gzip           # none or gzip
    insecure off               # allow plain HTTP (dev only)

    # TLS options
    tls {
      ca_file /path/to/ca.pem
      cert_file /path/to/cert.pem
      key_file /path/to/key.pem
      server_name "otel.example.com"
      insecure_skip_verify off
    }

    # Proxy
    proxy_url "http://proxy.internal:3128"

    # Retry on export failure
    retry {
      enabled on
      initial_interval "5s"
      max_interval "30s"
      max_elapsed_time "1m"
    }

    # Custom headers (e.g., for auth)
    header "Authorization" "Bearer otel-token"
    header "X-Custom-Header" "value"
  }
}

Directive Default Description
enabled off Enable/disable tracing
collector – OTLP/HTTP collector endpoint
url_path /v1/traces URL path on the collector
timeout 10s Export timeout
compression none none or gzip
insecure off Allow HTTP (non-TLS) transport
proxy_url – HTTP proxy for the exporter
tls.ca_file – CA certificate file for TLS
tls.cert_file – Client certificate file for mTLS
tls.key_file – Client key file for mTLS
tls.server_name – Override the TLS server name
tls.insecure_skip_verify off Skip TLS certificate verification
retry.enabled off Retry failed exports
retry.initial_interval – First retry delay
retry.max_interval – Maximum retry delay
retry.max_elapsed_time – Total retry time budget
header – Custom HTTP headers (repeatable)

Header entries must be valid HTTP header name/value pairs. Invalid entries fail config validation.

Health Diagnostics

The Admin API health endpoint (GET /healthz?details=1) aggregates observability data:

  • Queue state rollups with age/lag indicators
  • Backlog trend signals with operator action playbooks
  • Tracing counters (init failures, export errors)
  • Ingress adaptive-backpressure diagnostics (adaptive_backpressure_applied_total, adaptive_backpressure_by_reason) and rejection reason counters (rejected_by_reason, including memory_pressure)
  • Delivery diagnostics include dead-letter reason breakdown (dead_by_reason) for DLQ growth attribution
  • Memory-store diagnostics (when backend is memory): items_by_state, retained bytes, eviction counters, and memory_pressure status/limits/reject counters
  • Top route/target backlog buckets
  • Queue diagnostics are cached (short TTL) and served stale-while-refresh under heavy load to keep control-plane endpoints responsive.

Operational guidance for control-plane responsiveness:

  • Keep SLO probes on GET /healthz (without details) for the fastest liveness path.
  • Use GET /healthz?details=1 and GET /metrics for diagnostics/monitoring; under queue saturation these endpoints prioritize bounded latency over strictly real-time queue snapshots.

Saturation Notes

Queue saturation analysis showed one hot path in the ingest/admission write flow: with queue_limits.max_depth enabled, each enqueue previously executed COUNT(*) over active queue states (queued + leased) inside a write transaction.

At high occupancy, that repeated count increased write transaction time and lock contention (hookaido_store_sqlite_write_seconds, hookaido_store_sqlite_busy_total, hookaido_store_sqlite_retry_total).

Hookaido now maintains O(1) active-depth counters (queue_counters) via SQLite triggers and uses them for max_depth admission checks.

For memory backend deployments, Hookaido also applies a retained-footprint pressure guard and emits memory_pressure ingress reject reasons before hard process failure risk.

To validate improvements in your environment, compare before/after load runs using:

  • p95/p99 ingress latency and 503 rate
  • hookaido_store_sqlite_write_seconds histogram shape
  • hookaido_store_sqlite_busy_total and hookaido_store_sqlite_retry_total growth rate

See Admin API for details.

Adaptive Backpressure Tuning

Use the dedicated production runbook in Adaptive Backpressure Tuning.

Key principle: defaults.adaptive_backpressure should react before hard queue_limits.max_depth pressure, not after.

Use these series together:

  • hookaido_ingress_adaptive_backpressure_total{reason}
  • hookaido_ingress_rejected_by_reason_total{reason,status}
  • ingress latency p95/p99 from HTTP telemetry
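
For example, a sketch breaking adaptive-backpressure rejects down by trigger reason (window is illustrative):

sum by (reason) (
  rate(hookaido_ingress_adaptive_backpressure_total[5m])
)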

Dashboard Compatibility Notes

When dashboards span mixed Hookaido versions (for example v1.2.x and v1.3.x), treat missing metrics as "not emitted" rather than zero:

  • Gate rules and panels by hookaido_metrics_schema_info{schema="1.3.0"} == 1 (or hookaido_build_info version labels).
  • In PromQL, prefer compatibility-safe expressions (for example metric OR on() vector(0)) where appropriate.
  • Document minimum supported Hookaido version per dashboard bundle to avoid false "all good" signals from absent series.
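
A compatibility-safe sketch of that pattern, which yields 0 instead of an empty result when the series is not emitted (note PromQL uses lowercase or):

sum(rate(hookaido_delivery_dead_by_reason_total[15m])) or on() vector(0)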

Audit Logging

All Admin API and MCP mutations emit structured JSONL audit events (to stderr or the configured runtime log):

{
  "timestamp": "2026-02-09T10:00:00Z",
  "principal": "ops@example.test",
  "role": "operate",
  "tool": "messages_publish",
  "input_hash": "sha256:abc...",
  "result": "ok",
  "duration_ms": 42,
  "metadata": { ... }
}

Audit metadata varies by operation:

  • Config mutations: config_mutation (operation, mode, outcome)
  • Runtime control: runtime_control (operation, outcome)
  • ID-based mutations: id_mutation (operation, IDs requested/unique/changed)
  • Filter mutations: filter_mutation (operation, matched/changed, preview flag)
  • Publish: admin_proxy_publish (rollback counters, if Admin-proxy mode)
