# Observability

Hookaido provides structured logging, Prometheus metrics, and OpenTelemetry tracing, all configurable via the `observability` block.
## Quick Start

```
observability {
    access_log {
        enabled on
        output stderr
        format json
    }
    runtime_log {
        level info
        output stderr
        format json
    }
    metrics {
        listen ":9900"
        prefix "/metrics"
    }
    tracing {
        enabled on
        collector "https://otel.example.com/v1/traces"
    }
}
```
## Logging
Hookaido produces two log streams, both structured JSON:
### Access Log
Per-request logs for ingress, Pull API, and Admin API.
Both a shorthand and a block form are accepted. Block form:
```
observability {
    access_log {
        enabled on
        output stderr                     # stdout, stderr, or file
        path /var/log/hookaido/access.log # required when output=file
        format json
    }
}
```
### Runtime Log
Application-level structured logs (startup, reload, errors, queue events).
Both a shorthand and a block form are accepted. Block form:
```
observability {
    runtime_log {
        level info    # debug, info, warn, error, off
        output stderr # stdout, stderr, or file
        path /var/log/hookaido/runtime.log
        format json
    }
}
```
### Log Sinks

| Sink | Description |
|---|---|
| `stdout` | Standard output |
| `stderr` | Standard error (default) |
| `file` | File output (requires `path`) |

The `--log-level` CLI flag overrides the runtime log level from config.
## Metrics

Prometheus-compatible metrics endpoint.

```
observability {
    metrics {
        listen ":9900"     # default: 127.0.0.1:9900
        prefix "/metrics"  # default: /metrics
        enabled on         # explicitly enable/disable
    }
}
```

Set `enabled off` to disable the metrics listener while keeping the config in place.
### Available Metrics
Queue metrics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_queue_depth` | gauge | Current items by state (queued, leased, dead) |
| `hookaido_queue_total` | gauge | Current total items across all queue states |
| `hookaido_queue_oldest_queued_age_seconds` | gauge | Age of the oldest queued item in seconds |
| `hookaido_queue_ready_lag_seconds` | gauge | Ready lag of the earliest runnable queued item in seconds |
Ingress metrics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_ingress_accepted_total` | counter | Ingress requests accepted and enqueued |
| `hookaido_ingress_rejected_total` | counter | Ingress requests rejected (auth, rate limit, etc.) |
| `hookaido_ingress_rejected_by_reason_total{reason,status}` | counter | Ingress rejects by normalized reason and status (includes `memory_pressure` with status 503) |
| `hookaido_ingress_enqueued_total` | counter | Items enqueued via ingress (> accepted when fanout applies) |
| `hookaido_ingress_adaptive_backpressure_total{reason}` | counter | Ingress requests rejected by adaptive backpressure, by trigger reason |
| `hookaido_ingress_adaptive_backpressure_applied_total` | counter | Total ingress requests rejected by adaptive backpressure |
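For dashboards, the reason-labeled reject counter can be broken out per second with a query along these lines (the rate window is illustrative):

```
# Ingress rejections per second by normalized reason and status
sum by (reason, status) (
  rate(hookaido_ingress_rejected_by_reason_total[5m])
)
```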
Delivery metrics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_delivery_attempts_total` | counter | Total push delivery attempts |
| `hookaido_delivery_acked_total` | counter | Deliveries acknowledged (2xx) |
| `hookaido_delivery_retry_total` | counter | Deliveries scheduled for retry |
| `hookaido_delivery_dead_total` | counter | Deliveries moved to the DLQ |
| `hookaido_delivery_dead_by_reason_total{reason}` | counter | DLQ transitions by normalized reason (`max_retries`, `no_retry`, `policy_denied`, `unspecified`, `other`) |
Pull metrics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_pull_dequeue_total` | counter | Pull dequeue requests by route and status label (200, 204, 4xx, 5xx) |
| `hookaido_pull_acked_total` | counter | Successful Pull ack operations by route |
| `hookaido_pull_nacked_total` | counter | Successful Pull nack/mark-dead operations by route |
| `hookaido_pull_ack_conflict_total` | counter | Pull ack lease conflicts (409) by route |
| `hookaido_pull_nack_conflict_total` | counter | Pull nack lease conflicts (409) by route |
| `hookaido_pull_lease_active` | gauge | Active Pull leases currently tracked, by route |
| `hookaido_pull_lease_expired_total` | counter | Lease expirations observed during Pull ack/nack/extend, by route |
Store common metrics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_store_operation_seconds{backend,operation}` | histogram | Store operation duration by backend and operation |
| `hookaido_store_operation_total{backend,operation}` | counter | Store operation totals by backend and operation |
| `hookaido_store_errors_total{backend,operation,kind}` | counter | Store operation errors by backend, operation, and normalized kind |

The common families are emitted by all first-party queue backends (`sqlite`, `memory`, `postgres`).
Store/SQLite compatibility metrics (sqlite backend):
| Metric | Type | Description |
|---|---|---|
| `hookaido_store_sqlite_write_seconds` | histogram | SQLite write transaction duration (queue mutation paths) |
| `hookaido_store_sqlite_dequeue_seconds` | histogram | SQLite dequeue transaction duration |
| `hookaido_store_sqlite_checkpoint_seconds` | histogram | SQLite WAL checkpoint duration (periodic passive checkpoints) |
| `hookaido_store_sqlite_busy_total` | counter | SQLite busy/locked errors observed in instrumented paths |
| `hookaido_store_sqlite_retry_total` | counter | SQLite begin-transaction retry attempts after busy/locked errors |
| `hookaido_store_sqlite_tx_commit_total` | counter | Committed SQLite transactions in instrumented queue paths |
| `hookaido_store_sqlite_tx_rollback_total` | counter | Rolled-back SQLite transactions in instrumented queue paths |
| `hookaido_store_sqlite_checkpoint_total` | counter | Successful periodic SQLite WAL checkpoints |
| `hookaido_store_sqlite_checkpoint_errors_total` | counter | Failed periodic SQLite WAL checkpoints |
Store/Memory metrics (memory backend):
| Metric | Type | Description |
|---|---|---|
| `hookaido_store_memory_items{state}` | gauge | Current in-memory item count by state (queued, leased, delivered, dead) |
| `hookaido_store_memory_retained_bytes{state}` | gauge | Estimated retained bytes by state (queued, leased, delivered, dead) |
| `hookaido_store_memory_retained_bytes_total` | gauge | Estimated total retained bytes in the memory store |
| `hookaido_store_memory_evictions_total{reason}` | counter | Memory-store evictions by reason (`drop_oldest`, retention evictions, etc.) |
### Backend Metric Expectations

Use backend-agnostic metric families as the default dashboard and alert base:

- `hookaido_store_operation_seconds{backend,operation}`
- `hookaido_store_operation_total{backend,operation}`
- `hookaido_store_errors_total{backend,operation,kind}`
- `hookaido_delivery_dead_by_reason_total{reason}`
Backend-specific coverage:
| Backend | Required/common store metrics | Backend-specific metrics | Notes |
|---|---|---|---|
| sqlite | `hookaido_store_operation_*`, `hookaido_store_errors_total` | `hookaido_store_sqlite_*` | `hookaido_store_sqlite_*` is a compatibility/debug surface for SQLite internals. |
| memory | `hookaido_store_operation_*`, `hookaido_store_errors_total` | `hookaido_store_memory_*` | `hookaido_store_sqlite_*` is intentionally absent. |
| postgres | `hookaido_store_operation_*`, `hookaido_store_errors_total` | none (store internals exposed via common families) | `hookaido_store_sqlite_*` and `hookaido_store_memory_*` are intentionally absent. |
Migration guidance:
- Prefer common store metric families for SLOs, saturation alerts, and cross-backend dashboards.
- Keep `hookaido_store_sqlite_*` for SQLite-only deep diagnostics (for example, WAL/checkpoint lock analysis).
- Treat missing backend-specific series on other backends as "not emitted", not as zero or failure.
### PromQL Examples (Backend-Aware)

Store p95 by backend and operation:

```
histogram_quantile(
  0.95,
  sum by (backend, operation, le) (
    rate(hookaido_store_operation_seconds_bucket[5m])
  )
)
```
Store error rate by backend, operation, and kind:
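A sketch using the documented `hookaido_store_errors_total` family (the rate window is illustrative):

```
sum by (backend, operation, kind) (
  rate(hookaido_store_errors_total[5m])
)
```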
Backend-specific store throughput comparison:
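One way to sketch this with the common `hookaido_store_operation_total` family:

```
sum by (backend) (
  rate(hookaido_store_operation_total[5m])
)
```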
DLQ growth by dead reason:
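A sketch over the documented `hookaido_delivery_dead_by_reason_total` family:

```
sum by (reason) (
  rate(hookaido_delivery_dead_by_reason_total[5m])
)
```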
Alert example (backend-aware store error burst):
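One plausible expression for such an alert; the threshold (0.5 errors/s) and window are placeholders to tune per deployment:

```
sum by (backend) (
  rate(hookaido_store_errors_total[5m])
) > 0.5
```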
Publish metrics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_publish_accepted_total` | counter | Accepted publish mutations |
| `hookaido_publish_rejected_total` | counter | Rejected publish mutations |
| `hookaido_publish_rejected_validation_total` | counter | Rejections: validation errors |
| `hookaido_publish_rejected_policy_total` | counter | Rejections: policy violations |
| `hookaido_publish_rejected_conflict_total` | counter | Rejections: duplicate IDs |
| `hookaido_publish_rejected_queue_full_total` | counter | Rejections: queue at capacity |
| `hookaido_publish_rejected_store_total` | counter | Rejections: store errors |
| `hookaido_publish_scoped_accepted_total` | counter | Accepted scoped (managed) publishes |
| `hookaido_publish_scoped_rejected_total` | counter | Rejected scoped (managed) publishes |
Tracing diagnostics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_tracing_enabled` | gauge | Whether tracing is configured |
| `hookaido_tracing_init_failures_total` | counter | Tracing initialization failures |
| `hookaido_tracing_export_errors_total` | counter | Tracing export errors |
Compatibility/version metrics:
| Metric | Type | Description |
|---|---|---|
| `hookaido_build_info{version=...}` | gauge | Process version label for dashboard/version gating |
| `hookaido_metrics_schema_info{schema=...}` | gauge | Metrics schema version label for compatibility guards |
## Tracing
OpenTelemetry OTLP/HTTP traces for request-level observability. HTTP servers (ingress, Pull API, Admin API) and the outbound push dispatcher client are instrumented.
### Minimal Config
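Mirroring the Quick Start example, a minimal setup only enables tracing and points at a collector:

```
observability {
    tracing {
        enabled on
        collector "https://otel.example.com/v1/traces"
    }
}
```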
### Full Config

```
observability {
    tracing {
        enabled on
        collector "https://otel.example.com/v1/traces"
        url_path "/v1/traces"
        timeout "10s"
        compression gzip  # none or gzip
        insecure off      # allow plain HTTP (dev only)

        # TLS options
        tls {
            ca_file /path/to/ca.pem
            cert_file /path/to/cert.pem
            key_file /path/to/key.pem
            server_name "otel.example.com"
            insecure_skip_verify off
        }

        # Proxy
        proxy_url "http://proxy.internal:3128"

        # Retry on export failure
        retry {
            enabled on
            initial_interval "5s"
            max_interval "30s"
            max_elapsed_time "1m"
        }

        # Custom headers (e.g., for auth)
        header "Authorization" "Bearer otel-token"
        header "X-Custom-Header" "value"
    }
}
```
| Directive | Default | Description |
|---|---|---|
| `enabled` | `off` | Enable/disable tracing |
| `collector` | — | OTLP/HTTP collector endpoint |
| `url_path` | `/v1/traces` | URL path on the collector |
| `timeout` | `10s` | Export timeout |
| `compression` | `none` | `none` or `gzip` |
| `insecure` | `off` | Allow HTTP (non-TLS) transport |
| `proxy_url` | — | HTTP proxy for the exporter |
| `tls.ca_file` | — | CA certificate file for TLS |
| `tls.cert_file` | — | Client certificate file for mTLS |
| `tls.key_file` | — | Client key file for mTLS |
| `tls.server_name` | — | Override the TLS server name |
| `tls.insecure_skip_verify` | `off` | Skip TLS certificate verification |
| `retry.enabled` | `off` | Retry failed exports |
| `retry.initial_interval` | — | First retry delay |
| `retry.max_interval` | — | Maximum retry delay |
| `retry.max_elapsed_time` | — | Total retry time budget |
| `header` | — | Custom HTTP header (repeatable) |
Header entries must be valid HTTP header name/value pairs. Invalid entries fail config validation.
## Health Diagnostics

The Admin API health endpoint (`GET /healthz?details=1`) aggregates observability data:

- Queue state rollups with age/lag indicators
- Backlog trend signals with operator action playbooks
- Tracing counters (init failures, export errors)
- Ingress adaptive-backpressure diagnostics (`adaptive_backpressure_applied_total`, `adaptive_backpressure_by_reason`) and rejection reason counters (`rejected_by_reason`, including `memory_pressure`)
- Delivery diagnostics, including a dead-letter reason breakdown (`dead_by_reason`) for DLQ growth attribution
- Memory-store diagnostics (when the backend is `memory`): `items_by_state`, retained bytes, eviction counters, and `memory_pressure` status/limits/reject counters
- Top route/target backlog buckets

Queue diagnostics are cached (short TTL) and served stale-while-refresh under heavy load to keep control-plane endpoints responsive.
Operational guidance for control-plane responsiveness:
- Keep SLO probes on `GET /healthz` (without details) for the fastest liveness path.
- Use `GET /healthz?details=1` and `GET /metrics` for diagnostics/monitoring; under queue saturation these endpoints prioritize bounded latency over strictly real-time queue snapshots.
## Saturation Notes

Queue saturation analysis showed one hot path in the ingest/admission write flow: with `queue_limits.max_depth` enabled, each enqueue previously executed `COUNT(*)` over active queue states (queued + leased) inside a write transaction.

At high occupancy, that repeated count increased write-transaction time and lock contention (visible in `hookaido_store_sqlite_write_seconds`, `hookaido_store_sqlite_busy_total`, and `hookaido_store_sqlite_retry_total`).

Hookaido now maintains O(1) active-depth counters (`queue_counters`) via SQLite triggers and uses them for `max_depth` admission checks.

For memory-backend deployments, Hookaido also applies a retained-footprint pressure guard and emits `memory_pressure` ingress reject reasons before the process is at risk of hard failure.
To validate improvements in your environment, compare before/after load runs using:
- p95/p99 ingress latency and 503 rate
- `hookaido_store_sqlite_write_seconds` histogram shape
- `hookaido_store_sqlite_busy_total` and `hookaido_store_sqlite_retry_total` growth rate
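For the busy/retry growth-rate comparison, a per-second view can be sketched as:

```
# Busy/locked errors per second
rate(hookaido_store_sqlite_busy_total[5m])

# Begin-transaction retries per second
rate(hookaido_store_sqlite_retry_total[5m])
```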
See Admin API for details.
## Adaptive Backpressure Tuning

Use the dedicated production runbook in Adaptive Backpressure Tuning.

Key principle:

- `defaults.adaptive_backpressure` should react before hard `queue_limits.max_depth` pressure, not after.

Use these series together:

- `hookaido_ingress_adaptive_backpressure_total{reason}`
- `hookaido_ingress_rejected_by_reason_total{reason,status}`
- ingress latency p95/p99 from HTTP telemetry
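As one way of combining these series, the share of ingress rejections attributable to adaptive backpressure can be approximated with (a sketch; the window is illustrative):

```
sum(rate(hookaido_ingress_adaptive_backpressure_applied_total[5m]))
/
sum(rate(hookaido_ingress_rejected_total[5m]))
```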
## Dashboard Compatibility Notes

When dashboards span mixed Hookaido versions (for example v1.2.x and v1.3.x), treat missing metrics as "not emitted" rather than zero:

- Gate rules and panels by `hookaido_metrics_schema_info{schema="1.3.0"} == 1` (or `hookaido_build_info` version labels).
- In PromQL, prefer compatibility-safe expressions (for example `metric OR on() vector(0)`) where appropriate.
- Document the minimum supported Hookaido version per dashboard bundle to avoid false "all good" signals from absent series.
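In practice the two patterns above can look like this (metric and label values are illustrative):

```
# Treat an absent series as zero rather than a gap
sum(rate(hookaido_delivery_dead_by_reason_total[5m])) OR on() vector(0)

# Gate a panel or rule on the metrics schema version
hookaido_metrics_schema_info{schema="1.3.0"} == 1
```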
## Audit Logging

All Admin API and MCP mutations emit structured JSONL audit events (to stderr or the configured runtime log):

```json
{
  "timestamp": "2026-02-09T10:00:00Z",
  "principal": "ops@example.test",
  "role": "operate",
  "tool": "messages_publish",
  "input_hash": "sha256:abc...",
  "result": "ok",
  "duration_ms": 42,
  "metadata": { ... }
}
```
Audit metadata varies by operation:
- Config mutations: `config_mutation` (operation, mode, outcome)
- Runtime control: `runtime_control` (operation, outcome)
- ID-based mutations: `id_mutation` (operation, IDs requested/unique/changed)
- Filter mutations: `filter_mutation` (operation, matched/changed, preview flag)
- Publish: `admin_proxy_publish` (rollback counters, if Admin-proxy mode)