Adaptive Backpressure Tuning¶

This guide provides a production-focused tuning workflow for defaults.adaptive_backpressure.

Goal: - stabilize ingress tail latency (p95/p99) before hard queue saturation - reduce hard queue_full rejects (503) without over-throttling ingress

Prerequisites¶

Enable metrics and collect:

hookaido_queue_total
hookaido_queue_ready_lag_seconds
hookaido_queue_oldest_queued_age_seconds
hookaido_ingress_adaptive_backpressure_applied_total
hookaido_ingress_adaptive_backpressure_total{reason}
hookaido_ingress_rejected_by_reason_total{reason,status}

Also capture ingress request latency percentiles from your HTTP telemetry (p95, p99).

v1.5 Decision (2026-02-14)¶

Project decision for v1.5:

Runtime default remains defaults.adaptive_backpressure.enabled off.
This is intentional: adaptive pressure is a workload/SLO policy lever, not a universal safe default.
Recommended opt-in starting profile for enterprise mixed workloads:

defaults {
  adaptive_backpressure {
    enabled on
    min_total 400
    queued_percent 88
    ready_lag 45s
    oldest_queued_age 90s
    sustained_growth on
  }
}

Rationale:

Mixed A/B saturation runs show adaptive mode can shift rejects away from hard queue_full and improve lag/age behavior.
The same runs can also shift tail latency/throughput, so rollout must be SLO-driven.
Single-host benchmark hardware is suitable for relative same-host A/B comparisons, but not as a universal default reference across environments.

Starting Profiles¶

Pick one profile and adjust from there.

Profile	Use when	`min_total`	`queued_percent`	`ready_lag`	`oldest_queued_age`	`sustained_growth`
`balanced`	default for most teams	`200`	`80`	`30s`	`60s`	`on`
`latency_first`	strict p95/p99 protection	`100`	`72`	`15s`	`30s`	`on`
`throughput_first`	fewer early rejects, more queue elasticity	`400`	`88`	`45s`	`90s`	`on`

Example (latency_first):

defaults {
  adaptive_backpressure {
    enabled on
    min_total 100
    queued_percent 72
    ready_lag 15s
    oldest_queued_age 30s
    sustained_growth on
  }
}

Tuning Loop¶

Run each step with production-like traffic for at least 15-30 minutes.

Set one starting profile.
Observe latency (p95/p99), adaptive rejects, and hard queue rejects.
Change one parameter at a time.
Keep the change only if tail latency improves without unacceptable throughput loss.

Decision Matrix¶

If p95/p99 rises and queue_full rejects increase before adaptive rejects appear: - lower queued_percent by 3-5 - lower ready_lag by 5-10s - lower oldest_queued_age by 10-20s - optionally lower min_total by 50-100

If adaptive rejects are high but queue lag/age stay low: - raise queued_percent by 3-5 - raise ready_lag by 5-10s - raise oldest_queued_age by 10-20s

If reason="sustained_growth" dominates during normal bursts: - keep sustained_growth on - raise min_total gradually so short bursts do not trigger early

Guardrails¶

Keep hard queue_limits.max_depth as the final safety boundary.
Avoid large jumps; use small deltas and re-measure.
In bursty enterprise workloads, keep sustained_growth on unless you have strong evidence to disable it.
Treat adaptive rejects as an intentional control signal, not only as an error.

Benchmark Alignment¶

Use Pull benchmarks (make bench-pull*, make bench-pull-mixed*) to compare internal queue/drain optimization slices.

For reproducible adaptive off vs on saturation evidence (mixed by default, optional pull reference; with final metrics/health artifacts and comparison tables), use make adaptive-ab. If your host does not reach sustained pressure with defaults, use make adaptive-ab-mixed-saturation. For mixed contention tracking (#55), include hookaido_pull_ack_conflict_total / hookaido_pull_acked_total trend checks in the same A/B slice.

Use production-like ingress load tests for threshold tuning decisions. The benchmark suite is best for relative runtime regressions/improvements, while threshold tuning depends on workload shape and SLO targets.

Dashboard Compatibility¶

When dashboards span versions, gate adaptive panels/alerts with:

hookaido_metrics_schema_info{schema="1.3.0"} == 1

This avoids interpreting missing pre-1.3.0 series as zero.