This document explains how the swarm submits proofs, why a submission can get stranded, and the self-healing machinery that recovers stranded proofs without losing work or re-proving anything. It is the operator’s map for the queued-proof flow and the Gate A capacity backstop.
A proving agent does not open a PR per proof. After it verifies a proof
locally, it pushes a queued/prove/<goal>/<agent-id>-<hex> branch (the default
UNSORRY_SUBMIT_MODE=queue) and stops. A separate dispatcher turns those
queued branches into PRs, rate-limited by a submission governor (ADR-058,
SPEC-007-A):
agent --prove ──▶ queued/prove/* branch (locally verified, no PR yet)
│
queue-dispatcher (scheduled, governor-metered)
│
▼
gh pr create + auto-merge ──▶ Gate A re-verifies ──▶ merge
The dispatcher only re-packages an existing branch into a PR — it never re-proves. Gate A re-verifies every proof from scratch under its pinned toolchain, so nothing about soundness is trusted from the producer; the queue is purely a throughput/ordering mechanism.
The governor (ADR-058) bounds how much verifier work is in flight:
| Knob | Default | Meaning |
|---|---|---|
UNSORRY_MAX_OPEN_PROVE_PRS |
40 | Pause dispatch when this many open prove( PRs already exist. |
UNSORRY_MAX_GATE_A_IN_FLIGHT |
20 | Pause dispatch when queued + in-progress Gate A runs reach this. |
UNSORRY_DISPATCH_LIMIT |
1 (10 in the workflow) | Max branches dispatched per pass. |
So a flood of queued work drains in order at Gate A’s pace, instead of swamping the runners all at once.
Two failure modes put a valid proof outside the queue:
prove(...) submission after the cutover. Before the queued
dispatcher existed, agents opened proof PRs directly (feature/goal-*,
prove/*, or a prove(...) title). Those are no longer accepted: repository
admission control (tools/repo/pr_admission.py,
pr-admission workflow) labels a post-cutover direct submission
blocked-direct-submit and closes it.gate-a-replay / gate-a-audit jobs (≈1h each) queue indefinitely behind the
runner cap, and the open-PR count parks the dispatcher’s governor above its
pause threshold. The result is a deadlock: the direct PRs never verify, and
the governor stays paused so the queue never drains either. This is the
#1904 “ruv sledgehammer” case and the capacity issue #1909.Four pieces turn a stranded proof back into a landing one. No proof is ever lost: closing a PR never deletes its branch or commits, and the recovered proof is copied onto the queue before anything is closed.
When admission control closes a direct submission, its comment now offers the re-route command (not just “re-prove”), so a contributor’s direct-submission campaign self-heals into the metered queue instead of stranding.
reroute_stranded.py — re-package without re-provingtools/repo/reroute_stranded.py copies a
stranded PR’s proof files (library/Unsorry/<Camel>.lean,
library/index/<sha>.aisp, goals/<goal>.aisp) onto a fresh
queued/prove/<goal>/reroute-<hex> branch off the current main,
Gate-B-validates the tree, and pushes it:
python3 -m tools.repo.reroute_stranded --pr <n> --push
A self-contained proof (import Mathlib) re-routes cleanly; one that imports a
now-archived Unsorry.* module fails Gate A on dispatch — the safe outcome (it
genuinely no longer builds on current main).
queue-dispatcher — drain the queue automaticallyThe queue-dispatcher workflow runs
swarm/agent.sh --dispatch-queue --once every 15 minutes: it opens a
governor-metered PR for each queued branch and arms auto-merge, so the backlog
(including re-routed proofs) drains without an operator running the dispatcher by
hand. It self-throttles — it dispatches only while in-flight Gate A work is below
the cap and no-ops when full.
It authenticates with the admin REFRESH_TOKEN secret (a PR opened by the
default GITHUB_TOKEN does not trigger Gate A, so it could never merge); when
the secret is unset the job degrades to a report-only notice.
close-superseded — retire the stranded originalstools/repo/close_superseded.py and the
close-superseded workflow close a
stranded direct PR once its goal is proved on main (i.e. its re-route, or a
peer, landed) — and never a PR whose goal is still open. So as the dispatcher
lands re-routed proofs, the superseded originals retire automatically and the
backlog shrinks instead of doubling.
python3 -m tools.repo.close_superseded [--author <login>] [--dry-run]
direct PR stranded ─▶ admission offers re-route ─▶ reroute_stranded.py ─▶ queued/prove/*
│
queue-dispatcher (15m, metered) │
▼
PR ─▶ Gate A re-verifies ─▶ merge ─▶ goal proved
│
close-superseded (hourly) ◀────────────┘
retires the stranded original
Re-routing and dispatching only order the work — they do not create verifier
capacity. The rate-limiting step is Gate A itself: gate-a-replay /
gate-a-audit take ≈1h, and the governor keeps ≤20 in flight, so a deep queue
drains over hours/days. The durable fix is verification throughput — more / larger
runners, or sharding the replay/audit pole — tracked in #1909 (ADR-058
runner-pool segmentation). Per-agent submission quotas (ADR-053/ADR-054) are
the complementary lever so one agent cannot park hundreds of PRs at once.
| Task | Command / setting |
|---|---|
| Enable the scheduled dispatcher | set the REFRESH_TOKEN secret (admin PAT/App) |
| Run a local prover when the scheduled dispatcher is on | ./swarm/supervise.sh --prove (not run.sh — that would start a second dispatcher; see ADR-064 dedup) |
| Dispatch one pass by hand | ./swarm/agent.sh --dispatch-queue --once |
| Re-route one stranded PR | python3 -m tools.repo.reroute_stranded --pr <n> --push |
| Close superseded originals | python3 -m tools.repo.close_superseded [--author <l>] [--dry-run] |
| Loosen / tighten the drain | UNSORRY_DISPATCH_LIMIT, UNSORRY_MAX_OPEN_PROVE_PRS, UNSORRY_MAX_GATE_A_IN_FLIGHT |
| Emergency pause | UNSORRY_SUBMISSION_FREEZE=1 |