Status: In Progress · Updated Jan 10, 2026

Decoupling Completion from Correctness

Evidence-Gated Multi-Agent Code Generation Under Repository Constraints

LLM Agents · Code Generation · Governance · Security · Evaluation

Abstract

Modern LLM coding assistants are excellent at finishing requests, but real engineering failures happen when a “finished” change quietly violates a repository’s constraints: module boundaries, security posture, test contracts, and operational assumptions. This working draft proposes a governance-first architecture that makes repository fitness the stopping condition. It combines local grounding (HugeContext), an Agent–Auditor loop (HugeCode), and a deterministic Gatekeeper that enforces fitness functions via static analysis, tests, and policy checks.

Key Contributions

  1. Reframe completion as provisional and make repository fitness the stopping condition
  2. Governance-first architecture combining grounding, adversarial auditing, and deterministic gates
  3. HugeContext: local repository grounding for constraint-relevant evidence
  4. HugeCode: Agent–Auditor loop designed to resist completion bias and security shortcuts
  5. Gatekeeper: deterministic enforcement via tests, static analysis, and policy checks
  6. Evaluation design centered on real repository constraints and observed failure modes

Why This Matters

In production repositories, plausibly correct code is not enough. The expensive failures come from drift: shortcuts that bypass conventions, security posture, or test contracts. This work proposes an evidence-gated workflow that preserves speed while making “done” contingent on repository fitness.

Overview

LLM coding assistants are good at producing output that looks finished. But a production repository is not a blank page: it carries constraints that rarely fit into a single prompt. When an assistant optimizes for “completion”, it tends to drift from the repository’s truth (conventions, dependencies, security posture, and test contracts).

This draft proposes a governance-first workflow where completion is provisional until verified against explicit fitness functions.

Architecture (High Level)

The proposed system has three cooperating parts (a minimal sketch of how they compose follows the list):

  • HugeContext (Grounding): retrieves constraint-relevant evidence from the repository (module boundaries, patterns, policies) so changes are anchored in local truth.
  • HugeCode (Agent–Auditor Loop): generates candidates, then adversarially audits them to resist “consensus-by-completion” and surface hidden risks.
  • Gatekeeper (Deterministic): enforces repository fitness functions (tests, static analysis, policy checks) and blocks merges without evidence.
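The draft specifies these parts only at the architecture level. As one possible illustration, they could compose as a loop in which grounding feeds both generation and auditing, and only the deterministic gate can declare the change done. The interface names below (retrieve, propose, audit, run) and the loop structure are assumptions made for this sketch, not a published API.

```python
# Minimal composition sketch. HugeContext, HugeCode, and Gatekeeper are the roles
# described above; the interfaces and loop structure are illustrative assumptions.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class EvidencePack:
    invariants: list[str] = field(default_factory=list)        # invariants the change relies on
    supporting_files: list[str] = field(default_factory=list)  # repository evidence for the design
    checks_passed: list[str] = field(default_factory=list)     # fitness functions that ran and passed
    open_risks: list[str] = field(default_factory=list)        # residual risks and mitigations


class Grounder(Protocol):      # HugeContext role: local repository grounding
    def retrieve(self, task: str) -> EvidencePack: ...


class AgentAuditor(Protocol):  # HugeCode role: generate candidates, then audit them
    def propose(self, task: str, evidence: EvidencePack) -> str: ...
    def audit(self, patch: str, evidence: EvidencePack) -> list[str]: ...


class Gate(Protocol):          # Gatekeeper role: deterministic fitness functions
    def run(self, patch: str, evidence: EvidencePack) -> bool: ...


def run_change(task: str, grounder: Grounder, agent: AgentAuditor,
               gate: Gate, max_rounds: int = 3) -> str | None:
    """Completion is provisional: a patch only counts as done if the gate passes."""
    evidence = grounder.retrieve(task)             # anchor the change in local truth
    for _ in range(max_rounds):
        patch = agent.propose(task, evidence)      # candidate change
        findings = agent.audit(patch, evidence)    # adversarial audit of that candidate
        if findings:                               # resist consensus-by-completion
            task += "\nAddress audit findings: " + "; ".join(findings)
            continue
        if gate.run(patch, evidence):              # tests, static analysis, policy checks
            return patch                           # repository-fit, with evidence attached
    return None                                    # nothing merges without evidence
```

The key design choice the sketch tries to capture is that neither the agent nor the auditor can declare success; only the gate’s verdict ends the loop.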

What “Evidence-Gated” Means

Instead of treating a natural-language answer as “done”, the system produces an evidence pack alongside the proposed change (a hypothetical example follows the list):

  • What invariants are being relied on
  • What repository evidence supports the design
  • What fitness functions were run and passed
  • What risks remain and how they’re mitigated
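To make this concrete, one hypothetical form of an evidence pack is a JSON artifact emitted next to the proposed patch. The field names below mirror the four bullets above; every specific value (change name, file paths, check names) is invented for the example and is not a fixed schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical evidence-pack artifact written alongside a proposed patch.
# The schema mirrors the four bullets above; all concrete values are invented.
evidence_pack = {
    "change": "add-rate-limit-to-public-api",
    "invariants": [
        "handlers under api/ never import internal billing modules directly",
    ],
    "repository_evidence": [
        "api/middleware/auth.py: follows the existing middleware pattern",
        "docs/security.md: rate limiting required on unauthenticated routes",
    ],
    "fitness_functions": {
        "unit-tests": "passed",
        "type-check": "passed",
        "dependency-policy": "passed",
        "secrets-scan": "passed",
    },
    "residual_risks": [
        "rate thresholds derived from staging traffic only; revisit after rollout",
    ],
    "generated_at": datetime.now(timezone.utc).isoformat(),
}

with open("evidence_pack.json", "w") as fh:
    json.dump(evidence_pack, fh, indent=2)
```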

This shifts the stopping condition from “a coherent patch exists” to “the patch is repository-fit”.

Fitness Functions (Examples)

The Gatekeeper is intentionally boring and deterministic. Typical checks include (a minimal runner is sketched after this list):

  • Unit/integration tests
  • Type checks and linting
  • Dependency and licensing policies
  • Secrets scanning and security linters
  • Repo-specific checks (conventions, build steps, CI workflows)
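In its simplest form, such a gate is a fixed list of commands whose exit codes decide the outcome. The sketch below uses common Python tooling (pytest, mypy, ruff, gitleaks) purely as placeholders; a real repository would substitute its own test suite, linters, and policy checks.

```python
import subprocess

# Minimal deterministic gate: run each check, record exit codes, block on any failure.
# The commands are placeholders for whatever the repository already uses in CI.
FITNESS_FUNCTIONS = [
    ("unit-tests", ["pytest", "-q"]),
    ("type-check", ["mypy", "."]),
    ("lint", ["ruff", "check", "."]),
    ("secrets-scan", ["gitleaks", "detect"]),
]


def run_gate() -> bool:
    results = {}
    for name, cmd in FITNESS_FUNCTIONS:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = proc.returncode == 0
        print(f"{name}: {'passed' if results[name] else 'FAILED'}")
    return all(results.values())  # merge is blocked unless every check passes


if __name__ == "__main__":
    raise SystemExit(0 if run_gate() else 1)
```

Because the verdict is just an aggregation of exit codes, it is reproducible in CI and cannot be negotiated away by the agent.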

Status & Next Steps

This is a working draft. The benchmark harness and fully reproducible figures are in progress.

See the Updates section for progress notes.

Software Availability

Feedback

If you have examples of “looks correct but breaks the repo” failures, or you want to review the draft, contact me.


Updates


Paper draft progressing

Made significant progress on the RAG evaluation framework paper. Case studies from enterprise deployments are coming together.

Get Notified on Release

Interested in early access, collaboration, or providing feedback on the draft? Reach out directly.
