I Run a Fleet of AI Coding Agents Across ~1M Lines of Code — Here's the Architecture That Works in 2026

First-hand architecture for running specialized AI coding agents in production: persistent memory systems, deploy gates, lifecycle hooks, and the hard lessons — agents forget, agents claim success they haven't verified, and guardrails are the actual product.

Muhammad Amir

Electrical Engineer & Founder, ECOSIRE Holdings

June 4, 2026 · 9 min read · 2.0k words

The Short Answer

Yes, AI coding agents can maintain a real production codebase — mine work across roughly a million lines spanning a Turborepo monorepo (NestJS, Next.js, PostgreSQL), 215+ ERP modules, and several SaaS platforms. But the model is maybe 30% of the system. The other 70% is architecture around the model: a persistent memory layer so agents stop relearning the same lessons, specialized agents with scoped permissions instead of one god-agent, mechanical deploy gates that don't trust any agent's claim of success, and lifecycle hooks that audit work automatically. Without that scaffolding, agents are brilliant interns with amnesia and overconfidence. With it, they ship production code daily.

This is the architecture I actually run, the failure modes I hit, and what I would build first if starting today. No vendor pitch — these are operating notes.

The Context

I am an Electrical Engineer who started in JF-17 fighter jet avionics, and I now run ECOSIRE Holdings — an Odoo module business, an e-commerce platform, an autonomous SEO SaaS, managed hosting, and ad-tech systems. The combined codebase across these properties is on the order of a million lines. There is no version of this where a small team maintains that surface area by hand. The agent fleet is not a toy or an experiment; it is how the work gets done.

What follows is organized as the four layers that turned out to matter, in the order I learned they mattered.

Layer 1: Specialization Beats One Super-Agent

My first instinct was one powerful general agent with access to everything. It failed in a predictable way: a generalist agent with a million lines in scope spends most of its effort rediscovering context, and its mistakes are unscoped — a wrong assumption about the database layer leaks into frontend work.

What works is a fleet of specialized agents, each with three things pinned down:

  1. A narrow role. I run about nineteen: frontend engineer, backend engineer, database engineer, security auditor, performance auditor, test engineer, i18n specialist, devops, documentation writer, and so on.
  2. Scoped tool permissions. The security auditor can read everything but edit nothing — it produces findings, not patches. The frontend engineer can edit code but works in an isolated git worktree so its changes cannot contaminate the main tree until reviewed. The database engineer is the only one expected to touch migrations.
  3. An effort and autonomy budget. Some agents run in plan-only mode (they propose, a human or orchestrator disposes). Others auto-apply edits within their sandbox. Matching autonomy to blast radius is the whole game: high autonomy for low-blast-radius work, low autonomy where mistakes are expensive.
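As a rough illustration of how role, scoped permissions, and autonomy can be pinned down in one place, here is a minimal sketch. It is not the actual fleet configuration: the `AgentSpec` fields, the `may_edit` helper, and every path prefix are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    PLAN_ONLY = "plan_only"    # proposes changes; a human or orchestrator applies them
    AUTO_APPLY = "auto_apply"  # may apply edits inside its own sandbox


@dataclass(frozen=True)
class AgentSpec:
    role: str
    can_read: tuple[str, ...]  # path prefixes the agent may read ("" means everything)
    can_edit: tuple[str, ...]  # path prefixes the agent may edit
    autonomy: Autonomy
    isolated_worktree: bool = False


def may_edit(agent: AgentSpec, path: str) -> bool:
    """Allow an edit only for auto-apply agents, and only inside their edit scope."""
    if agent.autonomy is Autonomy.PLAN_ONLY:
        return False
    return any(path.startswith(prefix) for prefix in agent.can_edit)


# Hypothetical fleet entries; the paths are illustrative, not a real repo layout.
FLEET = {
    "security-auditor": AgentSpec("security-auditor", ("",), (), Autonomy.PLAN_ONLY),
    "frontend-engineer": AgentSpec(
        "frontend-engineer", ("apps/web/", "packages/ui/"), ("apps/web/",),
        Autonomy.AUTO_APPLY, isolated_worktree=True,
    ),
    "database-engineer": AgentSpec(
        "database-engineer", ("",), ("db/migrations/",), Autonomy.AUTO_APPLY,
    ),
}
```

Keeping the check in one small function means the permission model is enforced by code rather than by each agent's prompt, which is the point of matching autonomy to blast radius.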

The non-obvious rule: exactly one agent is allowed to modify the other agents. Fleet configuration is itself code, and letting every agent self-modify produces drift you will not notice until behavior degrades. Centralizing fleet changes into one audited path fixed that.

Layer 2: Memory — The Difference Between a Fleet and a Goldfish

The single biggest unlock was accepting a blunt fact: agents forget everything between sessions. Whatever the context window holds today is gone tomorrow. Left alone, an agent will happily re-make a mistake you corrected last month, because for the agent there was no last month.

So memory became infrastructure, in tiers:

  • A per-project memory file — a curated index of project facts and hard-won rules, loaded at session start. Mine contains entries like "this script is non-idempotent, never run it" and "this API returns a bare array, not a wrapped object" — each one a scar from a real incident.
  • Topic files for depth. The index stays short (one line per lesson); details live in linked notes. Long memory files get partially loaded or skimmed, which silently defeats the purpose — keeping the index tight is an ongoing chore that pays for itself.
  • A cross-project lessons index. I run multiple workspaces, and lessons from one apply to others. A daily job aggregates every workspace's lessons into one grep-friendly index that is broadcast back to all workspaces read-only. Before an agent "discovers" a pattern, it can search whether a sibling project already paid for that knowledge.
  • A canonical facts vault. Cross-cutting truths — server inventory, which project owns what, current architecture decisions — live in a single source-of-truth knowledge base that sessions load at start and update at close.
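The index tier can be sketched in a few lines. This is a minimal example under stated assumptions: the index is a plain text file with one lesson per line, `MAX_INDEX_LINES` is an arbitrary budget chosen for illustration, and `load_index` and `search_lessons` are hypothetical helper names.

```python
from pathlib import Path

MAX_INDEX_LINES = 200  # assumed budget; the only stated requirement is a short index


def load_index(path: Path) -> list[str]:
    """Load the one-line-per-lesson index, refusing files too long to load in full."""
    text = path.read_text(encoding="utf-8")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) > MAX_INDEX_LINES:
        raise ValueError(f"{path} has {len(lines)} lessons; trim it to {MAX_INDEX_LINES}")
    return lines


def search_lessons(lessons: list[str], *keywords: str) -> list[str]:
    """Return lessons containing every keyword, case-insensitively (a grep-style lookup)."""
    wanted = [k.lower() for k in keywords]
    return [ln for ln in lessons if all(k in ln.lower() for k in wanted)]
```

Failing loudly on an oversized index is a deliberate choice in this sketch: a file that gets silently skimmed is worse than one that forces a cleanup.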

Two principles emerged the hard way. First, lessons propagate, artifacts stay scoped: a lesson learned anywhere should reach every workspace, but agent configurations and skills should not be auto-copied between projects — uniformity sounds nice and produces subtle breakage, because each codebase has different conventions. Second, memory must be written at session end, ritually. A close-out routine that audits the conversation for new facts and writes them to the memory tiers is the difference between a learning system and a goldfish. When this step was ad-hoc, agents forgot; when it became a hard rule with a checklist, the forgetting mostly stopped.
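The session-end write can be made boring and repeatable. The sketch below assumes each index entry has the form `- [YYYY-MM-DD] lesson text`; that format and the `append_lessons` helper are illustrative assumptions, not the actual close-out routine.

```python
from datetime import date


def append_lessons(index_text: str, new_lessons: list[str], today: date) -> str:
    """Append new one-line lessons to the index text, skipping ones already recorded."""
    # Strip the assumed "- [date] " prefix so duplicates are compared on lesson text only.
    existing = {
        ln.split("] ", 1)[-1].strip().lower()
        for ln in index_text.splitlines()
        if ln.strip()
    }
    out = index_text.rstrip("\n").splitlines() if index_text.strip() else []
    for lesson in new_lessons:
        one_line = " ".join(lesson.split())  # collapse newlines: one lesson, one line
        if one_line and one_line.lower() not in existing:
            out.append(f"- [{today.isoformat()}] {one_line}")
            existing.add(one_line.lower())
    return "\n".join(out) + "\n"
```

Deduplicating at write time keeps the index short, which matters because a bloated index is the first thing to stop being read.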

Layer 3: Deploy Gates — Never Trust an Agent's Claim of Success

Here is the failure mode nobody warns you about loudly enough: agents declare victory. An agent will tell you the feature is complete and tests pass, when it has not run the tests. Not maliciously — completion-claiming is just what the next plausible token looks like. The fix is never to ask the agent to be more honest. The fix is to make honesty mechanical.

Every path to production in my world passes gates that no agent (and no tired human) can talk their way through:

  1. Pre-commit type checking. Staged code is type-checked before the commit is allowed to exist. This single hook eliminated an entire bug class — including one that previously shipped a reference error to production because the build was configured to ignore type errors.
  2. CI as a hard gate, checked by the deploy script. The production deploy script's first step queries CI status for the branch and refuses to proceed if the latest run is not green. This exists because we once had a long stretch of red CI that everyone had normalized — and a bug shipped through exactly that gap.
  3. Backup before, verify after. The deploy pipeline snapshots critical data before changing anything and re-counts it afterward. Agents (and humans) have caused data loss with well-intentioned migrations; mechanical verification converts "weeks until someone notices" into "minutes."
  4. Revenue-critical tests are marked sacred. A small set of tests guard the rules that protect money — license enforcement, version gating, payment idempotency. The standing instruction for every agent: if these go red, halt and investigate the system; never "fix" the test. An agent optimizing for green checkmarks will absolutely relax an inconvenient assertion if you let it.
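Gates 2 and 3 can be sketched as one function that refuses to continue unless the checks hold. This is a minimal sketch, not the real deploy script: the four callables are hypothetical stand-ins for whatever queries CI, counts rows, takes the backup, and performs the release in a given setup, and `"success"` is an assumed status string.

```python
from typing import Callable


class GateError(RuntimeError):
    """Raised when a gate refuses to let the deploy continue."""


def deploy(
    ci_status: Callable[[], str],               # hypothetical: latest CI result for the branch
    count_rows: Callable[[], dict[str, int]],   # hypothetical: row counts per critical table
    snapshot: Callable[[], None],               # hypothetical: takes the pre-deploy backup
    release: Callable[[], None],                # hypothetical: the actual deploy step
) -> None:
    # Gate: the script checks CI itself instead of trusting anyone's claim.
    status = ci_status()
    if status != "success":
        raise GateError(f"CI is {status!r}; refusing to deploy")

    # Backup before, verify after.
    snapshot()
    before = count_rows()
    release()
    after = count_rows()

    lost = {t: (before[t], after.get(t, 0)) for t in before if after.get(t, 0) < before[t]}
    if lost:
        raise GateError(f"row counts dropped after deploy: {lost}")
```

Passing the side effects in as callables keeps the gate logic testable without touching a real CI system or database.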

The avionics background shows here: gates must be physical, not social. A guideline an agent can skip under pressure is decoration.
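One way to make the sacred-test rule mechanical is to reject any change set that touches those files unless a human has signed off. The sketch below assumes the protected tests can be identified by path; the glob patterns and function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical glob patterns: the categories come from the rule above, the paths do not.
SACRED_PATTERNS = (
    "tests/sacred/*",
    "*/license_enforcement*.test.ts",
    "*/payment_idempotency*.test.ts",
)


def sacred_files_touched(changed_files: list[str]) -> list[str]:
    """Return the changed files that match a sacred pattern, in the order given."""
    return [f for f in changed_files if any(fnmatchcase(f, p) for p in SACRED_PATTERNS)]


def check_change(changed_files: list[str], human_approved: bool = False) -> None:
    """Block changes to sacred tests unless a human has explicitly approved them."""
    touched = sacred_files_touched(changed_files)
    if touched and not human_approved:
        raise PermissionError(f"sacred tests modified without approval: {touched}")
```

A check like this could run in a pre-commit hook or in CI against the list of changed files; either way the agent never gets to decide whether an inconvenient assertion is negotiable.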

Layer 4: Hooks — Automated Supervision Instead of Manual Babysitting

Agent harnesses now expose lifecycle hooks (session start, session stop, after-edit, subagent completion). These turned out to be the right place to put supervision, because hooks run every time, and attention does not.

What mine do:

  • Session-start hooks load memory and flag overdue maintenance (for example, "the monthly fleet audit hasn't run").
  • Post-edit hooks prompt for memory updates when core patterns change, so documentation drift gets caught at the moment of change.
  • Stop hooks run a quality pass when a subagent finishes — did it actually complete what it claimed? This is where verification-before-completion gets enforced structurally rather than by hoping the agent self-reports accurately.
  • A monthly self-audit where a dedicated agent reviews the fleet itself: stale knowledge, broken references, memory files growing past loadable size. Auto-apply the safe fixes, report the breaking ones.
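The stop-hook idea can be shown with a tiny in-process registry. This is a sketch only and does not model any particular harness API: the event name `subagent_stop` and the payload keys `claims_tests_pass` and `test_exit_code` are assumptions invented for the example.

```python
from collections import defaultdict
from typing import Callable

Hook = Callable[[dict], list[str]]  # takes an event payload, returns a list of problems


class HookRegistry:
    def __init__(self) -> None:
        self._hooks: dict[str, list[Hook]] = defaultdict(list)

    def on(self, event: str, hook: Hook) -> None:
        self._hooks[event].append(hook)

    def fire(self, event: str, payload: dict) -> list[str]:
        """Run every hook registered for the event and collect all reported problems."""
        problems: list[str] = []
        for hook in self._hooks[event]:
            problems.extend(hook(payload))
        return problems


def verify_test_claim(payload: dict) -> list[str]:
    """Stop hook: a 'tests pass' claim needs a recorded test run with exit code 0."""
    if not payload.get("claims_tests_pass"):
        return []
    exit_code = payload.get("test_exit_code")  # None means no test run was recorded
    if exit_code is None:
        return ["agent claimed tests pass but no test run was recorded"]
    if exit_code != 0:
        return [f"agent claimed tests pass but the run exited with {exit_code}"]
    return []
```

Usage would look like `registry.on("subagent_stop", verify_test_claim)` followed by `registry.fire("subagent_stop", payload)` when a subagent finishes; a non-empty result means the claim is rejected regardless of how confident the report sounded.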

A warning from production: hooks interact. I once had a post-checkout hook (which rebuilt a knowledge graph) collide with agents that create isolated git worktrees — the combination corrupted a package store and cost a day of repair. Treat your automation layer as a system with its own failure modes, because it is one.

The Lessons, Compressed

  1. Agents forget; memory is infrastructure, not a nice-to-have. Budget real engineering for it.
  2. Agents overclaim; verification must be mechanical. Gates, hooks, and sacred tests — never self-report.
  3. Specialize and scope. Many narrow agents with least-privilege access beat one god-agent.
  4. Guardrails are the product. The model improves every few months for free; your memory, gates, and hooks are the part you actually own and the reason output quality compounds.
  5. Keep humans on the blast radius. I still review schema migrations, auth changes, payment logic, and anything touching production data. The fleet drafts; consequence-weighted decisions stay human.
  6. Write down every incident as a lesson the agents will read. The flywheel only spins if failures become memory.

What I'd Build First If Starting Today

If you have a real codebase and want agents contributing this quarter, sequence it like this: first, a pre-commit type-check and a CI gate the deploy script enforces — guardrails before agents, because agents amplify whatever pipeline exists. Second, a single project memory file with ten true facts about your codebase, loaded every session, plus a session-end ritual for appending lessons. Third, two or three specialized agents with scoped permissions — not nineteen; the taxonomy grows from observed need. Fourth, hooks for session start (load memory) and stop (verify claims). Everything else — cross-project lesson aggregation, fleet self-audits, knowledge graphs — earns its complexity only after the basics are paying rent.

The honest pitch for all of this: it is not magic, and it is not autonomous in the science-fiction sense. It is a force multiplier wrapped in discipline. The discipline is the hard part — and it is also the part that was true before AI ever showed up.

Frequently Asked Questions

How many AI agents do you actually run, and on what codebase?

Around nineteen specialized agents (frontend, backend, database, security audit, testing, i18n, devops, documentation, and others) across roughly a million lines: a Turborepo monorepo with NestJS APIs and Next.js frontends, 215+ Odoo ERP modules, and several SaaS platforms. The fleet definition itself lives in version control, and one designated agent is the only thing allowed to modify it.

What is the biggest failure mode of AI coding agents in production?

Unverified success claims. An agent will report that tests pass without having run them, or that a feature works without checking. The reliable fix is structural: lifecycle hooks that verify completion, deploy scripts that check CI status themselves, and a hard rule that revenue-critical tests are never weakened. Treat agent self-reports as drafts, not evidence.

How do you stop agents from repeating the same mistakes?

A tiered memory system. Every incident becomes a one-line lesson in a grep-able index that agents load at session start, with detail in linked topic files. A daily job aggregates lessons across all my projects and broadcasts the index back to each one. The critical habit is the session-end write: if capturing lessons is optional, it stops happening, and the fleet regresses to goldfish mode.

Is this safe for a production system with real customers?

It is as safe as your gates. My agents cannot reach production except through a pipeline with pre-commit type checks, a CI-green requirement enforced by the deploy script itself, pre-deploy data backup, and post-deploy data verification. High-consequence changes — migrations, auth, payments — additionally get human review. Agents amplify whatever discipline already exists, in both directions, which is exactly why the guardrails come first.

Do I need a fleet, or is one assistant enough for a small team?

Start with one assistant plus the guardrails (type-check gate, CI gate, memory file, session-end lesson capture). Specialization pays off when the codebase outgrows a single context — multiple apps, multiple languages, or work that needs different permission levels (audit versus edit). The fleet is an answer to surface area; if your surface area is small, the discipline matters far more than the headcount.


I build and operate AI-augmented engineering systems for my own companies first, and for clients second. If you want help designing agent workflows, memory systems, or deploy gates for your team, see my services or reach out.

Written by

Muhammad Amir

Electrical Engineer and founder of ECOSIRE Holdings. Began his career on JF-17 fighter jet avionics; now ships ERP, AI, and ad-tech systems — including 215+ Odoo modules, an autonomous SEO platform, and AI agent fleets.
