Engineering for Agents: Building a Million-Line Codebase with Zero Manual Code
Introduction
Imagine starting a new software project where your primary rule is that you are forbidden from writing a single line of code. No boilerplate, no feature logic, no unit tests, and no CI/CD configuration. Instead, your job is to design the environment, specify the intent, and build the feedback loops that allow an AI agent to do the heavy lifting for you.
By early 2026, this is no longer a thought experiment. OpenAI recently concluded a five-month project where they built and shipped an internal software product with zero lines of human-written code. The resulting repository contains over a million lines of application logic, infrastructure, and tooling, all generated by Codex.
What matters here isn't just the speed (though building a million-line codebase in 20 weeks is staggering) but the fundamental shift in what it means to be an engineer. When you move to an agent-first world, your value moves from the keyboard to the architecture. You stop being a writer and start being a harness designer.
The Shift: From Writing to Steering
In a traditional workflow, you spend the bulk of your day translating requirements into syntax. In an agent-first workflow, the syntax is a commodity. The OpenAI team found that their primary role shifted toward three core activities: designing environments, specifying intent, and building feedback loops.
Early on, progress was slower than expected. This wasn't because the underlying model (GPT-5) lacked the capability to code, but because the environment provided to it was underspecified. The agent didn't have the right tools or abstractions to make progress on high-level goals.
When a task failed, the solution was never to simply "prompt better." Instead, the engineers asked: "What capability is missing, and how do we make it both legible and enforceable for the agent?" This is the essence of harness engineering. You aren't fixing the code; you are fixing the system that produces the code.
The Throughput Revolution
To understand the scale of this shift, consider the numbers. A team of just three engineers (later growing to seven) sustained a throughput of roughly 3.5 pull requests (PRs) per engineer per day. These weren't tiny documentation fixes; they were functional additions to a product used daily by hundreds of internal users.
As the team grew, throughput actually increased, in direct contradiction of Brooks's Law, which holds that adding manpower to a late software project makes it later. In an agent-first environment, communication overhead is mitigated because the "system of record" is the repository itself, and agents don't get tired of reading documentation.
Making the Repository Legible to Agents
If an agent is going to maintain a million lines of code, the codebase must be optimized for the agent’s eyes, not just yours. This concept is called agent legibility.
For a human, code legibility often means clean variable names and helpful comments. For an agent, legibility means that the entire business domain can be reasoned about directly from the repository. If a decision is made in a Slack thread or a Google Doc, it effectively doesn't exist for the agent.
The "Map vs. Manual" Philosophy
One of the most practical lessons learned was how to handle context. If you give an agent a 1,000-page instruction manual, the context window gets crowded. The agent starts pattern-matching locally and loses sight of the broader architecture.
OpenAI moved to a "Table of Contents" approach. They used a single, short file (AGENTS.md) of about 100 lines that acted as a map. This file contained pointers to deeper sources of truth within a structured docs/ directory.
| Context Strategy | Why it Matters |
|---|---|
| Progressive Disclosure | Agents start with a small entry point and fetch more context only when needed. |
| Versioned Artifacts | All plans, architectural decisions, and technical debt are kept in the repo so the agent can "see" them. |
| Mechanical Validation | Linters and CI jobs ensure the documentation matches the actual code behavior. |
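To make this concrete, here is a sketch of what a map-style AGENTS.md might look like. The real file isn't public, so the section names and paths below are illustrative assumptions, not OpenAI's actual layout:

```markdown
# AGENTS.md (illustrative sketch; all paths below are hypothetical)

Start here. Read only what the current task requires; each document links deeper.

- Architecture and layering rules: docs/architecture.md
- Active plans and milestones: docs/plans/
- Accepted design decisions: docs/decisions/
- Known technical debt: docs/tech-debt.md
- How to run tests and linters: docs/testing.md
```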
Enforcing Architecture and "Taste"
In an agent-first world, you cannot micromanage every implementation. If you try to tell the agent exactly how to write every function, you become the bottleneck. Instead, you must enforce invariants.
OpenAI used a rigid architectural model where each business domain was divided into fixed layers. Code could only depend "forward" through these layers. For example:
- Types (The foundation)
- Config
- Repo (Data access)
- Service (Business logic)
- Runtime
- UI (The surface)
This structure was enforced mechanically by custom, agent-generated linters. If an agent tried to make a UI component depend directly on a database repo, the linter would fail the PR and provide the agent with the exact instructions needed to fix the architectural violation. This is how you scale "taste" without being in every code review.
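As a rough sketch of how such a check might work, consider the TypeScript below. The real linters are agent-generated and not public, so the layer allow-list and path conventions here are assumptions; note that the ui entry deliberately omits repo, which is exactly the violation described above:

```typescript
// Hypothetical forward-only dependency check. Each layer lists the
// layers it may import from; anything else fails the lint.
type Layer = "types" | "config" | "repo" | "service" | "runtime" | "ui";

const ALLOWED_DEPS: Record<Layer, Layer[]> = {
  types: [],
  config: ["types"],
  repo: ["types", "config"],
  service: ["types", "config", "repo"],
  runtime: ["types", "config", "service"],
  ui: ["types", "config", "service", "runtime"], // no "repo": UI must go through service
};

// Assumes paths shaped like "src/<domain>/<layer>/file.ts".
function layerOf(path: string): Layer | null {
  const m = path.match(/\/(types|config|repo|service|runtime|ui)\//);
  return m ? (m[1] as Layer) : null;
}

function isAllowedImport(importer: string, imported: string): boolean {
  const from = layerOf(importer);
  const to = layerOf(imported);
  if (from === null || to === null) return true; // outside the layered tree
  return from === to || ALLOWED_DEPS[from].includes(to);
}

// A UI component may call the service layer...
console.assert(isAllowedImport("src/billing/ui/Invoice.tsx", "src/billing/service/invoice.ts"));
// ...but may not reach directly into the repo layer.
console.assert(!isAllowedImport("src/billing/ui/Invoice.tsx", "src/billing/repo/invoice.ts"));
```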
Parse, Don’t Validate
The team also leaned heavily into the "parse, don't validate" pattern. By forcing the agent to parse data into strict shapes (using libraries like Zod) at the boundaries of the system, they ensured that the internal logic was always dealing with valid data. This reduces the state space the agent has to reason about, making hallucinations far less likely.
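Here is a minimal example of the pattern with Zod. The schema and field names are invented for illustration; the point is that parsing happens exactly once, at the boundary:

```typescript
import { z } from "zod";

// Hypothetical boundary schema; fields are illustrative, not from
// OpenAI's codebase.
const Invoice = z.object({
  id: z.string().uuid(),
  amountCents: z.number().int().nonnegative(),
  currency: z.enum(["USD", "EUR"]),
});
type Invoice = z.infer<typeof Invoice>;

// Parse once at the system boundary. Everything past this point works
// with a well-typed Invoice and never re-checks raw input, which
// shrinks the state space the agent must reason about.
function handleRequest(body: unknown): Invoice {
  return Invoice.parse(body); // throws on malformed input
}
```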
The Ralph Wiggum Loop: Feedback as a Tool
How do you ensure quality when humans aren't writing the code? You build a recursive feedback loop. OpenAI refers to this as the "Ralph Wiggum Loop," where the agent is instructed to:
- Execute the task and open a PR.
- Review its own changes locally.
- Request specific agent reviews from other instances of Codex.
- Respond to feedback (human or agent-led).
- Iterate until all reviewers are satisfied.
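A toy driver for such a loop might look like the sketch below. The runAgent and runReviewer helpers are hypothetical stand-ins for invoking Codex instances, not a real API:

```typescript
// Stand-in harness primitives: in a real harness these would shell out
// to a coding agent and to reviewer agent instances.
type Review = { state: "approved" | "changes_requested"; comments: string[] };

async function runAgent(prompt: string): Promise<string> {
  console.log(`[agent] ${prompt.slice(0, 60)}...`);
  return "PR-123"; // pretend the agent opened or updated this PR
}

async function runReviewer(persona: string, pr: string): Promise<Review> {
  console.log(`[review:${persona}] checking ${pr}`);
  return { state: "approved", comments: [] }; // stub verdict
}

// Implement, self-review, gather agent reviews, and iterate until green.
async function reviewLoop(task: string, maxRounds = 5): Promise<void> {
  const pr = await runAgent(`Implement: ${task}. Open a PR and review your own diff.`);
  for (let round = 1; round <= maxRounds; round++) {
    const reviews = await Promise.all(
      ["architecture", "security"].map((persona) => runReviewer(persona, pr))
    );
    if (reviews.every((r) => r.state === "approved")) return; // all satisfied
    const feedback = reviews.flatMap((r) => r.comments).join("\n");
    await runAgent(`Address review feedback on ${pr}:\n${feedback}`);
  }
  throw new Error(`${pr} still not approved after ${maxRounds} rounds`);
}

reviewLoop("add CSV export to the reports page");
```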
Humans may review these PRs, but they aren't required to. Most of the "nitpicking" happens agent-to-agent. This allows the human engineer to stay at a high level of abstraction, focusing on the product's direction rather than syntax errors.
Observability for Agents
One of the most innovative moves was making the application's runtime legible to the agent. If an agent is fixing a UI bug, it needs to see the UI.
OpenAI wired the Chrome DevTools Protocol (CDP) into the agent's runtime. This allowed the agent to take DOM snapshots and screenshots and to navigate the app autonomously. They also provided the agent with a local observability stack, queryable via LogQL and PromQL, for every worktree.
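A minimal way to reproduce the "eyes" half of this setup is to drive a browser over CDP. The sketch below uses Puppeteer, which speaks CDP under the hood, rather than OpenAI's internal wiring; the URL is a placeholder for whatever worktree instance the agent launched:

```typescript
import puppeteer from "puppeteer";

// Capture a screenshot and a serialized DOM snapshot that an agent can
// read back as artifacts when debugging a UI issue.
async function snapshotApp(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });

  await page.screenshot({ path: "app.png", fullPage: true });
  const dom = await page.content(); // full DOM the agent can grep

  await browser.close();
  return dom;
}

snapshotApp("http://localhost:3000").then((dom) =>
  console.log(`captured ${dom.length} bytes of DOM`)
);
```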
Imagine an agent running for six hours straight while you sleep. It launches a version of the app, queries the logs to find a service startup delay, identifies a bottleneck in a trace, writes a fix, and validates that the startup time is now under 800ms. This is the real value of harness engineering: you provide the eyes and ears, and the agent provides the labor.
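To ground that 800ms check, here is one way an agent could assert the budget against a local Prometheus instance. The metric name, endpoint, and threshold wiring are assumptions for illustration, not details from the OpenAI setup:

```typescript
// Assumed local Prometheus endpoint and a hypothetical startup-latency
// histogram; the PromQL asks for the p95 over the last five minutes.
const PROM_URL = "http://localhost:9090/api/v1/query";
const QUERY =
  "histogram_quantile(0.95, sum(rate(app_startup_seconds_bucket[5m])) by (le))";

async function startupP95Seconds(): Promise<number> {
  const res = await fetch(`${PROM_URL}?query=${encodeURIComponent(QUERY)}`);
  const body: any = await res.json(); // Prometheus instant-query response
  return parseFloat(body.data.result[0]?.value[1] ?? "NaN");
}

startupP95Seconds().then((p95) => {
  console.log(`p95 startup: ${p95.toFixed(3)}s`);
  if (!(p95 < 0.8)) throw new Error("startup budget exceeded (>= 800ms)");
});
```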
Tecyfy Takeaway
The transition to agent-first engineering isn't about replacing engineers; it's about upgrading their toolkit. To stay relevant in this new landscape, you should focus on these three actionable areas:
- Build the Harness, Not the Feature: Spend your time creating the linters, schemas, and test environments that make it impossible for an agent to write bad code.
- Prioritize Repository Legibility: Treat your documentation as the system of record. If an architectural decision isn't in a markdown file in the repo, it doesn't exist for the agent.
- Enforce Strict Boundaries: Use rigid layering and "parse, don't validate" patterns. The more constraints you place on the architecture, the more freedom the agent has to innovate within those boundaries.
What matters most is recognizing that human time and attention are your only truly scarce resources. By shifting from writing code to engineering the harness that produces it, you can achieve a level of leverage that was previously impossible.
