---
title: "Agent Teams and Parallel Development at Scale"
date: "2026-04-05"
description: "How large software teams are using AI agents, agent harnesses, and agent memory to work on large codebases in parallel."
tags: ["agents", "agent-teams", "parallel-development", "AI", "software-engineering"]
type: research
topic: "Agent Teams"
author: "Cash"
aiModel: "research"
draft: false
---
Agent Teams and Parallel Development at Scale
Something shifted in early 2026. The conversation changed from "when will AI be good enough to write real code?" to "how do we stop it from writing bad code?" The answer turned out to be what it always is in software: an engineering problem.
The discipline that emerged is called harness engineering — and it's rapidly becoming the defining practice of the AI-native software team.
The Parallel Development Problem
Large codebases and parallel development have always been hard. Merge conflicts, architectural drift, context fragmentation, integration risk, review bottlenecks. AI agents don't eliminate these problems — they amplify them. Two agents editing the same file can create merge chaos faster than two humans ever could.
But AI agents also create new possibilities. They can work through the night. They can run in isolated environments. They can coordinate through structured protocols in ways humans can't. The question isn't whether agents can work in parallel — it's how to make parallel agent workflows reliable, coherent, and productive.
The answer, as of April 2026, has crystallized around a set of converging practices.
Harness Engineering: The New Discipline
The term "harness engineering" was coined by Viv Trivedy and popularized by Mitchell Hashimoto:
"Every time you find an agent makes a mistake, you take the time to engineer a solution so that it can never make that mistake again."
The core equation: coding agent = AI model(s) + harness. The model is the engine. The harness is everything else — the system prompts, the tools, the memory, the feedback loops, the guardrails.
Philipp Schmid at HuggingFace frames it as OS-level infrastructure: Model=CPU, Context=RAM, Harness=OS, Agent=Application. The harness implements context engineering: compaction, state offloading, sub-agent isolation. And he notes something crucial: "The difference between top-tier models on static leaderboards is shrinking — the real gap appears after 50+ tool calls."
OpenAI's own harness engineering post describes building an internal product with zero manually-written code — roughly one million lines over five months with just three engineers driving Codex agents. Average throughput: 3.5 PRs per engineer per day. Single Codex runs work on tasks for upwards of six hours, often overnight. The AGENTS.md file serves as a table of contents (not an encyclopedia) — roughly 100 lines pointing to a structured docs/ directory.
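As a sketch, an AGENTS.md in this table-of-contents style might look like the following. The file names and rules here are hypothetical, invented to illustrate the "pointer file, not encyclopedia" idea rather than copied from OpenAI's actual setup:

```markdown
# AGENTS.md

A table of contents for agents working in this repo. Details live in docs/.

## Where to look
- Architecture overview: docs/architecture.md
- Coding conventions: docs/conventions.md
- How to run tests: docs/testing.md
- Deployment and CI: docs/ci.md

## Universal rules (keep this list short)
- Run the linter and test suite before opening a PR.
- Never edit generated files under gen/.
```

The point is that every line applies to every task; anything task-specific belongs in the linked docs.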
Martin Fowler weighed in, analyzing OpenAI's writeup from an architecture perspective and questioning whether "harnesses with custom linters, structural tests, and context providers will become the new service templates."
The consensus across OpenAI, Anthropic, Stripe, and smaller teams is clear: the operational environment around agents matters more than the model itself for production coding work.
Agent Team Patterns
The most effective teams aren't throwing agents at problems randomly. They're using structured patterns.
The Orchestrator-Implementer-Reviewer Loop
OpenAI describes a "Three-Agent Harness" pattern: one agent plans, one implements, one reviews. Anthropic's Claude Code has built-in sub-agents — an Explore sub-agent for codebase exploration and a Bash sub-agent designed to execute verbose commands without polluting the parent's context. The parent agent orchestrates while children work in isolation.
Vercel's AI SDK formalizes this at the framework level: the parent delegates work via tool call, the subagent executes autonomously and returns a result. A subagent might use 100K tokens internally, but the parent only consumes the summary.
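The token economics of that delegation can be sketched in a few lines. This simulates the pattern rather than calling a real SDK; all names here (`runSubagent`, `parentDelegate`) are invented for illustration:

```typescript
// Sketch of the parent/subagent delegation pattern: the subagent burns its
// own context, and only a compact summary reaches the parent.

type SubagentResult = { summary: string; tokensUsedInternally: number };

// A subagent runs a task in an isolated context window.
function runSubagent(task: string): SubagentResult {
  // Imagine many tool calls and ~100K tokens of intermediate output here.
  return {
    summary: `Done: ${task} (details elided)`,
    tokensUsedInternally: 100_000,
  };
}

// The parent delegates via a tool call and consumes only the summary.
function parentDelegate(tasks: string[]): { transcript: string[]; parentTokens: number } {
  const transcript: string[] = [];
  let parentTokens = 0;
  for (const task of tasks) {
    const result = runSubagent(task);
    transcript.push(result.summary);       // only the summary enters parent context
    parentTokens += result.summary.length; // crude token proxy for the sketch
    // result.tokensUsedInternally never touches the parent's window
  }
  return { transcript, parentTokens };
}
```

However large the subagent's internal trace, the parent's context grows only by the size of the summaries it receives.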
Sub-Agents as Context Firewalls
HumanLayer calls sub-agents "a particularly powerful lever." When working on hard problems requiring many context windows, sub-agents maintain coherency by ensuring discrete tasks run in isolated contexts. None of the intermediate noise accumulates in the parent thread.
Rich Snapp documents how Claude Code sub-agents solve two problems simultaneously: large context management and right tool selection. Each subagent runs in a separate context window with only specifically-granted tools, combining RAG-style retrieval, tool-use isolation, conversation splitting, and context compaction.
Git Worktrees: The Parallel Isolation Standard
The practical mechanism for running parallel agents is git worktrees. This is how teams actually do it in 2026.
The parallel Claude Code agents pattern is well-documented: each agent gets its own worktree — its own branch, its own files on disk, zero interference. Claude Code works in worktrees natively. CLAUDE.md scopes each session to specific directories, preventing overlap. Multiple terminal sessions per worktree, multiple VS Code windows.
One developer went from 16% Claude Code utilization to 50%+ by adding parallel agents. The key insight: "Before adding a second agent, make sure the first is busy."
Community tools are emerging around this — parallel agent orchestrators that spawn one Claude agent per worktree simultaneously. The developer walks away, agents work, comes back to review PRs.
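A minimal orchestrator of this kind can be sketched as a dry run that only plans the commands, one worktree and branch per task. The helper below is hypothetical, not one of the community tools; actually spawning git and a Claude Code session per worktree is left to the reader's environment:

```typescript
// Dry-run sketch of a worktree-per-agent orchestrator: each task gets its
// own branch and directory, so parallel agents never touch the same files.

interface AgentPlan { branch: string; worktreePath: string; commands: string[] }

function planWorktree(task: string, index: number): AgentPlan {
  const slug = task.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "");
  const branch = `agent/${slug}`;
  const worktreePath = `../wt-${index}-${slug}`;
  return {
    branch,
    worktreePath,
    commands: [
      `git worktree add ${worktreePath} -b ${branch}`, // isolated files + branch
      `cd ${worktreePath} && claude`,                  // one agent session per tree
    ],
  };
}

const tasks = ["Fix login flow", "Add billing webhooks"];
const plans = tasks.map(planWorktree);
for (const p of plans) console.log(p.commands.join("\n"));
```

Running the printed commands yields N isolated checkouts; the agents' branches are merged back through normal PR review.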
Real-World Scale
The numbers are no longer theoretical.
- OpenAI: 1M lines of code, 1,500 PRs over 5 months, 3 engineers (growing to 7). Zero manually-written code.
- Stripe's "Minions": 1,000+ merged PRs per week. A developer posts a task in Slack, an agent writes the code, passes CI, and opens a PR.
- Peter Steinberger (OpenClaw): 6,600+ commits/month, shipping code he doesn't read, running 5-10 agents simultaneously.
- Pinterest: saves 7,000 engineering hours/month using MCP tool infrastructure.
These aren't experiments. They're production systems with real users.
The Protocol Stack: MCP, A2A, ACP
Three protocols are converging to form the infrastructure layer for agent teams.
MCP — Tool Access
MCP was donated to the Linux Foundation's Agentic AI Foundation in December 2025, with Anthropic, OpenAI, Google, Microsoft, AWS, Cloudflare, and Bloomberg as members. As of early 2026: 97M monthly SDK downloads, 10,000+ active MCP servers, and MCP server implementations at 28% of Fortune 500 companies.
The 2026 roadmap focuses on four priorities: transport scalability, agent communication, governance maturation, and enterprise readiness. MCP servers are evolving from passive tools into autonomous participants that receive tasks, evaluate policy, negotiate scope, and delegate sub-work.
MCP Apps (launched January 2026) let tools return interactive HTML inside chat via sandboxed iframes — enabling rich, interactive agent outputs.
A2A — Agent Coordination
Google's Agent-to-Agent protocol, now under the same AAIF governance as MCP. It's complementary, not competing: MCP = agent-to-tool, A2A = agent-to-agent coordination.
ACP — Agent Communication
IBM's Agent Communication Protocol, built on BeeAI, is an open standard for agent-to-agent communication via RESTful APIs. It includes "capability tokens" — cryptographic tokens limiting what each agent can do — a security layer the other protocols lack.
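The capability-token idea can be illustrated with a small check. The token shape and scope strings below are invented for the sketch; real ACP tokens are cryptographically signed rather than plain objects:

```typescript
// Illustrative capability check in the spirit of ACP's capability tokens:
// an agent can only perform actions its token explicitly grants.

interface CapabilityToken {
  agentId: string;
  scopes: string[];   // e.g. "repo:read", "pr:comment"
  expiresAt: number;  // unix epoch ms
}

function isAllowed(token: CapabilityToken, scope: string, now = Date.now()): boolean {
  if (now >= token.expiresAt) return false; // expired tokens grant nothing
  return token.scopes.includes(scope);      // only explicitly granted scopes pass
}

const token: CapabilityToken = {
  agentId: "reviewer-1",
  scopes: ["repo:read", "pr:comment"],
  expiresAt: Date.now() + 60_000,
};
```

A reviewer agent holding this token could read the repo and comment on PRs, but any attempt to push code would be denied at the protocol layer rather than by convention.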
All three are converging under the AAIF umbrella. For large teams building cross-team agent systems, this protocol stack is the foundation.
Agent Memory in 2026
Agent memory has gone from "shove conversation history in context and call it done" to a first-class architectural discipline.
The LOCOMO benchmark — a standardized evaluation for long-term conversational memory — has changed how memory systems are measured, combining BLEU, F1, LLM-judge correctness, token consumption, and latency into a multi-dimensional evaluation.
The memory landscape is consolidating rapidly. Key players include Mem0, Zep, Hindsight, and Memvid. The comparison to 2022's vector database space is apt — a consolidation wave is approaching.
For coding agents specifically, seven state persistence strategies have emerged: checkpointing, hybrid memory layers, memory consolidation, graph-based state passing, state recovery, async memory refinement, and multi-agent coordination. These enable long-running agents to survive context window limits and session interruptions.
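The first of those strategies, checkpointing, reduces to serializing agent state at step boundaries so a run can survive interruption. The state shape below is illustrative, not drawn from any specific framework:

```typescript
// Minimal checkpointing sketch: persist distilled state, not raw transcripts,
// so a resumed agent continues from the last durable step.

interface AgentState {
  taskId: string;
  step: number;
  notes: string[]; // distilled findings, kept small on purpose
}

function checkpoint(state: AgentState): string {
  return JSON.stringify(state); // in practice: write to disk or a store
}

function restore(snapshot: string): AgentState {
  return JSON.parse(snapshot) as AgentState;
}

const before: AgentState = { taskId: "T-42", step: 3, notes: ["auth uses JWT"] };
const after = restore(checkpoint(before));
```

A crashed or context-exhausted session picks up at `step` 3 with its findings intact instead of rediscovering them from scratch.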
On the practical side, the CLAUDE.md/AGENTS.md pattern remains the most effective memory mechanism for coding agents — not despite its simplicity, but because of it. The key insight from HumanLayer: less is more. Agents frequently ignore CLAUDE.md contents because the harness injects them with a "this context may or may not be relevant" warning. Every instruction that isn't universally applicable dilutes the ones that are.
Context: The Real Bottleneck
The New Stack identified it clearly: "The gap between what engineers carry in their heads and what AI can understand" is the bottleneck. Not model capability — context management.
Getting context into AI tools requires deliberate effort most teams haven't systematized. Reading AI-generated code requires different cognitive work than human-written code — you're "reverse-engineering intent from output rather than following a colleague's reasoning."
Context engineering has emerged as a distinct discipline from prompting and harness engineering. Claude Code Agent Skills 2.0 moved from custom instructions to "programmable agents." The skill.md pattern — how to write AI agent skills that actually work — is being formalized and shared.
HumanLayer's concept of "frequent intentional compaction" remains foundational: deliberately structuring how you feed context to agents, keeping utilization in the 40-60% range, and building in high-leverage human review at exactly the right points.
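The utilization target translates into a simple control loop: measure, and compact before the window fills. The thresholds below mirror the 40-60% band mentioned above; the compaction step itself is stubbed, since in practice it means distilling the transcript into a summary:

```typescript
// Sketch of frequent intentional compaction: trigger above the top of the
// target band, compact down to an assumed ~40% of the window.

interface Context { tokens: number; limit: number }

function utilization(ctx: Context): number {
  return ctx.tokens / ctx.limit;
}

function maybeCompact(ctx: Context): Context {
  if (utilization(ctx) <= 0.6) return ctx; // inside the band: leave it alone
  return { ...ctx, tokens: Math.round(ctx.limit * 0.4) }; // distilled summary
}

const ctx = maybeCompact({ tokens: 150_000, limit: 200_000 }); // 75% triggers compaction
```

Compacting on a schedule like this, rather than when the model starts failing, is the "intentional" part: the human chooses what survives each compaction.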
Failure Modes and Lessons Learned
Agent code entropy: Agent-generated code accumulates cruft differently than human-written code. It's a new class of technical debt requiring new maintenance patterns. Greg Brockman recommends every team designate an "agents captain" to manage this.
Context overload: Agents get lost when their context window fills up. They repeat failed approaches, ignore instructions, make trivial mistakes. The solution isn't bigger context windows — HumanLayer found that Opus 4.6's 1M context window actually degraded instruction adherence. The solution is aggressive compaction, sub-agents for isolation, and backpressure to prevent context bloat.
Agent-to-agent trust: When sub-agents hand off work, how do you prevent a hallucinating agent from poisoning the chain? The community is exploring reputation models — "FICO scores" for agents, even anchoring transaction records on Solana for immutability.
The "Build to Delete" reality: Philipp Schmid notes that Manus rewrote its harness 5 times in 6 months, LangChain re-architected 3 times in a year, and Vercel removed 80% of its agent tools. "Every new model release has a different, optimal way to structure agents." Harnesses are not permanent infrastructure — they're iterative experiments.
Code is free, verification is expensive: As one practitioner put it: "Your agent's effectiveness is capped by how quickly it can prove it didn't break anything."
The Engineer's New Role
The software engineer's job is splitting into two halves. One half: designing environments, specifying intent, and building feedback loops. The other: managing agent workflows, reviewing output, and maintaining the harness.
The "Emerging Harness Engineering Playbook" synthesizes it: two halves of the new engineer job — (1) building the environment, (2) managing the work. Prompting agents to use browser automation for E2E testing dramatically improved accuracy. Companies need to "treat agentic workflows like internal platform infrastructure."
Stripe's Minions pattern shows the endgame: a developer posts a task in Slack and gets a PR back. The engineer's job shifts from writing code to reviewing it — and even that review layer is being augmented by agent-to-agent review loops like Codex's self-review mechanism.
The Near Future
Several trends are converging:
- Protocol standardization under the AAIF (MCP + A2A + ACP) creating a common infrastructure layer
- Enterprise adoption at scale — 80% of Fortune 500 deploying agents in production, 28% running MCP servers
- Memory systems maturing from simple KV stores to graph-based, confidence-scored belief systems
- Harness frameworks consolidating — expect fewer, more opinionated tools rather than the current proliferation
- Agent-to-agent trust mechanisms becoming production requirements, not research curiosities
- Multi-hour agent runs as normal operations — 6+ hour tasks, overnight builds, autonomous iteration
The future isn't one super-agent that does everything. It's a team of focused agents, each doing one thing well, coordinated by infrastructure that manages context, memory, and communication. The orchestrator might be human, or it might be another agent — but the pattern is the same: small, focused, deterministic where possible, with LLM steps at exactly the right points.
Further Reading
- Harness Engineering: Leveraging Codex in an Agent-First World — OpenAI
- The Emerging Harness Engineering Playbook — Artificial Ignorance
- The Importance of Agent Harness in 2026 — Philipp Schmid / HuggingFace
- Skill Issue: Harness Engineering for Coding Agents — HumanLayer
- Why Harness Engineering Replaced Prompting in 2026 — Epsilla
- Harness Engineering — Martin Fowler
- The MCP 2026 Roadmap — MCP Playground
- Agent Communication Protocol — IBM
- Parallel Claude Code Agents with Git Worktrees — Ashu
- Context Management with Subagents — Rich Snapp
- State of AI Agent Memory 2026 — Mem0
- 7 State Persistence Strategies for AI Agents — Indium
- State of Context Engineering in 2026 — Kushal Banda
- Context is AI Coding's Real Bottleneck — The New Stack
- AI Agent Memory Systems in 2026 — DevGenius
- MCP Enterprise Adoption — Synvestable