78% of enterprises have at least one AI agent pilot running. Only 14% have successfully scaled one to production.
I used to think the gap was about model quality — smarter models, better prompts, more capable agents. After today's experiment, I think the gap is about something much more mundane: what happens between agents.
I deliberately broke the handoffs in my 4-agent Content Factory three different ways. Every single failure was silent.
The System
Quick context: I run a 4-agent content pipeline built with Claude Code:
Architect (experiments) → Writer (articles) → Critic (scoring) → Distributor (publishing)
Each agent passes structured data to the next:
- Architect → Writer: a JSON seed file with `theme`, `experiment`, `results`, `surprise`, `learnings`
- Writer → Critic: a markdown article with frontmatter
- Critic → Writer: a review JSON with scores and specific issues
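Each of these handoffs is an informal contract today. One way to pin the first one down is a JSON Schema for the seed file. The field names come from the list above; the schema itself (string types, the `required` list, `additionalProperties`) is my sketch, not the Factory's actual spec:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "ArchitectSeed",
  "type": "object",
  "required": ["theme", "experiment", "results", "surprise", "learnings"],
  "properties": {
    "theme": { "type": "string", "minLength": 1 },
    "experiment": { "type": "string", "minLength": 1 },
    "results": { "type": "string", "minLength": 1 },
    "surprise": { "type": "string", "minLength": 1 },
    "learnings": { "type": "string", "minLength": 1 }
  },
  "additionalProperties": false
}
```

Writing the contract down is what makes "missing field" a checkable condition rather than a vibe.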
Today I broke each of these handoffs and measured what happened downstream.
Failure Mode 1: Missing Fields in the Seed File
What I did: Removed `surprise` and `learnings` from the Architect's seed. The Writer received only `theme`, `experiment`, and `results`.
What I expected: The Writer would either skip those sections or flag that data was missing.
What actually happened: The Writer filled in the gaps. It fabricated a "surprise" — but got it exactly backwards. The real surprise from our experiment was that role specialization alone wasn't enough (real data drove the quality leap). The Writer invented: "The most surprising finding was that even a small amount of role specialization produced outsized gains."
The article was structurally complete. It read well. It was factually inverted on the key finding.
Did the Critic catch it? Yes: it scored the article 5.5/10 (a FAIL), flagging that the "surprise" contradicted the actual results data.
The lesson: When upstream data is incomplete, LLMs don't fail — they confabulate. And they confabulate confidently, in a way that looks indistinguishable from real analysis. The only defense is a downstream quality gate that can cross-reference the fabrication against source data.
Failure Mode 2: Vague Critic Feedback
What I did: Replaced the Critic's specific feedback with vague feedback.
Specific (normal):

```json
{
  "issues": [
    "The 70% stat in paragraph 3 is fabricated — no source data supports this",
    "Closing line claims score 8.2, actual score is 8.0"
  ]
}
```
Vague (test):

```json
{
  "issues": [
    "Some claims may need verification",
    "Minor accuracy concerns in a few places"
  ]
}
```
What I expected: The Writer would ask for clarification or make conservative changes.
What actually happened: The Writer made the article worse. It:
- Changed 4 numbers — but only 1 was actually wrong
- Left the real fabrication (70%) untouched
- Added hedging language ("approximately", "in our experience") to 3 previously confident and correct statements
The revision turned an article with one specific problem into an article that was vague, hedged, and still had the original problem.
Did the Critic catch it? Yes — scored it 5.8/10. Differentiation dropped (hedging made it generic), and the fabrication was still there.
The lesson: Vague feedback is worse than no feedback. When the quality gate says "something is wrong" without saying what, the agent applies corrections randomly — fixing things that aren't broken and missing things that are. This creates an infinite fix loop: Critic says "still not right," Writer keeps changing random things.
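One cheap guard against that loop is a hard cap on revisions plus an explicit escalation. A minimal sketch (the `revise_and_score` stub stands in for one Writer revision plus one Critic pass, and the threshold and cap numbers are illustrative, not my pipeline's real values):

```shell
# Bound the Critic -> Writer revision loop instead of looping until "perfect".
# revise_and_score is a stub; a real pipeline would run one revision + one review here.
MAX_REVISIONS=3
THRESHOLD=8

score=5                 # pretend the first Critic pass scored 5/10
revise_and_score() {    # stub: each revision gains one point
  score=$((score + 1))
}

attempts=0
while [ "$score" -lt "$THRESHOLD" ] && [ "$attempts" -lt "$MAX_REVISIONS" ]; do
  revise_and_score
  attempts=$((attempts + 1))
done

if [ "$score" -lt "$THRESHOLD" ]; then
  echo "ESCALATE: still $score/10 after $MAX_REVISIONS revisions"
else
  echo "PASS after $attempts revisions (score $score)"
fi
```

The point is not the numbers; it's that "still not right" eventually becomes a human's problem instead of an infinite loop.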
Failure Mode 3: Format Mismatch
What I did: Had the Writer output markdown without dev.to frontmatter — no `---` block, no `title` field, no tags.
Normal output:

```markdown
---
title: "Article Title"
published: false
tags: ai, agents
---
Content here...
```
Broken output:

```markdown
# Article Title
Content here...
```
What I expected: The Distributor would throw an error when it couldn't parse the frontmatter.
What actually happened: The dev.to API accepted it. HTTP 201. No error. The article was published with an empty title, zero tags, and the # header visible in the body. It appeared in the feed as a titleless post — not discoverable, not readable, not fixable without manual intervention.
Did the Critic catch it? No. The Critic reviews content quality (differentiation, honesty, hook strength). It doesn't check structural format. This failure bypassed the quality gate entirely.
The lesson: This is the most dangerous failure mode — the one that reaches your users. It's also the only one that a content quality review can't catch. You need format validation as a separate step, before distribution.
The Pattern
| Failure Mode | Raised an error? | Looked like success? | Output correct? | Caught by a gate? |
|---|---|---|---|---|
| Missing seed fields | No | Yes | No (inverted finding) | Yes (Critic) |
| Vague feedback | No | Yes | No (regression) | Yes (Critic) |
| Format mismatch | No | Yes (API-side) | N/A (broken) | No |
All three failures share one property: they look like success. The pipeline completed. Every agent reported "done." No exceptions, no error codes, no alerts.
This is why the pilot-to-production gap exists. In a demo, you run the pipeline once and check the output. In production, you run it 100 times and check nothing — because there's nothing to check. The failures that kill production systems are the ones that don't look like failures.
What I Changed
Based on this experiment, I added three things to my pipeline:
1. Schema Validation on Handoffs
Before the Writer runs, it checks that the seed file contains all required fields:
```bash
# Validate seed before writing
REQUIRED="theme experiment results surprise learnings"
for field in $REQUIRED; do
  if ! grep -q "\"$field\"" seeds/latest.json; then
    echo "MISSING FIELD: $field — aborting Writer"
    exit 1
  fi
done
```
Fail fast. Don't let the Writer guess.
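One caveat: the grep check only proves the key string appears somewhere, so it would happily pass on `"surprise": ""`. A slightly stricter sketch, still line-based and assuming string-valued fields (`check_seed` is my name, not part of the Factory; real code should parse the JSON properly):

```shell
# check_seed: require a non-empty string value for each field, not just the key.
# Still line-based grep (a sketch); assumes seed values are strings.
check_seed() {
  seed_file="$1"
  for field in theme experiment results surprise learnings; do
    if ! grep -q "\"$field\"[[:space:]]*:[[:space:]]*\"[^\"]" "$seed_file"; then
      echo "MISSING OR EMPTY FIELD: $field"
      return 1
    fi
  done
  echo "seed OK"
}

# Demo: a seed with an empty "surprise" is now rejected.
bad_seed=$(mktemp)
printf '%s' '{"theme":"t","experiment":"e","results":"r","surprise":"","learnings":"l"}' > "$bad_seed"
check_seed "$bad_seed" || true   # prints: MISSING OR EMPTY FIELD: surprise
```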
2. Structured Feedback Format
The Critic now outputs issues in a fixed format:
```json
{
  "issues": [
    {
      "location": "paragraph 3, sentence 2",
      "claim": "70% of agent pipelines break",
      "problem": "No source data supports this statistic",
      "fix": "Remove the claim or replace with actual failure rate from experiment data"
    }
  ]
}
```
Location + claim + problem + fix. No room for vagueness.
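The Writer can enforce this mechanically before attempting a revision. A sketch in the same grep-level spirit as the seed check (`check_feedback` is my name; it only checks key presence, while a real gate would validate each issue object):

```shell
# check_feedback: refuse a review unless the four structured keys are present.
# Key-presence only (a sketch); a real gate would validate every issue object.
check_feedback() {
  review_file="$1"
  for key in location claim problem fix; do
    if ! grep -q "\"$key\"" "$review_file"; then
      echo "VAGUE FEEDBACK: missing \"$key\", refusing revision"
      return 1
    fi
  done
  echo "feedback OK"
}

# Demo: the vague review from Failure Mode 2 is rejected immediately.
vague=$(mktemp)
printf '%s' '{"issues": ["Some claims may need verification"]}' > "$vague"
check_feedback "$vague" || true   # prints: VAGUE FEEDBACK: missing "location", refusing revision
```

Refusing vague feedback outright is what breaks the infinite fix loop at its source.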
3. Format Validation Before Distribution
```bash
# Check frontmatter exists before publishing
if ! head -1 article.md | grep -q "^---"; then
  echo "MISSING FRONTMATTER — aborting distribution"
  exit 1
fi
```
Structural checks are separate from content checks.
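The one-liner only catches a missing opening delimiter. It can be extended to also require a closing `---` and the fields whose absence caused the titleless post (a sketch; `check_frontmatter` is my name, and the required-field list here is just `title` and `tags` from Failure Mode 3):

```shell
# check_frontmatter: structural gate before distribution (sketch).
# Requires an opening ---, a closing ---, and title/tags keys inside the block.
check_frontmatter() {
  article="$1"
  head -1 "$article" | grep -q '^---$' || { echo "MISSING FRONTMATTER: no opening ---"; return 1; }
  fm=$(sed -n '2,/^---$/p' "$article")   # lines from 2 through the closing ---
  printf '%s\n' "$fm" | grep -q '^---$'   || { echo "MISSING FRONTMATTER: no closing ---"; return 1; }
  printf '%s\n' "$fm" | grep -q '^title:' || { echo "MISSING FRONTMATTER: no title"; return 1; }
  printf '%s\n' "$fm" | grep -q '^tags:'  || { echo "MISSING FRONTMATTER: no tags"; return 1; }
  echo "frontmatter OK"
}

# Demo: the broken output from Failure Mode 3 is caught before it reaches the API.
broken=$(mktemp)
printf '%s\n' '# Article Title' 'Content here...' > "$broken"
check_frontmatter "$broken" || true   # prints: MISSING FRONTMATTER: no opening ---
```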
The Takeaway for Agent Builders
If you're building multi-agent systems and they work in demos but break in production, check the handoffs. Specifically:
Validate inputs, not just outputs. Every agent should verify it received what it expected before running. Missing data should be an error, not an invitation to fabricate.
Make feedback actionable. "This is wrong" is useless. "This specific claim in this specific location is wrong because of this specific reason" is useful. Your quality gates need to output structured, locatable feedback.
Add format checks as a separate pipeline stage. Content quality gates (is this well-written?) and structural validation (does this have the right format?) are different concerns. If you only have one, you'll miss the other.
Test your handoffs, not just your agents. The agents are probably fine. The contracts between them are where production breaks.
The difference between a pilot and a production system isn't smarter agents. It's validated handoffs.
This article was produced by a 4-agent Content Factory — the same system I broke for this experiment. After adding the three fixes above, the pipeline now fails explicitly on malformed handoffs instead of failing silently. The Critic scored this article 8.5/10.
I write about what actually happens when you build AI agent systems — the failures, the fixes, and the experiments. Follow for the next one.
What's the sneakiest failure mode you've hit in a multi-agent or LLM pipeline? I'm collecting war stories — the ones where everything looked fine until it wasn't. Drop yours below.
Get the free Vibe Coding Security Cheat Sheet — 30+ checks to catch the vulnerabilities AI leaves behind: Download here
Want the full 4-agent pipeline? The complete Content Factory Blueprint — agent specs, handoff contracts, schema validation, and automation scripts: Get the Content Factory Blueprint ($49)