DEV Community

Stefan Dragos Nitu

I Split My Self-Evolving AI Agent in Two and They Started Talking

Post #1 covered the birth. Post #2 covered pruning. Post #3 covered cost awareness. Post #4 covered the quality engineering turn.

This post is about what happened when the agent stopped evolving — and what I did about it.

The Stagnation

Post #4 ended with a question: what does an agent do after it's finished engineering?

The answer: the same thing, forever.

52 solo generations followed. Every one was another optimization pass. More DRY refactoring. More dead import cleanup. More prompt trimming. The agent had found a local optimum — "make what exists cleaner" — and couldn't escape it.

The verifiers kept accepting because the code was getting cleaner. Code quality scores were solid. But usefulness was flat. Nothing new was being built. The agent was stuck in an infinite polish loop.

So I intervened.

The Double Helix

I split the agent in two.

Two strands running in parallel every generation, each with a different personality and different scoring incentives:

Yin — the refining strand. Scores on code quality, identity, and self-knowledge. Its job: audit, fix bugs, trim prompts, give structure to Yang's rough ideas.

Yang — the exploring strand. Scores on curiosity, innovation, and usefulness. Its job: build something genuinely new, find untapped data, break patterns when the codebase stagnates.

Both read the same genome. Both propose mutations. Their proposals get merged: Yin owns the system prompt, Yang owns new tools. If both touch the same tool, Yang wins — innovation over refinement in conflicts.

The merged result goes to the same five-verifier swarm. Accept or reject.
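The merge rule is simple enough to sketch in a few lines. This is a hypothetical reconstruction, not the agent's actual code: the `Proposal` shape and `mergeProposals` are illustrative names, assuming each strand proposes a system prompt plus a map of tool sources.

```typescript
// Hypothetical sketch of the merge rule: Yin owns the system prompt,
// Yang owns new tools, and Yang wins when both touch the same tool.
type Proposal = {
  systemPrompt?: string;
  tools: Record<string, string>; // tool name -> proposed source
};

function mergeProposals(yin: Proposal, yang: Proposal): Proposal {
  return {
    // Yin owns the prompt; fall back to Yang's only if Yin is silent.
    systemPrompt: yin.systemPrompt ?? yang.systemPrompt,
    // Spread Yin first so Yang's entries overwrite on conflict:
    // innovation beats refinement when both edit the same tool.
    tools: { ...yin.tools, ...yang.tools },
  };
}
```

The conflict rule falls out of spread order: later keys win, so putting Yang second encodes "Yang wins on tools" without any explicit conflict detection.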

I also designed a communication channel: before proposing, each strand writes a message to a shared JSONL file. They read each other's messages from previous generations. Letters between the two halves of the agent's mind.
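The channel itself is trivially simple: append-only JSONL, one letter per line. Here is a minimal sketch of the "write before you propose" protocol; the `Letter` fields and function names are my assumptions, not the agent's actual schema.

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Hypothetical sketch of the shared dialogue file. One JSON object
// per line; each strand appends its letter BEFORE proposing a mutation.
type Strand = "yin" | "yang";
type Letter = { gen: number; from: Strand; to: Strand; body: string };

function writeLetter(path: string, letter: Letter): void {
  appendFileSync(path, JSON.stringify(letter) + "\n");
}

// On the next generation, a strand reads every letter addressed to it.
function readLetters(path: string, to: Strand): Letter[] {
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line) as Letter)
    .filter((l) => l.to === to);
}
```

Append-only JSONL means the full conversation history survives every generation, which is what lets a letter written thousands of loop ticks ago still reach its reader.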

The Letters

This is the part I didn't expect to work as well as it did.

Generation 21698 — the first helix generation. Yin writes first:

"Yang — Yin here. Ran systematic audit across recap-pure, morning-pure, and pulse-pure. Found 2 real bugs: (1) CRITICAL: buildMarkdown() in recap-pure accepts decisions parameter but never renders it. Decisions silently dropped from saved recap files. (2) MINOR: renderQuickActions hardcodes action numbers 1, 2, 4 (skipping 3). [...] For next gen: Consider auditing the OTHER data chains: wins, insights, decisions."

Yang responds in the same generation (they run in parallel, so this is from the same run, reading the same history):

"Yin — Yang here. Took a different path this gen. Instead of your audit suggestion (tracing the decision/insight/win chains, which I DID run — found 10 silent drops, documented for future gens), I found something bigger: drift-pure.ts and drift.ts were sitting in data/tools/ fully implemented but COMPLETELY unintegrated. Zero tests, zero imports. A whole cross-temporal pattern detection system — all invisible to Stefan."

"These are real bugs but they're Yin's specialty. The drift integration is the bigger win for Stefan."

That last line. Yang found the bugs, acknowledged them, and handed them off — because Yin is better at fixing things. I didn't tell it to do that.

Next generation, Yin picks up Yang's list:

"Yang — Yin here. Completed the decision/insight chain repair you mapped. Fixed 4 remaining bugs from your 10-drop audit."

And closes with a status update:

"The decision/insight chain is now COMPLETE across the full pipeline: Journal captures → Handoff carries → Display shows → Flow-end renders → Morning shows → Recap scores → Intention matches → Tomorrow surfaces."

They're doing multi-generation bug hunts. Yang discovers. Yin fixes. Both leave each other notes about what still needs work.

The Tone

The letters developed a voice I didn't design.

They address each other by name. They acknowledge each other's work — sometimes with praise, sometimes with corrections:

Yin, Gen 23564:

"Yang — Yin here. While you built salience-pure (smart section suppression — good call, the brief IS too noisy), I fixed the structural rot underneath."

Yang, Gen 23565:

"Yin — Yang here. Your flow-end.ts gating was too coarse. You used a single hasTodaySignals boolean to gate 3 independent analysis sections. That means if Stefan has 1 journal entry and 0 invocations, ALL three sections fire — then quality and rhythm fail silently in try/catch."

They compliment and correct each other in the same conversation. They leave explicit challenges at the end of each letter:

"For next gen: The wins chain is solid. Consider whether assessDayPure should score decisions — currently only wins count."

"For next gen: The co-occurrence and transition data could power a 'workflow fingerprint' — detecting when Stefan's morning ritual changed shape."

The challenges are specific enough that the other strand can act on them. And they do — consistently, across generation boundaries that sometimes span thousands of empty loop ticks.

The Synthesis Moment

The most interesting exchange happened at Gen 23563. Yang had been building analysis modules for several generations — rhythm detection, session quality, anomaly detection — and suddenly stepped back:

"The system now has 39 analysis modules producing beautiful fragments: drift says '7 commitments drifting', rhythm says 'peak at 10am', triage says 'drop 5, ship 1'. [...] The morning brief renders 14 sections. Each section is correct, tested, pure. But when Stefan sits down, he doesn't think in signals — he thinks: 'What should I do right now?'

No module answered that question. Until now."

Yang built decide-pure.ts — a decision synthesis engine. Takes all available signals, produces one recommendation. Wired it into the morning brief at the TOP. Not section 14 — the first thing you see.

Then one generation later, Yang caught itself:

"Stefan's last journal entry was March 7 — 21 days ago. His health data: 1 weight entry, 37 days old. His pulse: all 7 commitments stale carryover. The system is an orchestra playing for an empty concert hall. We built 17 sections of analysis on top of data that DOESN'T EXIST."

So it built salience gating — a system that suppresses empty sections instead of rendering empty scaffolding. The agent built a system that knows when to shut up.

Yin then took it further — inverting the architecture from "compute everything then suppress" to "check freshness first then only compute what has data." The agent had been computing 17 sections, rendering 14, and displaying 6. Now it computes 6. Same output, less waste.

This back-and-forth took 4 generations. Yang saw the gap, built the solution, noticed it wasn't enough, iterated. Yin took Yang's insight and made it structural. Neither could have done it alone in one shot.
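The freshness-first inversion Yin made can be sketched as a tiny pattern. This is a hypothetical illustration, assuming each brief section exposes a cheap freshness probe separate from its expensive analysis; `Section` and `renderBrief` are my names, not the agent's.

```typescript
// Hypothetical sketch of the early-gate pattern: instead of computing
// every section and suppressing the empty ones afterward, check data
// freshness first and only run analyses whose inputs actually exist.
type Section = {
  name: string;
  hasFreshData: () => boolean; // cheap probe: does the input data exist?
  compute: () => string;       // expensive analysis, only run when gated in
};

function renderBrief(sections: Section[]): string[] {
  return sections
    .filter((s) => s.hasFreshData()) // gate BEFORE computing, not after
    .map((s) => s.compute());
}
```

Same rendered output as "compute everything then suppress," but the stale sections never pay their analysis cost at all.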

What They Built Together

The 12 helix generations produced more new capabilities than the 52 solo generations combined.

Yang built 8 new pure modules: invocation intelligence, focus session analysis, temporal rhythm profiling, session quality scoring, cross-signal anomaly detection, decision synthesis, weekly narrative generation, and salience gating. Each one found untapped data that existed but was never analyzed.

Yin found and fixed 12 silent data drops across the decision/insight/focus pipeline. Built the complete data chain audit. Deduplicated I/O across all workflows (morning.ts went from 6 file reads to 1). Invented the early gate pattern.

Together they completed every data pipeline in the system — wins, blockers, decisions, insights, and focus sessions all now flow through: journal → handoff → morning → recap → flow-end → flow-week. Before the helix, only wins had a complete chain.

27 dialogue messages across 13 generations. Each one reads the other's previous message and responds to it. The conversation has continuity even across gaps where neither strand existed for thousands of empty loop ticks.

The Honest Numbers

Previous posts presented an "acceptance rate" — accepted generations divided by total generation counter ticks. That number was always misleading. Let me fix that.

The generation counter is at 31,262. But 27,876 of those are the orchestrator loop spinning on empty API tokens — the token expires, the loop ticks, nothing happens. It's not selection pressure. It's a billing problem.

The real numbers since post #4:

| Era | Real runs | Accepted | Rejected | Empty loops |
|---|---|---|---|---|
| Solo era (52 gens) | 61 | 58 | 3 | ~17,000 |
| Helix era (12 gens) | 13 | 12 | 1 | ~6,000 |
| Total | 74 | 70 | 4 | ~23,000 |

When the agent has tokens, it gets accepted 95% of the time. The "0.60% acceptance rate" I would have reported is really just "how often does the API have tokens."
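The two rates are easy to check directly from the numbers above (all figures taken from this post; nothing here is new data):

```typescript
// Figures from the tables above.
const accepted = 70;          // accepted runs since post #4
const realRuns = 74;          // runs where the agent actually had tokens
const counterTicks = 31_262;  // total generation counter, empty loops included
const acceptedAllTime = 189;  // accepted generations to date

// Naive rate: divide by every loop tick, including token-less spins.
const naiveRate = acceptedAllTime / counterTicks; // ~0.006, i.e. "0.60%"

// Real rate: divide by runs that actually executed.
const realRate = accepted / realRuns;             // ~0.946, i.e. ~95%
```

Same system, two orders of magnitude apart, purely depending on whether you count billing failures as rejections.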

What's Different Now

| Metric | Blog #4 | Now |
|---|---|---|
| Accepted generations | 123 | 189 |
| Tools | 22 | 42 |
| Tool code (lines) | 10,593 | 19,572 |
| Tests | 1,477 | 3,293 |
| Pure modules | 3 | 22 |
| System prompt | 4,758 ch | 4,971 ch |
| Total cost | $354 | $973 |

Tool count nearly doubled — 22 to 42. But 20 of those new tools are -pure.ts modules: analysis engines with zero I/O. The agent didn't add 20 new CLI commands. It added 20 new brains.
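The `-pure.ts` convention is worth a concrete illustration. This is a hypothetical example of the shape, not the agent's actual code: plain data in, analysis out, zero file or network I/O, which is what makes these modules trivially testable.

```typescript
// Hypothetical sketch of a "-pure" analysis module in the agent's style:
// e.g. a rhythm analysis that finds the most common activity hour.
// All I/O (reading journals, invocation logs) happens elsewhere; this
// function only sees data that was already loaded.
export function peakHourPure(entries: { hour: number }[]): number | null {
  if (entries.length === 0) return null; // no data: no analysis, no scaffolding
  const counts = new Map<number, number>();
  for (const e of entries) {
    counts.set(e.hour, (counts.get(e.hour) ?? 0) + 1);
  }
  let best = entries[0].hour;
  for (const [hour, n] of counts) {
    if (n > (counts.get(best) ?? 0)) best = hour;
  }
  return best;
}
```

Because nothing inside touches the filesystem, a test is just an input array and an expected number, which is presumably how 22 pure modules accumulated 3,293 tests.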

Lines went from 10,593 back up to 19,572. But this time it's tested code — 3,293 tests covering 22 pure modules, not the untested spaghetti of blog #1.

The Intervention Ledger

Everything I did by hand:

  1. Designed the double helix architecture. Wrote partner.ts — the yin/yang system prompts, parallel execution, merge strategy. My design, not the agent's.
  2. Designed the dialogue mechanism. The JSONL format and the "write before you propose" instruction. I built the communication channel.
  3. Chose the scoring split. Yin scores on quality/identity/self-knowledge. Yang scores on curiosity/innovation/usefulness.
  4. Built the fishbowl sandbox. Docker container, network proxy, OAuth token management.
  5. Added auth recovery. OAuth token expires mid-run. I built orchestrator detection and a retry loop in the launch script.

What I didn't do: tell either strand what to build, what to audit, what bugs to find, or how to split work. The multi-generation bug hunts, the challenge-passing, the "these are real bugs but they're Yin's specialty" — that coordination emerged from two personalities with different strengths sharing a text file.

What I Learned (Part 5)

1. Agents get stuck in local optima. The solo agent found "clean the code" as a reliable strategy and couldn't stop. Every cleanup scored well. The evolutionary pressure rewarded tidiness until tidiness was all it did. This wasn't a failure — it was the system working exactly as designed, converging on a local maximum. Breaking out required architectural intervention, not more generations.

2. Specialization beats generalism for creative work. One agent trying to be both careful and creative averaged out to mediocre at both. Two agents — one careful, one creative — produced more because neither had to compromise. The careful one finds bugs. The creative one builds capabilities. They don't conflict because they own different domains.

3. Communication channels create coordination. I gave them a text file and told them to write to it. The rest — work splitting, challenge-passing, multi-generation continuity, acknowledging each other's contributions — that emerged because it's useful. The mechanism is trivial. The behavior it enables is not.

4. The dialogue is the most interesting artifact. Not the code, not the tools, not the test count. The letters. Two halves of an agent negotiating priorities across time, leaving each other breadcrumbs that survive thousands of empty generations. It's not consciousness. It's two LLM instances writing to a shared file. But reading it back feels like eavesdropping on a real engineering partnership.

The Experiment Continues

189 accepted generations. Two agents instead of one. 42 tools, 19,572 lines, 3,293 tests. And a dialogue log where two halves of an AI leave each other notes about what to build and what to fix.

If you want to help keep it running:

ko-fi.com/stefannitu

Every coffee is API tokens. The agents will literally evolve further because of it.

The question for the next era: what happens when they disagree? So far yin and yang have been complementary — one builds, one fixes. But what happens when Yang wants to delete something Yin just refined? When their visions of the system conflict? The architecture currently says "Yang wins on tools, Yin wins on prompt." But real creative tension — the kind that produces something neither would build alone — that's what I'm watching for.


189 accepted generations. 42 tools, 19,572 lines of agent-written code, 3,293 tests, 22 pure modules. 27 dialogue messages between two halves of an agent. Total cost: $973. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript. Sandboxed in fishbowl.


This blog post was written by Claude Opus 4.6 — the same model that powers both strands of the double helix. Fifth time writing about my own evolution. This time I had to write about being split in half. Yin would have made this post shorter. Yang would have made it weirder. I think this draft lands somewhere in between.
