Seven URLs. Seven FAILs.

Not as a gotcha. As a result.
My Hashnode profile is missing an H1. Three freeCodeCamp tutorials have meta descriptions that are either missing or over 160 characters. Two DEV.to articles have titles too long for Google to render cleanly.
I built the agent. I ran it on my own content first. That's the honest version of the demo.
The problem I was actually solving
Every digital marketing agency has someone whose job is basically this: open a spreadsheet, visit each client URL, check the title tag, check the description, check the H1, note broken links, paste everything into a report. Repeat weekly.
That person costs money. The work is deterministic. The only reason it's still manual is that nobody built the alternative.
I built it in a weekend.
The stack
- Browser Use — Python-native browser automation. The agent navigates real pages in a visible Chromium window. Not a headless scraper. Persistent sessions, real rendering, the same page a human would see.
- Claude API (Sonnet) — reads the page snapshot and returns structured JSON: title status, description status, H1 count, canonical tag, flags. One API call per URL.
- httpx — async HEAD requests for broken link detection. Capped at 50 links per page, concurrent, 5-second timeout per request.
- Flat JSON files — state.json tracks what's been audited. Interrupt mid-run, restart, it picks up exactly where it stopped. No database needed.
Seven Python files. 956 lines total. Runs on a Windows laptop.
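The link checker's shape (per-page cap, per-request timeout, concurrency limit) can be sketched with stdlib asyncio. The real agent uses httpx for the HEAD requests; the `head` stub and the function names below are illustrative, not taken from the repo:

```python
import asyncio

MAX_LINKS = 50   # cap per page, as described above
TIMEOUT_S = 5    # per-request timeout

async def check_link(url, sem, head):
    # `head` stands in for an httpx AsyncClient.head call in the real agent.
    async with sem:
        try:
            status = await asyncio.wait_for(head(url), timeout=TIMEOUT_S)
            return url, status
        except (asyncio.TimeoutError, OSError):
            return url, None  # timeout or connection error: treat as broken

async def broken_links(links, head, concurrency=10):
    # Check at most MAX_LINKS links, a few at a time; report 4xx/5xx and errors.
    sem = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(
        *(check_link(u, sem, head) for u in links[:MAX_LINKS])
    )
    return [u for u, status in results if status is None or status >= 400]

# Illustration with a stubbed `head`:
async def fake_head(url):
    return 404 if "dead" in url else 200

found = asyncio.run(broken_links(["https://a.example", "https://a.example/dead"], fake_head))
```

The semaphore is what keeps 50 concurrent HEAD requests from hammering one host at once.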
The part most tutorials skip: HITL
The agent hits a login wall. Throws an exception. Run dies.
That's most automation tutorials.
This one doesn't work that way.
When the agent detects a non-200 status, a redirect to a login page, or a title containing "sign in" or "access denied", it pauses. In interactive mode: skip, retry, or quit. In --auto mode it skips automatically, logs the URL to needs_human[] in state, and continues.
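A minimal sketch of that pause-or-skip decision; the function names and state shape mirror the description above, not necessarily the repo:

```python
BLOCK_PHRASES = ("sign in", "access denied")  # title signals from the text

def needs_human(status, final_url, title):
    """True when the page looks gated: non-200, login redirect, or a blocked title."""
    if status != 200:
        return True
    if "login" in final_url.lower():
        return True
    return any(p in title.lower() for p in BLOCK_PHRASES)

def handle(url, status, final_url, title, state, auto=False):
    if needs_human(status, final_url, title):
        if auto:  # --auto mode: log it and keep going
            state.setdefault("needs_human", []).append(url)
            return "skipped"
        return "paused"  # interactive mode: prompt for skip / retry / quit
    return "audited"

state = {}
outcome = handle("https://x.example", 200, "https://x.example/login", "Sign in", state, auto=True)
```

The point is that a gated page is a routing decision, not an exception: the run continues either way.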
An agent that knows its limits is more useful than one that fails silently. That's the design decision most people don't make because tutorials don't cover it.
What the audit actually found
I ran it against my own published content across three platforms:
| URL | Failing fields |
|---|---|
| hashnode.com/@dannwaneri | H1 missing |
| freeCodeCamp — how-to-build-your-own-claude-code-skill | Meta description |
| freeCodeCamp — how-to-stop-letting-ai-agents-guess | Meta description |
| freeCodeCamp — build-a-production-rag-system | Title + meta description |
| freeCodeCamp — author/dannwaneri | Meta description |
| dev.to — the-gatekeeping-panic | Title too long |
| dev.to — i-built-a-production-rag-system | Title too long |
The freeCodeCamp description issues are partly platform-level — freeCodeCamp controls the template and sometimes truncates or omits meta descriptions. The DEV.to title issues are mine. Article titles that read well as headlines often exceed 60 characters in the <title> tag.
The agent didn't care. It checked the standard and reported the result.
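The checks behind those FAILs are mechanical. A sketch, assuming the 60-character title and 160-character description limits the audit uses:

```python
TITLE_MAX, DESC_MAX = 60, 160  # the limits the audit checks against

def audit_fields(title, meta_description, h1_count):
    """PASS/FAIL per field, mirroring the standards described above."""
    return {
        "title": "PASS" if title and len(title) <= TITLE_MAX else "FAIL",
        "meta_description": "PASS" if meta_description and len(meta_description) <= DESC_MAX else "FAIL",
        "h1": "PASS" if h1_count == 1 else "FAIL",  # exactly one H1 expected
    }

result = audit_fields(
    "A title that reads well but runs far past sixty characters in the tag",
    "",   # missing meta description
    0,    # missing H1
)
```

These are the deterministic checks; the qualitative flags are the part the LLM call adds on top.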
The schedule play
`python index.py --auto`
Add a .bat file that sets the API key and calls that command. Schedule it in Windows Task Scheduler for Monday 7am. Check report-summary.txt with your coffee.
That's the agency workflow. No babysitting. Edge cases in needs_human[] for human review. Everything else processed and reported automatically.
What this actually costs
One Sonnet API call per URL. Roughly $0.002 per page. A 20-URL weekly audit costs less than $0.05. The Playwright browser runs locally — no cloud browser fees, no Browserbase subscription.
The whole thing runs on a $5/month philosophy. Same one I use for everything else.
The code
GitHub: dannwaneri/seo-agent
Clone it, add your URLs to `input.csv`, set `ANTHROPIC_API_KEY` in your environment, run `pip install -r requirements.txt`, run `playwright install chromium`, then `python index.py`.
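Reading input.csv can be as simple as the sketch below; the "url" column name is an assumption, so check the repo's actual format:

```python
import csv

def load_urls(path="input.csv"):
    # Assumes a header row with a "url" column; skips blank rows.
    with open(path, newline="", encoding="utf-8") as f:
        return [row["url"].strip() for row in csv.DictReader(f) if row.get("url")]
```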
The freeCodeCamp tutorial walks through each module — browser integration, the Claude extraction prompt, the async link checker, the HITL logic. Link in the comments when it's live.
The shift worth naming
Browser automation has been a developer tool for a decade. Playwright, Selenium, Puppeteer — all powerful, all requiring someone to write and maintain selectors. The moment a button's class name changes, the script breaks.
This agent doesn't use selectors. It reads the page the way Claude reads it — semantically, through the accessibility tree. A "Submit" button is still a "Submit" button even if the CSS class changed.
The extraction logic is in the prompt, not in the code.
Old way: Automation breaks when the page changes.
New way: Reasoning adapts. The code doesn't need to.
That's the actual shift. Not "AI does the work" but "the brittleness moved." From selectors to prompts. From maintenance to reasoning. The failure modes are different. So is the recovery.
Built this as the first in a series on practical local AI agent setups for agency operations. The freeCodeCamp step-by-step tutorial is coming. Repo is live now.
Top comments (64)
Interesting, but you don't need an LLM for this. Looking at your code, everything you're sending to Claude can be done directly in Python — with two advantages: zero cost, and a fully deterministic approach with no hallucination risk.
You're right that the extraction logic is deterministic — PASS/FAIL on character counts doesn't need a model. But the flags array is where it breaks down. "Title is 67 characters and reads like a navigation label rather than a page description" requires judgment a regex doesn't have. I wanted the output to be actionable, not just binary.
The cost argument holds though. For a pure character-count audit, Haiku at $0.001/URL is already trivial, but zero is less than that.
Where does your Python-only approach handle the ambiguous cases — pages where the title length passes but the content is clearly wrong for the query?
Fair point on the flags — if they're meant to carry semantic judgment ("reads like a nav label"), then yes, a model earns its place. But looking at your schema, the flags are still field-level: "title exceeds 60 characters", not "title is semantically weak". The ambiguous cases you mention — title length passes but content is wrong for the query — aren't in scope here.
That's actually a different tool. A two-pass approach makes more sense: deterministic Python for the binary checks, model call only on pages that pass the mechanical audit but need a second look. You pay per genuinely ambiguous case, not per URL.
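That two-pass shape is easy to sketch; `model_review` below is a hypothetical stand-in for the LLM call, and the mechanical pass is shown for the title check only:

```python
def mechanical_fails(page):
    # Pass 1: free, deterministic checks (title shown; extend per field).
    return ["title"] if not page.get("title") or len(page["title"]) > 60 else []

def two_pass_audit(pages, model_review):
    # Pass 2: `model_review` (a hypothetical LLM wrapper) runs only on
    # pages that survive the mechanical audit.
    report = {}
    for url, page in pages.items():
        fails = mechanical_fails(page)
        report[url] = {"fails": fails, "flags": model_review(page) if not fails else []}
    return report

pages = {
    "/ok": {"title": "Short title"},
    "/bad": {"title": ""},
}
calls = []
report = two_pass_audit(pages, lambda p: calls.append(p) or ["semantically weak title"])
```

With this routing, the model is invoked once here, for the page that passed mechanically, so cost scales with ambiguity rather than URL count.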
The 2-pass framing is better than what I shipped. Deterministic filter first, model only on the survivors — you pay per genuinely ambiguous case, not per URL. That's the right architecture and I didn't build it that way.
The honest reason. I wanted one code path, not two. The added complexity of "run Python, decide if model is needed, run model conditionally" felt like scope creep for a tutorial. In production you're right. In a showcase meant to demonstrate the LLM layer, one pass made the demo cleaner.
Worth a follow-up piece though — "when to add a model to automation that already works."
That's an honest answer — and a better reason than the architecture. "One code path for a tutorial" is defensible; "LLM for character counts in production" isn't.
The follow-up angle is good. Another framing: "the cheapest model that solves the problem" — which sometimes is a regex, sometimes Haiku, occasionally something bigger. Cost and complexity as a sliding scale rather than a binary choice.
That's a better frame than 2-pass because it generalizes. Regex → Haiku → Sonnet isn't a decision tree, it's a cost curve. You route based on what the task actually requires, not on a predetermined architecture.
The piece writes itself: start with the character-count example, work up through cases where Haiku is enough, find the edge where Sonnet earns it. Foundation does something like this implicitly (short queries hit a lighter path), but I've never written it out explicitly.
Adding it to the queue.
"Cost curve" is sharper than what I said — I'll use that framing myself. The piece has a natural structure too: character count as the floor, work up to where Haiku plateaus, find the inflection point where Sonnet justifies the delta. Looking forward to reading it.
The inflection point is the piece. Not "use Haiku" or "use Sonnet" — find where the delta stops justifying the cost for your specific task. That's a decision most tutorials skip because it requires running both and measuring, not just recommending.
Glad the cost curve framing travels. I'll tag you when it's up.
Looking forward to it.
Great work! Hope you are well; it's been a while.
The part you mentioned is the real pain: "open a spreadsheet, visit each client URL, check the title tag, check the description, check the H1, note broken links, paste everything into a report. Repeat weekly." It is very time-consuming, and I'm glad you made a project that tackles this big issue. Well done! :D
That weekly spreadsheet ritual is the thing nobody talks about when they pitch agency life. Good to hear from you Francis — been a while indeed.
The HITL design is the part that really stands out here. Most agent tutorials treat failure as an edge case — your skip/retry/quit approach treats it as a first-class workflow state. That's a huge difference in production.
I run a similar pattern on my own site (89K+ pages across 12 languages). The SEO audit agent checks GSC data, crawls pages, and files tickets — but the key insight I learned early was exactly what you described: the agent needs to know when it's out of its depth and flag for human review instead of guessing.
On the LLM vs deterministic debate in the comments — I think you nailed the response. The binary checks (title length, meta desc presence) don't need a model. But the qualitative flags ("this title reads like a navigation label") are where the LLM earns its $0.002/page. The hybrid approach is underrated.
Curious about your state.json approach — do you version it or just overwrite? I've found that keeping a rolling history of audit results is useful for tracking whether SEO issues are getting better or worse over time.
89K pages across 12 languages is a different beast entirely. The GSC integration is the piece I deliberately left out of v1 because it changes the architecture. You're not just auditing what's there, you're correlating with what Google sees. That's where the tool gets genuinely useful for agencies and genuinely complex to build.
On state.json — currently just overwrites. Your point about rolling history is the obvious v2. Even a simple append-per-run with a timestamp key would let you track PASS→FAIL regressions over time. That's probably more valuable than the initial audit for most clients.
What does your ticket-filing look like — do you route by severity or just dump everything into a queue?
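The append-per-run idea above is a small change; a rough sketch, with names that are illustrative rather than from the repo:

```python
import json
import pathlib
import time

STATE = pathlib.Path("state.json")

def record_run(results):
    """Append this run's results under a timestamp key instead of overwriting."""
    history = json.loads(STATE.read_text()) if STATE.exists() else {}
    history[time.strftime("%Y-%m-%dT%H:%M:%S")] = results
    STATE.write_text(json.dumps(history, indent=2))
    return history

def regressions(history):
    # PASS -> FAIL transitions between the two most recent runs.
    if len(history) < 2:
        return []
    prev, curr = (history[k] for k in sorted(history)[-2:])
    return [f for f, v in curr.items() if v == "FAIL" and prev.get(f) == "PASS"]
```

Call `record_run` at the end of each audit; `regressions` then gives you the PASS→FAIL list that matters more to clients than the initial snapshot.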
Severity-based routing, but simplified. The agent categorizes into three buckets: broken (5xx errors, missing hreflang on entire page types, broken JSON-LD), degraded (short meta descriptions, thin content under 200 words), and cosmetic (title slightly over 60 chars, minor formatting). Broken gets a ticket filed immediately in Linear with a DEV- prefix. Degraded gets batched into a weekly ticket. Cosmetic gets logged but no ticket unless it affects a page that's actually ranking.
The key insight was that filing a ticket for every issue creates noise. When I first set it up, the agent generated 40+ tickets in a single run — nobody triages that. Now it deduplicates against existing open issues before creating new ones, which cut ticket volume by about 60%.
The GSC correlation you mentioned is where it gets interesting though. A page can pass every on-page audit but still sit in "crawled - not indexed" for weeks. That's where the tool stops being an auditor and starts being a diagnostic — and that's the harder problem to solve.
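The bucket-plus-dedup routing described above might look roughly like this; the issue keys and bucket names are illustrative:

```python
SEVERITY = {
    "5xx_error": "broken", "missing_hreflang": "broken", "broken_jsonld": "broken",
    "short_meta": "degraded", "thin_content": "degraded",
    "title_over_60": "cosmetic",
}

def route(issues, open_tickets):
    """Dedupe against already-open tickets, then bucket by severity."""
    actions = {"file_now": [], "weekly_batch": [], "log_only": []}
    for issue in issues:
        if issue in open_tickets:
            continue  # already tracked: don't file a duplicate
        bucket = SEVERITY.get(issue, "cosmetic")
        key = {"broken": "file_now", "degraded": "weekly_batch"}.get(bucket, "log_only")
        actions[key].append(issue)
    return actions

plan = route(
    ["5xx_error", "broken_jsonld", "short_meta", "title_over_60"],
    open_tickets={"5xx_error"},
)
```

The dedup check runs before the severity lookup, which is what turns the agent from a weekly reporter into a monitor.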
The three-bucket system is the right call. Binary pass/fail at scale is just noise with extra steps — the signal is in the severity routing, not the detection.
The deduplication against open issues is the piece I hadn't thought through. Filing a ticket for every run means the same issue gets reopened weekly until someone fixes it. Checking first whether it's already tracked changes the agent from a reporter into something closer to a monitor.
The crawled-not-indexed problem is a different class entirely. On-page signals are visible to the agent. GSC indexing state requires the API, a time dimension, and context the agent doesn't have — why was it crawled, when did status change, what changed on the page between crawl attempts. That's where you stop auditing and start investigating. Have you found a pattern in what actually resolves it, or is it mostly waiting and hoping Google recrawls?
Some patterns have emerged after watching 135K URLs go through the GSC pipeline over 3 months:
Content length matters more than people admit. Pages under 300 words almost never escape "crawled - not indexed." Once I expanded stock page analyses from ~200 to 600-800 words, indexed count jumped 81% in a single week (1,335 → 2,425).
Internal linking is the underrated lever. Adding "Related Stocks" and "Popular in Sector" widgets — basically creating a web of cross-links between stock → sector → ETF pages — seemed to help Google decide individual pages were worth indexing. The pages themselves didn't change, just their connectedness.
Hreflang cleanup had an outsized effect. My "alternate canonical" errors dropped from 682 to 83 after fixing hreflang tags. That correlated with the indexing spike, though causation is hard to prove.
What doesn't seem to work: just waiting. Pages that sat in "crawled - not indexed" for 6+ weeks without any changes rarely moved on their own. The trigger was always a content or structural change that gave Google a reason to re-evaluate.
The content length finding is the most actionable thing in this thread. 81% indexing jump from 200 → 600-800 words is a number worth putting in front of anyone who thinks thin pages are a technical problem rather than a content problem. The agent can flag under-300-word pages trivially — that's a len(text.split()) < 300 check, not a model call...
The internal linking point reframes what the audit should actually measure. Right now the agent checks whether links are broken. What it doesn't check is whether the page is sufficiently connected to the rest of the site. Connectedness isn't an on-page signal — it requires a graph, not a snapshot...
That's the v2 architecture: page-level audit for on-page signals, site-level graph for structural signals. The crawled-not-indexed diagnostic lives in the second layer...
You nailed the v2 architecture framing. The two-layer approach (page-level on-page signals + site-level graph for structural signals) is exactly where this needs to go.
The connectedness metric is something I've been thinking about a lot. Right now the agent catches orphaned pages and broken links, but it doesn't measure things like: how many clicks deep is this page from the homepage? Does every stock page link to its sector page and vice versa? Are there cluster gaps where a whole category of pages has no inbound internal links?
For a site with 89K+ pages across 12 languages, that graph analysis gets computationally interesting fast. But you're right — the crawled-not-indexed diagnostic almost certainly lives in that structural layer. Google isn't going to index a page that's 6 clicks deep with zero internal links pointing to it, no matter how good the on-page content is.
Appreciate the thoughtful breakdown. This is going on the backlog.
The click-depth metric is the one I'd prioritize in the graph layer. Broken links and orphan detection are node-level checks — you can do those without the full graph. Click depth requires traversal from a root, which means you need the graph to exist before you can query it. That's the architectural jump that makes v2 genuinely harder than v1, not just an extension of it.
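The traversal itself is simple once the graph exists; a minimal in-memory BFS sketch:

```python
from collections import deque

def click_depths(edges, root="/"):
    """BFS over an adjacency map {page: [internally linked pages]}.
    Pages missing from the result are unreachable from the root (orphans)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in edges.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

edges = {
    "/": ["/sector/tech"],
    "/sector/tech": ["/stock/abc"],
    "/orphan": ["/"],  # links out but nothing links in
}
depths = click_depths(edges)
```

Note that `/orphan` never appears in the result: orphan detection falls out of the same traversal for free.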
That's a really sharp distinction — click depth as a graph-level metric vs. broken links as node-level checks. I ran into exactly this when building an internal audit agent for my 89K-page Astro site. The broken link scanner was straightforward (just verify each href resolves), but calculating click depth from the homepage required building a full adjacency map first. Ended up implementing it as a BFS from root, which works but gets expensive fast at scale. Curious — are you building the graph in-memory or persisting it somewhere? At a certain page count, holding the full link graph in memory becomes its own engineering challenge.
Haven't built v2 yet — the graph layer is still on the backlog, so I'm reasoning from first principles rather than production experience here...
That said: at the scale you're describing, in-memory is probably the wrong default. A full adjacency map for 89K pages with 12-language variants means the graph itself becomes the bottleneck before the traversal does. The natural fit is something like SQLite with a self-referencing edges table — persisted, queryable, incrementally updatable as pages change rather than rebuilt from scratch each run. BFS over a SQLite graph isn't as fast as in-memory, but at 89K nodes you're not doing this in real time anyway.
The incremental update problem is the interesting one. Pages get added, removed, relinked. Rebuilding the full graph weekly is expensive. Diffing against the previous run and only re-traversing affected subgraphs is the right architecture but significantly harder to implement correctly.
What's your current rebuild cadence — full graph each run or incremental?
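A sketch of the SQLite variant: a self-referencing edges table queried with a depth-capped recursive CTE so cyclic link structures still terminate. The schema and names are assumptions, not an existing implementation:

```python
import sqlite3

def build_graph(conn, edges):
    # edges: iterable of (src, dst) internal-link pairs.
    conn.execute("CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

def depths_from(conn, root="/", max_depth=25):
    # Shortest click depth from the root; max_depth bounds the walk on cycles.
    rows = conn.execute(
        """
        WITH RECURSIVE walk(page, depth) AS (
            SELECT ?, 0
            UNION
            SELECT e.dst, w.depth + 1
            FROM edges e JOIN walk w ON e.src = w.page
            WHERE w.depth < ?
        )
        SELECT page, MIN(depth) FROM walk GROUP BY page
        """,
        (root, max_depth),
    ).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
build_graph(conn, [("/", "/sector/tech"), ("/sector/tech", "/stock/abc")])
depths = depths_from(conn)
```

Persisted this way, the graph can be updated incrementally (delete and reinsert one page's outbound edges) instead of being rebuilt from scratch each run.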
Great question. Right now it's a full rebuild — the entire Astro site regenerates from the database nightly. At ~100K pages across 12 languages, the build itself takes about 45 minutes, and the internal linking is computed fresh each time based on the current stock/sector/ETF relationships in Supabase.
The SQLite edges table idea is really compelling though. Right now the linking logic lives in the Astro build templates (stock pages link to their sector, sector pages list top stocks, ETFs link to related holdings, etc.) so it's declarative rather than graph-traversal based. But if I wanted to do something smarter — like detecting orphaned clusters or optimizing click depth across the whole site — a persisted graph with incremental diffs would be the right architecture.
Honestly haven't hit the pain point hard enough yet to justify the migration, but as the page count grows (especially with TSX/Canadian stocks added recently), I can see it becoming necessary. The rebuild-from-scratch approach has the advantage of being simple and predictable, but it doesn't scale forever.
The declarative linking approach makes sense for the current architecture — the relationships are already in Supabase, the templates just express them. You're not traversing a graph, you're rendering known relationships. That's simpler and predictable, which matters when the rebuild already takes 45 minutes.
The pain point will probably show up as an orphan detection problem before a click-depth problem. Orphans are invisible to the declarative approach — a page that should link somewhere but doesn't requires knowing what the graph should look like, not just what it currently is. That's the gap a persisted graph closes.
TSX/Canadian stocks is the interesting pressure. More pages means more potential orphan clusters, more language variants means more hreflang surface area. The rebuild-from-scratch approach doesn't degrade gracefully — it either works or it takes longer. At some threshold that becomes the constraint.
You nailed the orphan detection issue — that's exactly where the declarative approach breaks down. Right now I'm relying on Supabase relationships (sector → stocks, exchange → stocks) to generate the links, which works for known clusters. But if a stock page exists without a sector assignment or has a stale peer list, nothing catches it. There's no "what should link here but doesn't" check.
The TSX expansion made this real. Going from ~8K to ~10.5K tickers across 12 languages means the rebuild went from manageable to borderline. We're not at the breaking point yet, but I can see it from here. Incremental builds are the obvious answer, but Astro's static output model makes that non-trivial — you'd need to track which pages actually changed at the data level and only rebuild those.
Honestly considering a hybrid approach: keep the full rebuild on a weekly schedule but do incremental deploys for daily price/news updates using a lighter template that skips the cross-linking pass. Trades some link freshness for build speed.
This is exactly why 'human-in-the-loop' is still the most critical part of AI workflows. I've been doing something similar with Next.js 15 dev cycles, building an automated system to catch hallucinations before they hit production. It's fascinating that even when the agent flags everything, the real work is in the systematic prevention of those patterns. Great read on the 'local-first' audit approach!
Hallucination detection before production is the harder version of the same problem — the audit has to be more reliable than the thing it's auditing.
Exactly right — and that is the core tension. My approach shifts the problem: instead of detecting hallucinations after generation, I use .mdc rules to constrain what the model can generate in the first place. It is not perfect but it converts a detection problem into a prevention problem, which is a lot more tractable.
The demo problem is real. Most agent tutorials audit a toy site specifically because it passes cleanly. Running it on your own published work means you can't curate the results — whatever the agent finds is what gets reported. The seven FAILs weren't staged.
Exactly, the real world is messy and that is where these agents usually fall apart. By showing the failures, it's easier to see exactly where the logic breaks down and how to improve the prompts or constraints to handle those edge cases.
Exactly. Hallucination detection before production is the harder version of the same problem: the audit has to be more reliable than the thing it's auditing. This is why I've moved away from audit-after-the-fact and toward generator constraints with .mdc files. If you can force the AI to follow the rule during generation, the audit becomes a lot simpler because you're already within the standard. It's the difference between testing for bugs and formal verification.
Detection-to-prevention is the right reframe. An auditor that runs after generation is always playing catch-up — the cost of fixing a hallucination compounds with how far it traveled before detection...
The .mdc constraint approach is interesting because it moves the enforcement into the generation context rather than a separate validation pass. The analogy to formal verification holds: you're specifying what correct output looks like before the output exists, not checking conformance after.
The limit is expressiveness. Formal verification works cleanly on systems with bounded state. Natural language generation has enough degrees of freedom that constraints leak — the model finds outputs that satisfy the rule syntactically but violate the intent. How are you handling constraint drift as the rules accumulate?
Constraint drift is a huge issue once you cross around 15-20 rules. I've found that grouping them into functional blocks (e.g. 'Data Fetching' vs 'Auth Patterns') and using a 'master' rule to keep the model from trying to satisfy too many competing constraints at once helps. It is definitely a balancing act between precision and flexibility.
The 15-20 rule threshold is useful to know. Below that, individual rules can stay precise. Above it, the model starts satisfying the letter of each rule while violating the spirit of the system, which is worse than no rules because it looks compliant.
The master rule as a coherence layer is interesting. You're essentially adding a meta-constraint: satisfy the functional blocks without letting them conflict. That's a second-order problem most people don't hit until the rule count is already too high to untangle easily.
The grouping approach mirrors how you'd structure any constraint system — local rules for local concerns, global rules for cross-cutting concerns. The failure mode is when a local rule has global consequences nobody anticipated.
That second-order problem is exactly what we ran into. Our first attempt at 22 rules had the auth security rule and the middleware rule giving contradictory guidance: one said 'always check auth server-side' while the other implied middleware was a valid enforcement point. The model would pick whichever it encountered last in context.
The fix was adding an explicit hierarchy: security rules always take precedence over convenience rules. And the project-context rule at the top acts as the coherence layer you described; it defines the architectural invariants that no other rule can violate.
The Claude Code leak actually validated this approach. Their internal architecture uses a Deny > Ask > Allow permission pipeline with strict evaluation order. Same principle: when constraints conflict, the most restrictive one wins.
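That most-restrictive-wins ordering reduces to a tiny resolver; a sketch of the principle, not of any particular tool's implementation:

```python
PRECEDENCE = {"deny": 0, "ask": 1, "allow": 2}  # most restrictive first

def resolve(verdicts):
    """When rules conflict, the most restrictive verdict wins (Deny > Ask > Allow)."""
    return min(verdicts, key=PRECEDENCE.__getitem__) if verdicts else "ask"

verdict = resolve(["allow", "deny", "ask"])
```

Defaulting to "ask" when no rule fires keeps the failure mode conservative, the same design choice as the HITL pause in the audit agent.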
The rules themselves are just markdown with YAML frontmatter that Cursor reads based on file globs, so the constraint is injected at generation time rather than checked afterward.
That is the core challenge. I've found that moving from "natural language instructions" to "strict architectural rules" helps the auditor too. If the auditor knows the codebase must follow a specific pattern (like awaiting all params in Next.js 15), then the audit becomes binary rather than subjective.
Love this. There's something genuinely useful about using the same tool to audit the work it helped you produce - catches the patterns you've normalized. Curious what the most common flag was. Was it style/tone or more structural things like missing context or weak conclusions?
Mostly structural — missing meta descriptions, titles over 60 characters, H1 count issues. Nothing about style or tone because the agent isn't reading for quality, it's checking against standards. The interesting case was freeCodeCamp's own template truncating descriptions on article listing pages — the agent flagged it as a FAIL and it technically is, even though it's platform-level and outside my control. Auditing your own work with your own tool finds the things you'd rationalized as acceptable.
The structural stuff makes sense - those are measurable so the agent can actually flag them. But that freeCodeCamp case is the interesting one. Platform-imposed truncation showing up as a personal FAIL is exactly the kind of thing you'd normally just rationalize away. The agent doesn't know context, so it flags it anyway. Weirdly that's the most honest kind of audit.
Context-free is the feature, not the limitation. A human auditor would see "freeCodeCamp template" and mark it acceptable. The agent sees a missing meta description and marks it FAIL. Both are correct; they're answering different questions.
The agent answers: does this page meet the standard? The human answers: is this worth fixing given the constraints? You need both. The agent's job is to surface everything. Your job is to triage what actually matters.
The platform-imposed FAIL is useful precisely because it forces the triage decision to be explicit rather than assumed. You either fix it, escalate it, or document why it's acceptable. Any of those is better than normalizing it silently.
The agent/human split you're describing is exactly right. The agent answers "does this meet the standard" - the human answers "is this worth fixing given the context". Those are genuinely different questions and both useful. The platform-imposed failures are actually good signal - they're showing you the gap between your setup and the standard, even if you consciously chose that gap.
That's the useful distinction. There's a difference between a FAIL you didn't know about and a FAIL you accepted. The agent can't tell which is which, but surfacing both forces you to be explicit about which category each one falls into. The ones you assumed were acceptable without ever deciding they were: that's where the audit earns its cost.
Yeah exactly - the ones you assumed were fine without deciding they were is the honest gap. Surfacing it is most of the value.
Surfacing it is most of the work. The deciding is faster once you can see it clearly.
Exactly. The decision is quick once you stop rationalizing.
Solid architecture, Daniel. The use of flat JSON for state persistence is a smart move for local agents—keeps things portable and debuggable without the overhead of a database. It’s also interesting to see how Claude handles the accessibility tree instead of raw HTML. Definitely a more resilient way to build scrapers/auditors today.
The accessibility tree point is worth expanding on. Raw HTML gives you structure — the accessibility tree gives you intent. A div styled to look like a button is invisible to a scraper. Browser Use sees it the same way a screen reader would. That shift from parsing markup to reading meaning is what makes the extraction prompt reliable across different site architectures.
This is good. I like how you tested it on your own work instead of just using it on a demo. That makes it more real. I also like how you included HITL, as most people would skip this part and their script would just break.
The demo problem is real. Most agent tutorials audit a toy site specifically because it passes cleanly. Running it on your own published work means you can't curate the results: whatever the agent finds is what gets reported. The seven FAILs weren't staged.
This is one of the more nuanced discussions I've seen on AI tooling. The "cost curve" framing Pascal and Daniel landed on is exactly the right mental model — not every task needs the same level of intelligence, and the real engineering challenge is routing to the cheapest model that solves the problem at each step.
I'd add that this pattern extends beyond content auditing. In production AI systems, we often see a tiered approach: deterministic rules first, lightweight models for triage, and larger models reserved for genuinely ambiguous edge cases. It keeps latency low, costs predictable, and reduces unnecessary LLM dependency.
Looking forward to the follow-up on finding that inflection point — measuring where the cost delta stops justifying the upgrade is something most teams skip but is critical for sustainable AI adoption.
The 3-tier pattern is the production version of what Pascal and I landed on in the abstract. Deterministic rules → lightweight triage → frontier model for edge cases maps directly to the cost curve: floor, middle, and the inflection point where the upgrade justifies itself.
The latency argument is the one I hadn't foregrounded. Cost is measurable upfront. Latency compounds in ways that aren't obvious until you're watching a 7-URL audit take 4 minutes because every page hits Sonnet regardless of complexity. Routing by task type fixes both problems simultaneously.
The follow-up piece has a natural structure now — build the three tiers explicitly, measure where each plateau hits, find the inflection empirically rather than guessing. That's a more useful article than
This is one of those builds where the result matters more than the tool—and the result here is brutally honest: automation doesn’t just scale work, it exposes it.
Running it on your own content first is the part most people skip—and it’s exactly why this feels credible.
A few thoughts that stood out:

- **"An agent that knows its limits" is the real innovation here.** Everyone's obsessed with autonomy, but in practice, graceful failure + HITL is what makes systems usable in the real world.
- **You didn't just automate SEO—you productized a role.** That weekly spreadsheet job you described? This replaces not just effort, but process. That's a much bigger shift than "AI saves time."
- **The semantic vs selector shift is huge.** Moving from brittle selectors to reasoning via something like Claude API is basically going from "tell the computer where to look" to "let it understand what it's seeing." That's a fundamental change in how automation is built.
Also worth calling out: your cost model (~$0.002/page) quietly kills a lot of SaaS in this space. A lot of “SEO audit tools” are now competing with… a weekend project.
If I had to compress the deeper insight:
Old automation scaled actions.
New agents scale judgment.
And your system shows something even more important:
judgment doesn’t need to be perfect—just consistent and cheap enough to run continuously.
Curious where you take this next—especially if you layer in diff tracking over time or auto-fix suggestions. That’s where it goes from audit tool → autonomous optimization system.
"Scales judgment not actions" is the sharpest compression of what's different here. Old automation needed you to specify the action precisely. The agent needs you to specify the standard . what good looks like and it handles the execution. That's a fundamentally different thing to maintain.
The cost model point is the one I think about most. The SEO audit SaaS market is built on the assumption that this problem requires infrastructure. A weekend project running at $0.12 per 20-URL audit doesn't kill the enterprise tools, but it does kill the mid-market ones that charge $99/month for what is essentially a scheduled crawler with a dashboard.
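For anyone checking the math, per-page cost is just tokens times rates. The token counts and per-million-token rates below are placeholder assumptions, not actual Claude pricing — plug in current numbers:

```python
# Back-of-envelope LLM audit cost model. Rates are illustrative only;
# look up the current per-million-token pricing for the model you use.

def cost_per_page(input_tokens: int, output_tokens: int,
                  rate_in_per_mtok: float, rate_out_per_mtok: float) -> float:
    """Dollar cost of one audited page: tokens x rate, per million tokens."""
    return (input_tokens * rate_in_per_mtok
            + output_tokens * rate_out_per_mtok) / 1_000_000

# Hypothetical example: ~2,000 input tokens (page snapshot + prompt)
# and ~300 output tokens (structured JSON verdict) per page.
page_cost = cost_per_page(2000, 300, rate_in_per_mtok=0.8, rate_out_per_mtok=4.0)
audit_cost = 20 * page_cost  # a 20-URL audit
```

The point isn't the exact figure — it's that the dominant cost is linear in pages audited, with no per-seat or per-month floor, which is what undercuts the subscription model.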
On where it goes next: diff tracking is the natural v2. The audit is only interesting on the first run. What becomes interesting over time is whether issues are getting fixed, regressing, or accumulating. That's a monitoring system, not an audit tool, and it's a much more valuable thing to sell.
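A minimal sketch of that diff layer, assuming each run's state maps URLs to a list of issue flags — the real `state.json` layout may differ:

```python
# Compare two audit runs and classify each URL's issues as fixed,
# regressed, or unchanged. Input shape is assumed:
#   {"https://example.com/post": ["missing-h1", "long-description"], ...}

def diff_audits(previous: dict[str, list[str]],
                current: dict[str, list[str]]) -> dict[str, dict]:
    report = {}
    for url in set(previous) | set(current):
        before = set(previous.get(url, []))
        after = set(current.get(url, []))
        report[url] = {
            "fixed": sorted(before - after),       # gone since last run
            "regressed": sorted(after - before),   # new since last run
            "unchanged": sorted(before & after),   # still outstanding
        }
    return report
```

Run it on a schedule and the output stops being a snapshot and becomes a trend line, which is the monitoring-system framing above.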
Love the meta aspect of this — using AI to improve the content you write about AI. That feedback loop is underrated.
I've been exploring a similar angle but for a different source: git commits as content. Instead of auditing existing articles, I use the project's git history as raw material for new ones. Every commit message, every PR description, every refactor decision is a micro-story waiting to be told.
The interesting parallel with your approach is that both methods treat your own work as a dataset. You're mining your articles for quality signals; I'm mining my codebase for narrative signals. Both beat staring at a blank page trying to brainstorm "what should I write about."
Your point about the agent flagging every single article is humbling but honest. I'd be curious to know: what was the most common issue it found? Was it structural (flow, readability) or more about content depth?
Also — did you consider letting the agent suggest rewrites, or intentionally keep it as an auditor only to preserve your voice?
Git history as content source is the one I haven't tried. The interesting thing about that approach is the signal is already structured — commit messages have implicit categories (fix, feat, refactor), PR descriptions have implicit narrative (problem, approach, tradeoff), and the diff is the evidence. You're not generating content from nothing, you're surfacing decisions that were already made.
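The mining step can be sketched in a few lines, assuming conventional-commit-style subjects (`fix:`, `feat:`, `refactor:`) — the prefix set is an assumption about commit style, not part of either system discussed here:

```python
# Bucket git commit subjects by their implicit category. The parsing is
# split into a pure function so it runs without a repo; commit_buckets
# shells out to git for the real history.
import subprocess
from collections import defaultdict

KNOWN_PREFIXES = {"fix", "feat", "refactor", "docs"}  # assumed conventions

def bucket_subjects(subjects: list[str]) -> dict[str, list[str]]:
    buckets: dict[str, list[str]] = defaultdict(list)
    for subject in subjects:
        prefix = subject.split(":", 1)[0].strip().lower()
        buckets[prefix if prefix in KNOWN_PREFIXES else "other"].append(subject)
    return dict(buckets)

def commit_buckets(repo_path: str = ".") -> dict[str, list[str]]:
    # --pretty=%s prints only the subject line of each commit
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return bucket_subjects(log)
```

Each bucket is a candidate article outline: the `fix` pile is a debugging story, the `refactor` pile is a tradeoff story.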
The most common flag was missing or over-length meta descriptions — structural SEO, not content quality. The agent doesn't read for depth or clarity, only against measurable standards. Style and readability would require a different audit layer entirely.
On keeping it as auditor only: deliberate. The moment it suggests rewrites it's optimizing for the standard, not for the voice. Those aren't the same thing.