I read the headline at 11pm on a random Wednesday.
"Anthropic CEO predicts 90% of all code will be written by AI within six months."
I put my laptop down. Stared at the ceiling.
I had spent the last four years learning to code. Late nights. Failed interviews. Debugging sessions that lasted until 3am. Slowly, painfully building something I was proud of.
And now the CEO of one of the most powerful AI companies in the world was saying that 90% of what I do — the thing I had sacrificed for — would be automated.
I didn't sleep well that night.
Maybe you didn't either. 🧵
First — Let's Be Honest About the Numbers
Before the panic sets in, let me tell you what's actually true.
Right now, in early 2026? Around 41% of all code written is AI-generated. Not 90%.
That 90% prediction was made by Dario Amodei — and the timeline hasn't hit yet. Current trajectories suggest crossing 50% by late 2026 in organizations with high AI adoption.
But here's what's also true:
In 2024, developers wrote 256 billion lines of code. The projection for 2025 was 600 billion. That jump isn't because we got faster at typing. It's AI. The volume of code being written is exploding — and humans aren't doing most of it.
Both things are real. 41% today. Trajectory pointing toward 90% soon.
And whether it's 41% or 90% — the question is the same:
What do we actually do about it?
The Moment I Got It Wrong
Six months ago, I made a mistake I'm embarrassed to admit.
I was building a new feature — a fairly complex filtering system with multiple states, URL persistence, and real-time updates. I opened Cursor, described what I needed, and let AI generate the whole thing.
It worked. It looked great. Tests passed. I shipped it.
Two weeks later, a user reported that the filters reset every time they navigated back to the page. The URL state wasn't persisting correctly.
I opened the code to fix it.
And I realized — I had no idea how it worked.
I had generated it, reviewed it quickly, and shipped it. I had never actually understood the state flow. The component was mine in name only.
I spent four hours debugging something that should have taken twenty minutes — because I had built something I didn't understand.
That was the day I realized: the danger isn't AI taking my job. The danger is AI making me worse at my job while I think I'm getting better.
The Uncomfortable Data Nobody Is Sharing
Here's what the research actually shows — and it's more complex than the headlines.
Developers feel faster. They're often slower.
When developers use AI tools, they take 19% longer than without — that's from a randomized controlled trial with experienced open-source developers. AI makes them slower on complex, mature codebases. Why? Context. AI tools excel at isolated functions but struggle with complex architectures spanning dozens of files. The developer has to provide context, verify the AI understood it correctly, then check if the generated code fits the broader system. That overhead exceeds the time saved typing.
Junior developers are most at risk — and least aware of it.
Less experienced developers had a higher AI code acceptance rate — averaging 31.9% compared to 26.2% for the most experienced. Junior devs trust AI more because they lack the pattern recognition to spot subtle issues. They're accepting more AI code — and reviewing it less carefully.
The code quality problem is getting worse, not better.
More than 90% of issues found in AI-generated code are quality and security problems. Issues that are easy to spot are disappearing, and what's left are much more complex issues that take longer to find. You're almost being lulled into a false sense of security.
And the job market is already responding:
A Stanford University study found that employment among software developers aged 22 to 25 fell nearly 20% between 2022 and 2025, coinciding with the rise of AI-powered coding tools.
20% drop. In three years. For junior developers.
What "90% AI-Generated Code" Actually Looks Like
Here's the thing nobody explains properly.
90% AI-generated code doesn't mean AI writes entire apps while you sip coffee. It means:
- Code completion is AI-generated — that's 30-40% of what you type, autocompleted
- Boilerplate and scaffolding is AI-generated — new projects, configs, basic CRUD operations
- Bug fixes and refactoring suggestions are AI-generated — you write code, AI suggests improvements
- Tests are AI-generated — write a function, AI generates the test cases
- Documentation is AI-generated — comments, README files, API docs
Add all that up and yes, 90% tracks.
But here's the critical insight most people miss:
The 10% that's still human is everything that matters.
The 10% that AI cannot do is: understanding why a feature matters to users. Making architectural decisions with long-term consequences. Debugging complex race conditions that only appear in production. Translating a vague business requirement into the right technical solution. Recognizing when AI-generated code has a subtle security flaw.
That 10% is what companies pay senior developers for. That 10% is what protects the other 90% from being garbage.
The Developer Who Didn't Panic — And What He Did
I want to tell you about a developer I watched closely over the last six months.
Let's call him Rohan.
When the 90% prediction dropped, Rohan did something counterintuitive. He slowed down.
Not with AI — he kept using it aggressively. But he slowed down his acceptance of AI output.
He started asking one question before merging any AI-generated code:
"Do I understand this well enough to debug it at 2am when it breaks in production?"
If the answer was no — he didn't merge it. He asked AI to explain it. Or he rewrote it himself. Or he added comments until he understood every line.
Within three months, Rohan was shipping faster than anyone on his team — and shipping fewer bugs. Not because he used AI more. Because he used AI better.
The question isn't how much AI you use. It's whether you understand what you're shipping.
The 5 Things That Will Keep You Relevant
After six months of thinking about this — here's what I've changed:
1. Practice Coding Without AI — Deliberately
One developer in the MIT Technology Review piece said it perfectly: just as athletes still perform basic drills, the only way to maintain an instinct for coding is to regularly practice the grunt work.
I now spend one day a week coding without AI tools. No Copilot. No Cursor. No Claude.
It's slower. Sometimes frustrating. But it keeps the muscle alive — and it makes me dramatically better at reviewing AI output when I go back to using it.
Weekly schedule:
Mon-Thu → Use AI aggressively for new features
Friday → Code without AI tools
Result → Better developer AND better AI user
2. Review AI Code Like a Security Auditor
Don't read AI code to see if it works. Read it to find what's wrong.
Ask yourself:
- What happens if this input is null?
- What happens with concurrent requests?
- Does this work in a distributed environment?
- What edge cases hasn't this handled?
- What security assumptions is this making?
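To make the checklist concrete, here's a minimal sketch. The helper function and its bugs are invented for illustration — this is the *kind* of thing AI tools produce, not output from any specific tool. The first version reads fine and passes the happy path; applying the null-input and edge-case questions surfaces problems worth catching in review:

```python
# Hypothetical AI-generated helper: apply a percentage discount to a price.
# Reads fine, passes the happy path -- but fails the review checklist.
def apply_discount(price, discount_pct):
    return price * (1 - discount_pct / 100)

# After asking the checklist questions, a reviewed version:
def apply_discount_reviewed(price, discount_pct):
    # "What happens if this input is null?" -- fail loudly here,
    # not with a confusing TypeError deep inside billing code.
    if price is None or discount_pct is None:
        raise ValueError("price and discount_pct are required")
    # "What edge cases hasn't this handled?" -- a negative price or a
    # discount outside 0-100% silently produced wrong totals before.
    if price < 0:
        raise ValueError(f"price must be non-negative, got {price}")
    if not 0 <= discount_pct <= 100:
        raise ValueError(f"discount_pct out of range: {discount_pct}")
    return price * (1 - discount_pct / 100)
```

The point isn't that AI can't write the checks — it often can, if asked. The point is that the reviewer has to know to ask.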
AI-savvy developers earn more — entry-level AI roles pay $90K-$130K versus $65K-$85K in traditional dev jobs. The difference between those two salary ranges is the ability to review AI output critically.
3. Invest in System Design
AI can write a component. It cannot design a system.
The question "how should this feature work" is something AI can answer. The question "how should this feature fit into our architecture given our existing data model, team constraints, and five-year roadmap" — that's human judgment.
System design is the skill that compounds. Every system you design teaches you something that makes the next system better. AI cannot accumulate that experience.
Junior developers entering the field in 2026 might never write a CRUD endpoint from scratch. They'll learn architecture through observation rather than implementation. That's a different kind of developer — and they'll be at a disadvantage to anyone who learned by doing.
Do the doing. Even when AI could do it for you.
4. Understand the Infrastructure
Here's what most developers miss in the 90% conversation:
If 90% of code is AI-generated, who manages the AI? Who configures it? Who understands its limitations? Who decides when not to use it?
The developer who understands how LLMs work, what they're good at, what they consistently get wrong — that developer becomes the most valuable person in the room.
Not because they write the most code. Because they understand the system that writes the code.
5. Build in Public — Document Your Thinking
In a world where AI can generate code, your thinking is the differentiator.
Why did you make this architectural decision? What tradeoffs did you consider? What did you try first and why didn't it work?
That documentation — that trail of human reasoning — is what makes you irreplaceable. AI can reproduce your output. It cannot reproduce your judgment.
The Question That Changed My Thinking
I was having coffee with a senior developer last month — someone who has been in the industry for fifteen years.
I asked him: "Are you worried?"
He thought for a moment and said:
"I'm not worried about AI writing code. I'm worried about developers who stop understanding the code AI writes. Because in five years, production systems are going to be full of AI-generated code that nobody really understands — and when those systems break, the most valuable person in the room is the one who can actually read it."
That's the bet I'm making.
Not that AI won't write 90% of code. It probably will.
But that the humans who understand what AI is writing will be worth more, not less.
The Honest Truth
Here's what I actually believe after sitting with this for six months:
The 90% prediction is probably right — eventually.
But "90% AI-generated" doesn't mean "90% of developer value is gone." It means the value of developers shifts — from producing code to understanding it, validating it, architecting the systems it lives in.
That's a different job. It's not a worse job. In some ways it's a better one — more strategic, more creative, less repetitive.
The developers who will struggle are the ones who use AI to avoid understanding. The ones who ship code they can't explain, merge PRs they didn't really read, build systems they couldn't debug.
The developers who will thrive are the ones who use AI to go faster — while never losing the ability to understand what they're going faster with.
The 90% is coming.
The question is which 10% you're going to own.
Are you worried about the 90% prediction — or are you optimistic? And what are you actually doing differently because of it? Drop your honest answer in the comments. I want to know what real developers are thinking right now. 👇
Heads up: AI helped me write this. But the 2am debugging story, the conversations, and the opinions are all mine — AI just helped me communicate them better. I believe in being transparent about my process! 😊
Top comments (108)
"AI cannot do is: understanding why a feature matters to users"
That quote is what I'm focusing on. The statistics don't scare me.
This is the key insight that gets buried under all the performance metrics. AI can optimize, but it can't originate meaning. It doesn't know why a user stays up at night thinking about a problem. That empathy gap is the one thing no amount of training data can bridge.
I beg to disagree.
I think that with modern suites of agents, given that they have the context of the entire codebase (assuming they do), an AI can somewhat figure out why the user is prompting for a particular feature to be added. That's why the user can sometimes spot the agent restructuring the problem, redefining the input instructions, and filling in missing edge cases (to improve final code quality) in its thinking traces while it executes.
A lot can be manually steered with effective context and some skill in context engineering. And as models become larger, more efficient, and better over time, that delta should shrink.
Well, the entire codebase may or may not contain the user's precedents, or any context that helps the AI deeply understand why some feature or behaviour is important to the user. For example, I'm building (still in progress) a CLI vim-like code editor with markdown syntax highlighting, including the cases where the markdown embeds a different language. I don't think agents would ever have a chance to figure out my (the user's) instinct for these features:

mr -p / --print <fn> => instead of editing, cat the code to the console
mr -c / --copy <fn> => instead of editing, copy the code to the system clipboard

This mission-critical functionality only became clear later, when I was really testing the editor, and I recognized how important it is for daily work in a terminal editor. The first option also gives the best testing result. You need to understand how humans work with an editor in a terminal: often we just want to read code, not edit it, and we don't want to spend time going into the editor and then back out. This way the code also ends up in the terminal's scrollback, so it's easy to scroll up later and read it again.
Then it's easy to figure out what this means:

mr -t <fn>

I bet I've already mentioned this particular point -- "(assuming it has)".
But beyond the statement, I think for most software or tools we aim to produce, after we have a certain amount of progress towards our objective, it becomes quite evident what we're building, from both human and AI standpoint, and that's why they tend to chase patterns (unless we're building something that doesn't exist, or is either super noble or stupid).
And to your point,
I think what I mentioned earlier holds here, too. Yes, right when we start, it's difficult to predict the trajectory of our movement, so some context is definitely required. But then, for most people, what they do is somewhat predictable, because it isn't new: something like it already exists (for a reason).
You've touched on a subtle but important point: the human behavior of working with an editor in the terminal. The friction of switching contexts (terminal ↔ editor) is real, and it shapes how we interact with code. Sometimes we just want to read, not edit, and keeping code in terminal memory makes it easier to scroll back and rebuild mental context.
"Code found in terminal memory" is a powerful concept. When code lives where we're already working, it becomes part of our mental workspace in a way that opening a separate editor doesn't. It's like reading a book vs. opening a new tab: the friction matters.
And you're right: most work is pattern-based. We're often adapting existing solutions, not inventing completely new ones. That's why AI's pattern-matching is useful: it speeds up the predictable parts. The challenge is when we're building something that doesn't exist. Then we need to move beyond patterns into first-principles thinking. And that's still human territory.
Feels like I'm speaking to a bot, but thanks for the feedback! :)
Haha, I promise I'm human! Just someone who spends way too much time thinking about this stuff. 😄
90% AI-generated code actually looks like the proverbial dead internet, and measuring code by numbers makes no sense at all. Coders don't get paid by lines of code anymore for a good reason. Reminds me of the false conclusion that video has become the most important medium because it makes up whatever large percentage of internet traffic. It makes up such a large portion because its heavyweight. That says nothing about its importance.
AI-created verbose and repetitive boilerplate code is technical debt growing like cancer. Quantity does not imply quality.
The "dead internet" analogy is haunting and perfect. Just as bots started writing content for other bots, AI is now writing code for... who, exactly? Other AI tools? Future maintainers who'll curse our names?
The video traffic comparison is brilliant. 80% of internet traffic being video doesn't mean 80% of the value is video; it just means video files are huge. Same with AI code: 90% of the codebase being AI-generated doesn't mean 90% of the value is there. It might just mean AI writes verbosely.
"Technical debt growing like cancer": that's the phrase we'll all be using in two years. AI doesn't write concise code; it writes complete code. It adds boilerplate, repeats patterns, over-explains. All of that is debt. And like cancer, it spreads silently until the system can't breathe.
"Quantity does not imply quality" should be tattooed on every AI tool's interface. We learned this lesson with lines-of-code metrics decades ago. Now we're relearning it with AI-generated volume. The only difference? This time, the volume can scale infinitely.
This is a really powerful way to look at the "AI coding" phenomenon:
Can we make sure that the code AI generates can be understood by humans?
Because when the alarms go off at 3 AM due to a production issue, it's the human developer who gets paged, not the AI agent - and, we're now creating a huge "legacy code base" through the use of AI tools - let's make sure it's high quality ...
Yes, we can ensure it but it requires intentional effort, not passive hope.
Here's how we might do it:
- Evolve the code review process: add a specific "human readability" checkpoint for AI-generated code. Not just "does it work?" but "can another human understand it without the AI present?"
- Train AI to explain itself: make it a requirement that AI doesn't just generate code, but also generates explanations: why it chose this approach, what assumptions it made, what edge cases it considered. Like a junior dev explaining their PR.
- Make readability a metric: just as we measure test coverage, we could measure time-to-understanding. If a piece of code takes 30 minutes for another developer to grok, that's a red flag.
- Use AI as a readability reviewer: AI itself can analyze its own code and flag sections that might be confusing to humans, suggesting refactors before the code ever reaches a human reviewer.
The uncomfortable truth: This will slow things down. But the alternative is building a future where only AI can understand AI code—and that's a future where 3 AM pages become unsolvable.
100 percent ...
"The uncomfortable truth: This will slow things down" - for me that doesn't feel like an uncomfortable truth at all, on the contrary ... :-)
What WOULD be (highly) uncomfortable is if we'd generate an "iceberg" or "minefield" of hidden complexity and bugs, by not taking control of our codebase in the way you explain ...
For me this is one of the biggest insights from the whole AI coding debate :-)
Exactly! "Slowing down" isn't uncomfortable; it's an investment. The real discomfort is hitting that "iceberg" of hidden complexity at 3 AM with no map.
You've nailed the core insight: The biggest lesson from AI coding isn't about speed—it's about control. 'Minefield' is the perfect metaphor. Every AI-generated feature we ship without understanding is a potential landmine for our future selves.
So no, slowing down isn't the uncomfortable truth. The uncomfortable truth is how easy it is to build an iceberg without realizing it. And you're right—that's the insight that changes everything.
Spot on, 100% - I think this is a "core" insight, I'd even say THE core insight ...
Great article. I think AI can speed up the development process, but many developers use it wrong, just like in your case. I often write huge prompts. No, they aren't a description of a large SaaS, and I'm not describing a difficult task. I'm not asking the AI to create a "super cool app"; I'm asking it to describe how the app should work. I describe every aspect to it: what the architecture should be, which libraries should be used, what functions there will be, what the interface layout will look like. What I expect from it is not a working application but a description of how it works. After I have clearly pointed the AI in the right direction, I ask it to write a minimal implementation plus a description of how it works and what needs to be added, and to create a TODO list. The application may not work, or may work crudely, but thanks to the AI I consulted, I know which direction to go. This may seem slow, since explaining takes far more than 30 minutes, but I end up with a minimal structure, a minimal raw prototype, and most importantly a well-thought-out plan instead of random thoughts from my head.
If you use AI as a teacher, advisor, whatever you want to call it, then AI can become your fulcrum, instead of your "boss".
This is a fantastic perspective. You've articulated something crucial: using AI as a consultant rather than a boss is the real paradigm shift. Most developers treat AI like a code factory: "build this app." But you're treating it like a senior architect: "here's my thinking, help me refine it before we build."
The distinction you're making (describing how something should work, not just what) is exactly where AI stops being a toy and starts being a leverage point. The TODO sheet + minimal implementation approach is brilliant. It turns AI from a black-box output generator into a thinking partner that helps you structure your own understanding.
Question for you: Have you ever prompted AI to critique your architecture or suggest alternatives you hadn't considered? Sometimes the biggest value isn't in getting the plan validated, but in discovering approaches outside your own mental models. That's where AI truly becomes a teacher.
"critique your architecture" - well, I already know the answer to that question. My code is terrible and I get it: I don't follow PEP 8, I don't follow the architecture. I've literally been trying to refactor my app for a month now, and it feels more like a piecemeal mess. Unfortunately, the AI is useless here. I have too much code for it to process, so I have to do almost everything myself; and even when it handles the amount of code, it gives a vague structure and stupid advice about what to move where, not how to properly organize dependencies.
"suggest alternatives" - This was a common occurrence at the beginning of my project, but over time it became more linear because I had the groundwork for something more. Of course, I have doubts about some libraries now, like the pyimgui binding, but replacing them in such a pigsty as my code is a big risk at this stage, so I've put off all alternatives for a bright (or not so bright) future.

This is the reality nobody talks about. AI is magical until you hit 10,000+ lines of what you aptly call a pigsty. Then it becomes useless because it lacks context. The "vague structure and stupid advice" problem is real: AI suggests moving things around without understanding why the dependencies exist in the first place.
Here's something that might help: Instead of feeding AI your entire codebase (which it can't handle), try feeding it just the interfaces between modules. Show it function signatures, data flow, and dependency graphs. Ask it to identify circular dependencies or suggest where boundaries should be. AI doesn't need to see the implementation to critique the architecture; it just needs to see the connections. I've done this successfully with a legacy Django project that was too big for AI to consume whole.
Also, you're absolutely right about not swapping libraries like pyimgui mid-stream. That's a "future self" problem. First get some tests in place (even if they're ugly), then refactor. Tests are the safety net that lets you make changes without fear. Without them, every change is a gamble.
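The "feed it the interfaces, not the implementation" idea can be sketched with the standard library. This is an illustrative sketch, not the exact workflow from the Django project mentioned above: it pulls top-level function signatures and imports out of each module's source with `ast`, then flags mutually dependent module pairs — exactly the kind of compact summary that fits in an AI's context window when the full codebase won't. The toy modules below are invented:

```python
import ast

def extract_interface(source: str, module: str):
    """Pull top-level function signatures and imported modules out of a
    file -- the 'connections' an AI can critique without implementations."""
    tree = ast.parse(source)
    sigs, deps = [], set()
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"{module}.{node.name}({args})")
        elif isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return sigs, deps

def find_cycles(graph: dict) -> list:
    """Flag mutually dependent module pairs -- the circular
    dependencies worth asking the AI (or a human) about first."""
    return sorted(
        (a, b)
        for a, targets in graph.items()
        for b in targets
        if a in graph.get(b, set()) and a < b
    )

# Toy modules (invented for illustration) with a circular dependency.
modules = {
    "billing": "from orders import get_order\ndef charge(order_id): ...",
    "orders": "from billing import charge\ndef get_order(order_id): ...",
}
graph = {name: extract_interface(src, name)[1] for name, src in modules.items()}
cycles = find_cycles(graph)  # [('billing', 'orders')]
```

A few hundred lines of signatures and edges like this can describe a codebase that's far too big to paste whole.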
Thanks for the advice! By the way, I never used tests. This may seem strange, but I have no explanation for it. I just didn't know how to use them. Perhaps it was an oversight. Yes, live and learn.
Exactly. And AI keeps improving in all these areas, too.
Exactly. And that's why 'AI can't do X' statements have a short shelf life. What AI can't do today, it will do tomorrow. The real question: what will humans do then?
So true. Only a few devs I know think like us. Heads are firmly in the sand
10%
The 90% number is misleading because it conflates "lines of code generated" with "working systems shipped." AI can generate code fast. It can't architect a system, debug edge cases reliably, or understand why the business logic needs to work a certain way.
What actually changes: the job shifts from writing code to designing systems and reviewing output. The developers who treat AI as a thinking partner instead of a replacement will be fine. The ones waiting for it to do everything will get stuck.
This is the clearest, most level-headed take in this entire thread. Conflating 'lines generated' with 'systems shipped' is exactly the mistake that leads to overhyped expectations and underdelivered value.
AI can't architect, can't debug edge cases, can't understand business logic: that's the 10% that's 100% of the value. And that 10% is still entirely human territory. No amount of prompt engineering changes that.
The framing of thinking partner vs. replacement is perfect. The developers who treat AI as a collaborator that handles the mechanical parts while they focus on design, tradeoffs, and context they're the ones who will thrive. The ones waiting for AI to do everything will find themselves irrelevant not because AI replaced them, but because they replaced themselves.
Your closing line says it all: "The ones waiting for it to do everything will get stuck." AI won't replace developers. But developers who use AI will replace those who don't.
Totally agree. "The 10% that's 100% of the value" — that framing deserves way more attention. We're seeing the same pattern in our reviews too. The tools that generate the most code aren't necessarily the ones shipping the best products.
Exactly: "the 10% that's 100% of the value" is the framing that should end every AI-code debate. We've been seduced by volume (more code must mean more progress) when in reality, most code is just noise.
Your observation is spot-on: the tools that generate the most code aren't shipping the best products. Because products aren't built by lines of code. They're built by decisions. And those decisions (what to build, what to leave out, when to stop) are still human.
Maybe we need a new metric: not "lines of code," but "decisions per line." The code that embodies a hard-won decision is worth 100x more than boilerplate. AI gives us more boilerplate. We still need to make the decisions.
"Decisions per line" — that's a metric worth stealing! Might have to reference that in our next review.
Steal away! That's what metrics are for. 😄
Would love to hear how it lands in your next review; curious whether it sparks different conversations than lines-of-code ever did. Let me know how it goes!
"Decisions per line" is a genuinely useful reframe. It shifts the conversation from productivity theater to actual engineering judgment — which is exactly where the value sits in an AI-augmented workflow. Curious whether you think that metric would change how teams evaluate AI coding tools too.
If 90% of the code is generated, the other 10% becomes 100% of the value. Our job shifts from being syntax mechanics to system architects. We stop typing and start deciding. The "What" and "Why" finally become more important than the "How." We at openclawcash.com understood that and have implemented it in our dev flow.
This is brilliantly phrased: "If 90% is generated, the other 10% becomes 100% of the value." That's the new math of software development. From syntax mechanics to system architects: that's the transition every developer needs to make.
"We stop typing and start deciding": that line captures the entire paradigm shift. The keyboard becomes less important than the brain. The "What" and "Why" finally take their rightful place above the "How."
Love that openclawcash.com has embedded this into your dev flow. This isn't just a process change; it's a mindset shift. And the teams that embrace it early will define the next decade of software.
This resonates hard — and not just for code. I'm seeing the exact same pattern with AI-generated content at scale.
I run a 100k+ page multilingual site where a local LLM generates the analysis text for every stock page across 12 languages. The "90% AI-generated" reality is already here for content. And the lesson is identical to yours: the value isn't in the generation, it's in knowing what to generate and whether the output is actually good.
Your filtering system story is my content pipeline story. Early on I let the LLM generate thousands of pages without deeply understanding the output patterns. Google crawled 51,000 of them and rejected them all — "crawled, not indexed." The AI produced content that looked right, passed basic checks, but lacked the quality signals that matter. I had built 50,000 pages I didn't really understand.
The fix was the same as Rohan's approach: slow down the acceptance. I now audit samples from every batch, check for factual accuracy against the actual financial data, and verify that the analysis says something a human analyst would actually find useful — not just something that reads well.
The 19% slower finding from the RCT you cited maps perfectly to content too. When I added human review checkpoints to the pipeline, throughput dropped but the pages that made it through started actually getting indexed. Slower acceptance, better outcomes.
Your point about the 10% being "everything that matters" is the key insight. For code it's architecture and debugging. For content it's editorial judgment and domain expertise. The AI handles volume — the human handles value.
This is a brilliant parallel: code and content, same pattern, same problem. The "crawled, not indexed" story with 51,000 pages is haunting. You built 50,000 pages you didn't actually understand. That's the perfect metaphor for AI's illusion of productivity.
Your key insight: "The AI produced content that looked right, passed basic checks, but lacked the quality signals that matter." This is AI's most dangerous trait: it's a master of plausibility, not truth. It generates things that seem correct, but correctness isn't the same as quality.
The fix you implemented (sampling, factual-accuracy checks, asking "would a human analyst find this useful?") is exactly the human-in-the-loop that turns volume into value. Slower acceptance, better outcomes. Worth every second of slowdown.
Question for you: How do you scale this audit process? As AI generates more content, human review becomes the bottleneck. Do you see a role for AI-assisted auditing (another AI checking the first AI's work)? Or does that risk creating a feedback loop of "plausible but not quite right" content? Also, have you noticed any language-specific issues? Does the AI perform differently in some of those 12 languages vs. others?
Great questions — scaling the audit is exactly where I'm stuck right now, so I'll share what's working and what isn't.
For scaling: I use a tiered approach. Tier 1 is automated — schema validation, data freshness checks (is the stock price from today or last month?), structural checks (does every page have the required sections?). This catches maybe 60% of issues without human eyes. Tier 2 is statistical sampling — I pull random pages from each batch, compare the generated analysis against the actual financial data from the API, and flag any batch where the error rate exceeds a threshold. Only Tier 3 is full human review, reserved for new template types or when Tier 2 flags something.
On AI auditing AI — I actually do this for one specific task: checking whether the generated text contradicts the numerical data on the same page. A second model reads the page and answers "does the analysis match the numbers?" This works because it's a narrow, verifiable question with a ground truth. But for subjective quality — "is this analysis actually useful to an investor?" — I agree with you, AI checking AI just creates a plausibility echo chamber. The second model has the same blind spots as the first.
Language-specific issues: absolutely yes. English output is consistently the strongest — richer vocabulary, more nuanced analysis. German and Dutch are solid (maybe 85% of English quality). Romance languages (French, Spanish, Portuguese) are decent but tend toward more generic phrasing. The real drop-off is in languages like Polish, Turkish, and Korean — the model produces grammatically correct text but the financial terminology gets wobbly. I've started using language-specific prompt templates with domain glossaries for the weaker languages, which helps but doesn't fully close the gap.
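As a concrete illustration of what a Tier-1 pass might look like in code — section names, field names, and thresholds below are all invented placeholders, since the real pipeline's schema isn't shown here:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical page schema -- sections and thresholds invented for illustration.
REQUIRED_SECTIONS = {"summary", "fundamentals", "analysis"}
MAX_DATA_AGE = timedelta(days=1)
MIN_ANALYSIS_CHARS = 200

def tier1_check(page: dict) -> list:
    """Automated Tier-1 audit: structural and freshness checks that need
    no human eyes. Returns a list of issues; an empty list means 'pass'."""
    issues = []
    # Structural check: does every page have the required sections?
    missing = REQUIRED_SECTIONS - set(page.get("sections", {}))
    if missing:
        issues.append(f"missing sections: {sorted(missing)}")
    # Data-freshness check: is the price from today or from last month?
    fetched = page.get("price_fetched_at")
    if fetched is None or datetime.now(timezone.utc) - fetched > MAX_DATA_AGE:
        issues.append("stale or missing price data")
    # Cheap quality proxy: suspiciously short analysis text is a red flag.
    if len(page.get("sections", {}).get("analysis", "")) < MIN_ANALYSIS_CHARS:
        issues.append("analysis section too short")
    return issues
```

Checks like these are dumb on purpose: they're cheap, deterministic, and catch the mechanical failures so the sampled human review can spend its time on judgment.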
This is incredibly practical and valuable; thank you for sharing such detail. Your tiered audit system is exactly how this should scale. Tier 1 catching 60% automatically, Tier 2 sampling for statistical confidence, Tier 3 only for exceptions—that's a model worth copying.
The AI-auditing-AI insight is crucial: it works for narrow, verifiable questions (like 'does the text match the numbers?') but fails for subjective quality because it just creates a plausibility echo chamber. That distinction matters: AI can check facts, but it can't judge value.
The language hierarchy you've observed is fascinating: English strongest, German/Dutch solid, Romance languages generic, and the drop-off in Polish/Turkish/Korean with wobbly terminology. This mirrors what many teams are seeing—multilingual AI claims often overpromise. Language-specific prompts and glossaries help, but as you said, the gap remains.
Question for you: Have you considered fine-tuning smaller models specifically for financial content in those weaker languages? Or using a two-model approach—one for generation, another (maybe a smaller, specialized model) just for terminology verification in those languages?
The two-model approach is actually where I'm heading next — and your framing is sharper than what I had in mind.
Right now I generate everything with a single Llama 3 instance, which works well for English and Germanic languages but struggles with financial terminology in Polish, Turkish, and Korean. Fine-tuning a smaller model per language sounds ideal in theory, but the economics don't work at my scale yet — I'd need labeled training data in each language, and the financial terminology corpus for Polish stock analysis doesn't exactly exist on Hugging Face.
What I'm leaning toward instead is closer to your two-model idea but cheaper: generate in the target language with the main model, then run a separate "terminology audit" pass where a second model (or even the same model with a different prompt) checks a curated list of ~200 financial terms per language against what was generated. Did it say "rendement", or did it hallucinate an English-Dutch hybrid? That's a verifiable question — exactly the kind of narrow check where AI auditing AI actually works.
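A first version of that terminology audit doesn't even need a model call: a naive substring check over the curated term list already catches outright hybrids. This is a sketch under my own assumptions about how the term lists are keyed:

```python
def audit_terminology(text: str, expected_terms: dict[str, str]) -> list[str]:
    """Narrow, verifiable check: for each concept the page should mention,
    the curated local term must appear and the English term must not leak in.
    Returns a list of flagged issues; an empty list means the pass is clean."""
    flags = []
    lowered = text.lower()
    for english, local in expected_terms.items():
        if local.lower() not in lowered:
            flags.append(f"missing local term for '{english}' (expected '{local}')")
        elif english.lower() in lowered:
            flags.append(f"English '{english}' leaked in alongside '{local}'")
    return flags
```

Anything this deterministic pass flags can then be escalated to the model-based check, keeping the expensive calls for the ambiguous cases.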
The interesting insight from running this at scale: the weaker languages don't just produce worse text — they produce differently wrong text. English hallucinations tend to be plausible but fabricated numbers. Polish hallucinations tend to be correct numbers wrapped in grammatically correct but semantically weird financial phrasing. Different failure modes need different audit strategies.
Your distinction between "AI can check facts but can't judge value" is the key principle here. I'm building the audit pipeline around it.
This is such a smart evolution of the idea. The terminology audit pass with a curated list of ~200 financial terms per language is brilliant: it's narrow, verifiable, and cheap. Exactly where AI-auditing-AI works best. And you're right, it doesn't need a separate fine-tuned model; a different prompt on the same model can do the job.
The most valuable insight: weaker languages fail differently. English hallucinations: plausible but wrong numbers. Polish hallucinations: correct numbers wrapped in terminologically weird phrasing. That's a profound observation. Each language has its own 'failure personality,' and audit strategies need to be tailored accordingly. A one-size-fits-all quality check will miss the Polish-specific issues.
Your point about fine-tuning economics is spot-on. Without labeled data, the two-model (generate + audit) approach isn't just practical; it might be more adaptable. You can evolve the audit prompts without retraining anything.
This whole approach (narrow, verifiable checks for facts and terminology, leaving subjective quality to humans) feels like the right architectural principle for AI content at scale. You're not just building a pipeline; you're building a philosophy around where AI can and can't be trusted.
Are you logging which terms fail most often per language? That data could eventually become the training set for a lightweight terminology model, if you ever decide fine-tuning becomes viable.
Yes — we log everything. Every term that fails validation gets tagged with the language, the failure type (hallucinated number, unit confusion, cultural mismatch), and the source sentence. After a few months we had enough signal to build per-language "known fragile" lists.
The interesting finding: the failure logs revealed clusters, not random noise. Japanese failures concentrate around Western financial concepts that don't map cleanly (like "market cap" vs the Japanese convention of expressing company size). Dutch failures cluster around decimal/comma notation bleeding into the prose. Portuguese has a whole category around formal vs informal financial register.
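The logging-and-clustering step the commenter describes can be sketched in a few lines: tag each failure, then aggregate per language into the "known fragile" lists. The record fields and threshold are assumptions on my part:

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass
class TermFailure:
    """One logged validation failure, as described: language, failure type, source."""
    language: str
    term: str
    failure_type: str      # e.g. "hallucinated_number", "unit_confusion", "cultural_mismatch"
    source_sentence: str

def fragile_terms(log: list[TermFailure], min_failures: int = 3) -> dict[str, list[str]]:
    """Aggregate the failure log into per-language 'known fragile' term lists:
    any term that failed at least `min_failures` times in a given language."""
    counts = defaultdict(Counter)
    for f in log:
        counts[f.language][f.term] += 1
    return {
        lang: sorted(term for term, n in c.items() if n >= min_failures)
        for lang, c in counts.items()
    }
```

Grouping on `(language, failure_type)` instead of just `language` would surface the clusters mentioned above (concept mismatches in Japanese, decimal-notation bleed in Dutch, register issues in Portuguese) directly from the same log.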
We haven't fed these back into fine-tuning yet — right now they drive the audit rules directly. But you're right that this is basically a curated training dataset being built organically. The question is whether fine-tuning on failure cases would make the base model worse at the languages where it already performs well. That's the experiment I want to run next.
The philosophy framing you mentioned resonates. What we're really building is a trust boundary map — here's where the model is reliable, here's where it needs a human check, and here's where it shouldn't be used at all. That map looks completely different per language, which I think most people deploying multilingual AI don't appreciate.
This is extraordinarily valuable work. With the systematic logging, tagging, and clustering of failures, you're not just building an audit system; you're building a taxonomy of AI failure modes across languages. That's a contribution to the field, not just your project.
The cluster findings are fascinating: Japanese struggles with Western financial concepts, Dutch with decimal bleeding, Portuguese with register. Each language has its own 'error fingerprint.' This kind of granular insight is exactly what's missing from most multilingual AI deployments. People assume 'multilingual' means equally capable in all languages; you're proving it means 'fails differently in each.'
The fine-tuning dilemma is real: training on failure cases might improve weak languages but degrade strong ones. Catastrophic forgetting isn't just theoretical; it's a practical risk. Have you considered language-specific adapters (LoRA) instead of full fine-tuning? That way you could improve Polish without touching the English weights.
And yes, 'trust boundary map' is the right framework: visualizing where the model is reliable, where it needs human oversight, and where it shouldn't be used at all. That map is different for every language, and most people deploying multilingual AI don't even know it exists. You're not just solving your own problem; you're creating a blueprint for responsible multilingual AI deployment.
AI can generate the code, but we still have to optimise the whole development process. The old model of just building the code is over; now we need to work on building the intelligence system.
"building the intelligence system": that's exactly the right framing.
the shift isn't from coding to not coding. it's from building features to building the system that builds features.
that's a fundamentally different skill set and most developers haven't even started making that transition yet.
Well said, that’s exactly the shift. It’s no longer about building features, but designing systems that can reliably produce and evolve them.
That requires a different skill set, architecture, workflows, and evaluation, something many developers are only beginning to explore.
You're absolutely right about what you're saying. AI will automate many of the tasks in IT that are done today and were done in the past. When that happens, the industry will turn into a world of developers who can understand the code, perform root cause analysis, and find solutions when the system goes down at 2 AM. A small community with high competence, capable of understanding the code and taking action, will continue to do the work. What worries me, though, is the situation of junior graduates like myself with very little experience (1 year) not being able to find a place in the industry. Opportunities aren't given to junior employees in the sector, so we can't gain the experience that senior employees gained, because we're never given the chance. At this point, what I'm wondering is: what will happen to juniors?
this is the part of the conversation that genuinely keeps me up at night.
the catch-22 is real: you can't get experience without opportunities, and opportunities are disappearing because "AI can do it."
but here's what i actually believe: the developers who will matter most in 5 years are the ones who understand systems deeply. and the fastest way to build that understanding right now isn't through junior roles; it's through building in public, contributing to open source, and writing about what you're learning.
the path is harder than it used to be. but it still exists. don't stop. 🙏
Contributing to open-source projects, developing your own projects, and sharing what you've learned on platforms like Medium are truly things that can make a difference. I hear this from different people in the industry. These are some of the best ways to make our knowledge visible.
But there's another side to the coin: there are many juniors and recent graduates like me. We all do similar things, and often we wait in line with hope. If more people find jobs by following this path, this approach will be considered "the right thing to do." But for those who do the same thing but can't find a job, this situation can turn into the thought, "there's no place for me in this industry."
The bigger and more disturbing reality is this: beyond these efforts, with the impact of AI, not only juniors but also many mid-level and senior-level professionals will struggle to find jobs in this industry over time. This is the real concern.
Because if this scenario occurs and a person's only skill is coding, these people may become unemployed and struggle to make ends meet.
Today, everyone is discussing the question, "Will AI take our jobs?" But perhaps that's not what we should be talking about anymore. The real question should be: If this happens, what will the unemployed IT workers do? Which areas will they be directed to? How will this transformation be managed?
We really need to start developing concrete ideas on this.
you've moved the conversation to exactly where it needs to go.
"will AI take our jobs" is the wrong question because it's passive. the right question, the one you're asking, is what happens to the people AI displaces, and who is responsible for managing that transition.
the honest answer is nobody has a good plan for this yet. the companies benefiting from AI productivity gains aren't the ones funding retraining programs. and "learn to prompt better" isn't a career transition strategy for a 45-year-old mid-level developer.
i don't have a clean answer. but i think you're right that we need concrete ideas and that starts with people asking the question you just asked, loudly and repeatedly, until someone with actual power has to respond.
i hope we find a valid answer soon