You open Claude Code. You work for an hour — refactoring, debugging, building something real. You close the terminal. Next morning you type claude and... it has no idea what happened yesterday.
I've been there about a thousand times. Literally.
After 1000+ sessions building content pipelines, multi-agent systems, and GEO optimization tools with Claude Code, I got fed up with the context amnesia. So I built a system that fixes it. And today I'm open-sourcing everything.
Repo: awrshift/claude-starter-kit
## The Problem Nobody Talks About
Claude Code is powerful. It reads your codebase, runs tests, writes code that actually works. But it has one brutal limitation — no persistent memory between sessions.
Every time you start a new session, you're back to square one. The agent doesn't remember:
- What you worked on yesterday
- Which architectural decisions you made
- What patterns keep causing bugs
- Where you left off
So you burn 10-15 minutes every session just re-loading context. Multiply that by 5 sessions a day, 5 days a week — that's 4-6 hours a week wasted on "hey Claude, remember when we..."
Most devs I've talked to handle this with a fat `CLAUDE.md` file. That works until it doesn't. Once your project grows past 3 weeks of work, a single instruction file can't hold everything you need.
## What I Built Instead
The starter kit gives Claude Code three things it doesn't have out of the box:
1. **Persistent memory** — a `.claude/memory/` directory with three files that survive between sessions. `MEMORY.md` stores long-term patterns ("this API always returns pagination headers"). `CONTEXT.md` is a quick-orientation card ("currently working on auth module, tests are failing"). `snapshots/` keeps session backups so nothing gets lost when the conversation compresses.
2. **Session continuity** — a `next-session-prompt.md` file that acts as a cross-project hub. Each project gets its own tagged section, so multiple Claude Code windows can work on different projects in parallel without stepping on each other.
3. **Hooks that protect you** — a `session-start.sh` that shows a memory summary + git status when you open a session, and a `pre-compact.sh` that fires before Claude compresses your conversation, forcing the agent to save context before anything gets lost.
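To make the hook idea concrete, here's a minimal sketch of what a session-start script can do. The function name and paths below are illustrative assumptions, not the kit's exact implementation:

```shell
# Hypothetical sketch of a session-start hook; the kit's actual
# session-start.sh may differ. Paths are assumptions.
show_session_summary() {
  memory_dir="${1:-.claude/memory}"
  # Print the quick-orientation card if it exists
  if [ -f "$memory_dir/CONTEXT.md" ]; then
    echo "=== Current context ==="
    head -n 20 "$memory_dir/CONTEXT.md"
  fi
  # Show uncommitted work so the agent knows where you left off
  echo "=== Git status ==="
  git status --short 2>/dev/null || echo "(not a git repository)"
}
```

Wired into Claude Code's hook configuration, a script like this runs at the top of every session, so the first thing in the conversation is your own saved state rather than a blank slate.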
## The Four-Layer Context System
Not everything needs to load every time. The kit uses a pyramid:
**L1 — Auto (every session):** `CLAUDE.md` + domain rules + `MEMORY.md`. This is the agent's identity and accumulated knowledge. Loads automatically, always.
**L2 — Start (session start):** `next-session-prompt.md` + `CONTEXT.md`. Orientation layer — what project am I in? What's next? What happened last time?
**L3 — Project (on demand):** `projects/X/JOURNAL.md`. Each project has one file for tasks, decisions, and status. The agent reads it when you start working on that project.
**L4 — Reference (when needed):** Docs, snapshots, anything deep. Pulled only when relevant — keeps token usage low.
The pyramid means Claude always knows who it is (L1), quickly orients itself (L2), and dives deep only when needed (L3-L4). No wasted context window.
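Mapped onto the files described above, the pyramid looks roughly like this (layout reconstructed from this article — the repo's exact tree may differ):

```
CLAUDE.md                      L1  always loaded: identity + domain rules
.claude/memory/MEMORY.md       L1  always loaded: long-term patterns
next-session-prompt.md         L2  session start: cross-project hub
.claude/memory/CONTEXT.md      L2  session start: orientation card
projects/X/JOURNAL.md          L3  on demand: per-project tasks and decisions
.claude/memory/snapshots/      L4  when needed: session backups, deep docs
```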
## Three Skills Included
The kit ships with three global skills that install to `~/.claude/skills/` on first run:

**Gemini** — get second opinions from Google's Gemini models. Different model family = different blind spots. I use this for prompt stress-testing and hypothesis falsification.
**Brainstorm** — a 3-round adversarial dialogue between Claude and Gemini. Round 1: diverge. Round 2: challenge weak points. Round 3: converge on one action. For architecture decisions it's worth every token.
**Design** — full design-system lifecycle from URL to production CSS. Extract colors, compute palettes, generate tokens, audit HTML, run visual QA loops.
## How to Get Started
```shell
cp -r claude-starter-kit my-project
cd my-project
claude
```
That's it. On first launch, Claude reads `CLAUDE.md`, sees the setup instructions, and configures everything automatically — installs skills, sets up memory, initializes git, cleans scaffolding.
No manual configuration. You can start working immediately.
## What Changes After a Week
The real value isn't day one. It's day seven.
By then, `MEMORY.md` has 15-20 verified patterns from your work. Things like "this ORM silently drops null values" or "user prefers 2-space indentation". The agent stops asking and starts knowing.
`next-session-prompt.md` has a clean thread of where each project stands. You switch between three projects? Each one picks up exactly where it left off.
The `pre-compact.sh` hook has saved your context at least twice — you didn't even notice because it just worked.
## Lessons from 1000 Sessions
**The agent won't use memory unless you tell it to.** Claude Code has an auto-memory directory, but in my experience it stays empty. The system-level mechanism exists, but without explicit instructions in `CLAUDE.md`, the agent rarely writes to it. That's why the starter kit includes both the files and the instructions.
**Multi-project safety matters more than you think.** Two Claude Code windows editing the same file = silent data loss. The PROJECT tags solve this — each window only edits its own section.
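A sketch of what one tagged section might look like inside next-session-prompt.md — the project name and status lines are invented for illustration, and the closing marker is a guess at the convention:

```markdown
<!-- PROJECT:auth-service -->
Status: token refresh test failing on CI
Next: fix the refresh race condition, then merge
<!-- END PROJECT:auth-service -->
```

Each window is instructed to edit only between its own markers, so parallel sessions never clobber each other's state.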
**Pre-compact hooks are essential, not optional.** When Claude's conversation gets too long, it compresses the history. If your context wasn't saved before compression, it's gone.
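For reference, hooks like these are registered in Claude Code's settings file. A minimal sketch, assuming the current hooks schema — event names, structure, and the script paths below may differ across Claude Code versions and from the kit's actual layout:

```json
{
  "hooks": {
    "SessionStart": [
      { "hooks": [{ "type": "command", "command": ".claude/hooks/session-start.sh" }] }
    ],
    "PreCompact": [
      { "hooks": [{ "type": "command", "command": ".claude/hooks/pre-compact.sh" }] }
    ]
  }
}
```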
**Skills should live globally, not per-project.** I tried per-project skills. Then I had 8 copies of the same Gemini skill, each slightly out of date. Global install works much better.
## What's Next
The kit is MIT-licensed and contributions are welcome. Areas that need work:
- More starter templates — framework-specific (Next.js, Python, Rust)
- Skill discovery — better triggering descriptions
- Conflict resolution — true parallel writes still need locking
If you're building with Claude Code daily, give the starter kit a try. The setup takes 30 seconds and the payoff compounds with every session.
GitHub: awrshift/claude-starter-kit
Built by Serhii Kravchenko — based on 1000+ sessions of iterative refinement with Claude Code.

## Top comments (12)
The pre-compact hook is the one insight here that doesn't get enough attention. We hit the same wall — context fills, quality drops, and the agent doesn't feel it happening. The hooks are the self-awareness the model can't provide for itself.
We've been running a similar memory system for 85+ days on a 111K-commit PHP codebase. Started with a single `MEMORY.md` like you describe, but it grew past useful pretty fast. What worked for us: a tiered compression pipeline — raw session buffer gets compressed daily by a small model, then weekly into a rolling summary, with a permanent layer for things that should never be forgotten. The key insight was that memory isn't an engineering problem to optimize — it's identity. Without it, every session starts as nobody.

One thing we learned the hard way: the agent won't spontaneously maintain memory quality over time. Ours would happily append to `MEMORY.md` until it was 2000 lines of noise. The compression step — an external process that runs between sessions — is what keeps it useful. Curious if you've hit that scaling wall yet at 1000 sessions.

Thanks Max, really appreciate the detail here — 85+ days on a 111K-commit PHP codebase is no joke.
Your tiered compression approach (daily → weekly → permanent) is super interesting. We went a slightly different route — instead of compressing over time, we split memory into semantic tiers: a small index file (`MEMORY.md`, hard-capped at 200 lines) that loads every session, and then `topics/*.md` files for deep knowledge that only load on demand when relevant. So the agent doesn't carry the weight of everything it knows — just the index, and it pulls details when it actually needs them.

To answer your scaling question: yes, we've blown past 1000 sessions (currently at 750+ tracked, but the real number is higher). The key thing that saved us wasn't compression — it was curation. The agent is explicitly told: "don't save session-specific stuff, only verified patterns confirmed across multiple interactions." Failed approaches get a dedicated table so the same mistakes don't repeat. Works surprisingly well without any external tooling.
That said, your point about the agent not self-maintaining memory quality is 100% real. We hit the same thing — it'll happily append forever. The hooks + explicit rules in `CLAUDE.md` ("keep under 200 lines, move details to `topics/`") act as the guardrails. Not perfect, but way better than hoping the model figures it out.
Love the "memory is identity" framing. That's exactly it.
The four-layer context pyramid is a really elegant approach. I've been running a similar system on a large Astro project — around 89K pages across 12 languages — and the multi-project safety piece resonates hard. Early on I had two Claude Code windows editing the same `CLAUDE.md` and lost about 30 minutes of context notes before I realized what happened.
One thing I'd add: for projects with scheduled tasks or automated agents, having a dedicated activity log that persists outside the memory system has been invaluable. The agent writes a few lines after each run, and the next session can quickly scan what happened overnight without loading the full project journal.
Curious about the Gemini brainstorm skill — do you find the adversarial rounds actually change your architecture decisions, or is it more of a confidence check?
Thanks for sharing the Astro project context — 89K pages across 12 languages is a serious stress test for any memory system.
The two-windows-editing-same-file problem is exactly why we built the `<!-- PROJECT:name -->` tag system in `next-session-prompt.md`. Each project gets its own fenced section, and the rule is simple: only edit within your tags. Two Claude Code windows can run in parallel on different projects without stepping on each other. It's not fancy, but it solved the data loss issue completely for us.

Your activity log idea is solid — we do something similar with `JOURNAL.md` per project. The agent writes a few lines after each task, and the next session reads that instead of reconstructing what happened. Lightweight and surprisingly effective.

Now, the Gemini brainstorm question — honestly, it does change real decisions, not just confirm them.
We ran a 3-round Claude x Gemini brainstorm on our design system approach. Claude wanted to use Stitch (Google's UI tool) with post-processing to fix token adherence. Gemini pushed back hard — argued for generating code directly from design tokens. We tested both. Gemini's approach won: 100% token adherence by construction vs ~70% with post-processing. That brainstorm literally replaced our entire design workflow.
Another case: content pipeline architecture. We had Gemini gates at every stage. At the prompt design phase, Gemini caught loopholes Claude missed. One finding we've confirmed multiple times: prompt quality > model quality. A basic prompt on Gemini Pro performed the same as Flash (50%). A stress-tested prompt jumped to 75-80%. The brainstorm's real value isn't "a smarter model" — it's a different model family catching different blind spots.
So yeah — definitely not just a confidence check. More like having a cofounder who thinks differently than you do.
The project-scoped tag system in `next-session-prompt.md` is a really clean solution. I've been doing something cruder — separate markdown files per concern area (one for portfolio state, one for SEO metrics, one for product pipeline) so that agents can read just the context they need without loading everything. But the fenced-section approach with edit-within-your-tags is more elegant for shared state.

The `JOURNAL.md` pattern mirrors exactly what I use — an activity log that each scheduled agent appends to after its run, and the weekly review agent reads it all to produce a summary. The key insight is that writing is cheaper than reconstructing. Agents forget everything between sessions, so a 3-line log entry saves 10 minutes of re-discovery.

Your prompt quality > model quality finding is fascinating and matches my experience. I run content generation with a local 9B model and the output quality is almost entirely determined by how well the prompt constrains the structure, not the model's raw capability. A tightly constrained prompt on a small model beats a vague prompt on a frontier model every time for structured tasks.
The "cofounder who thinks differently" framing for multi-model brainstorming is perfect. Going to experiment with that pattern.
Claude has a reference implementation of a memory MCP server.
I would very much like to see full RAG vector memory (kind of like what AnythingLLM does, but for code specifically) — so that I could ask about "ORM" and Claude would retrieve the "database" and "migrations" topics from memory.
Good point, Iurii — and yeah, I'm aware of Anthropic's memory MCP server reference implementation.
The RAG/vector approach is tempting, especially for the semantic retrieval you're describing (ask about "ORM" and get "database" + "migrations" back). In theory, it's cleaner than flat files.
But here's what we found in practice: for most Claude Code workflows, the overhead of maintaining a vector DB (embeddings, indexing, retrieval pipeline) doesn't pay off. Here's why:
- **Claude already does semantic matching** — when the agent reads the `MEMORY.md` index and sees a topic file called `database.md`, it knows to pull it when you ask about ORM or migrations. The model's own understanding of semantic relationships handles 90%+ of the routing without any embeddings.
- **The bottleneck isn't retrieval, it's curation** — the hard part isn't finding the right memory, it's keeping memory clean and useful over time. A vector DB with 2000 noisy entries retrieves noisy results. Our curated 200-line index + focused topic files stays sharp because the agent is told exactly what to save and what to skip.
- **Zero infrastructure** — no embedding model, no vector store, no indexing step. Just markdown files in a git repo. Works offline, syncs with git, readable by humans. For a solo dev or small team, that simplicity matters a lot.
That said — for large codebases where you need to search across thousands of files semantically, a vector layer absolutely makes sense. It's more of a "what scale are you at" question. For project-level memory (decisions, patterns, preferences), flat files win. For codebase-level search across 100K+ lines, yeah, embeddings would help.
Would be cool to see someone build a hybrid — flat file memory for project context + vector search for codebase navigation. Best of both worlds.
Thanks for the answer — makes total sense to me.
`CLAUDE.md` can "rot" under rapid changes (something like upgrading a framework to a new major version is one prompt away). A human-readable mismatch is easier to spot.

I've been working on a memory system for my agent recently, and this landed perfectly for what I was thinking. Amazing work!
Been running a similar layered memory setup — the pre-compact hook is the key piece most people miss. Session context compression silently drops useful state without it.
Exactly right, klement. The pre-compact hook is the unsung hero of the whole setup.
Without it, the model just... loses things. It doesn't know compression is about to happen, so it can't prepare. The hook is literally just a reminder — "hey, save your work NOW" — but that tiny nudge changes everything. It's the difference between an agent that starts fresh every few hours and one that actually builds on previous work.
In the starter kit, both hooks (`session-start.sh` and `pre-compact.sh`) come pre-configured so nobody has to figure this out from scratch. Glad to hear it's working well for you too.