Jitendra Devabhaktuni
How Synthetic Test Databases Replace Staging Snapshots

Stop Writing INSERT Scripts for Test Data

If you’ve been building products for a while, you’ve probably done this dance:

  • New feature.
  • New tables or columns.
  • Empty staging database.
  • “I’ll just write a few INSERT scripts to fake some rows…”

An hour later, you’ve got a wall of SQL, a half‑realistic dataset, and the quiet feeling that none of this is going to look like production anyway.

For years this was just “how it’s done.” Today, it’s a tax.

In this post, I want to lay out why hand‑crafted test data is breaking your velocity (and your tests), and what a better default looks like.


The hidden cost of hand‑written test data

The obvious cost is time: senior engineers spending hours writing INSERTs, CSVs, or seed scripts instead of shipping features.

But the deeper costs are more dangerous:

  • Your data is too clean

    Synthetic in the worst way: perfect dates, perfect enums, no NULL hell. Your tests pass beautifully on this happy‑path dataset and then fall over in production.

  • Your relationships drift

    You add a new table, a new foreign key, or a new join. Did you remember to update every seed script, every fixture, every CSV? If not, you end up with orphan rows and tests that silently stop covering real flows.

  • Nobody owns it

    Test data becomes tribal knowledge. One person “just knows” which script to run or which dump to restore. When they’re busy (or leave), test environments quietly rot.

  • Compliance risk

    To avoid writing data by hand, teams often copy masked production snapshots into staging. Masking is rarely complete. A few columns slip through, and now PII is sitting in places it shouldn’t.

Individually these feel like minor annoyances. Together, they create a slow, constant drag on every release.


What “good” test data actually means

When people say “we need realistic test data,” they usually mean more than just random rows.

A useful test database has at least three properties:

  1. Referential integrity

    Foreign keys are valid, constraints are respected, and joins behave the way they do in production.

  2. Realistic distributions

    Data “feels” like production: skewed, messy, correlated. Not everyone signs up on the last day of the month. Not every account has exactly three users.

  3. Designed edge cases

    You see the weird stuff on purpose:

    • users with 0 orders,
    • accounts with 1000+ invoices,
    • subscriptions with overlapping billing periods.

Most hand‑written test data does okay on (1), fails on (2), and completely ignores (3). You get just enough to demo the happy path, but not enough to trust your system.
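To make (2) and (3) concrete, here's a minimal sketch (using only Python's standard library, with illustrative names of my own choosing) of the difference between "random rows" and production-shaped data: a heavy-tailed distribution of orders per user, plus edge cases planted deliberately instead of hoped for.

```python
import random

random.seed(42)  # repeatable: the same "world" every run

def orders_per_user(n_users):
    """Skewed, production-like counts: most users have few orders,
    a handful have many (rough power law via paretovariate)."""
    counts = [int(random.paretovariate(1.5)) - 1 for _ in range(n_users)]
    # Designed edge cases, inserted on purpose (property 3):
    counts[0] = 0      # a user with zero orders
    counts[1] = 1000   # a heavy-tail account
    return counts

counts = orders_per_user(500)
```

A uniform `random.randint(0, 10)` would pass the same tests but hide every tail behavior; the point is that the shape of the data is part of the spec.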


Why staging snapshots aren’t the answer

The usual response is: “We’ll just use a masked copy of production.”

That sounds great until:

  • Masking doesn’t catch everything, and suddenly you have PII in non‑prod.
  • Schema changes make your anonymization scripts brittle.
  • Refreshing the snapshot becomes a mini‑project every time you want to test a new flow.
  • You can’t easily generate new edge cases on demand, because the data is whatever production happened to look like last week.

Staging snapshots are a snapshot of the past. Most teams need a generator for the future.


A better default: synthetic relational test databases

The alternative is to treat test data as something you generate on demand, not something you “hope is still usable.”

The workflow looks more like this:

  1. Describe the domain you care about (in schema or plain English).
  2. Generate a full relational database that respects your constraints.
  3. Tune volumes, distributions, and edge cases.
  4. Regenerate whenever the schema changes.

You get:

  • Consistent, repeatable datasets for local dev, CI, demos, and staging.
  • No real customer records outside production.
  • The ability to intentionally create “weird” worlds to stress your system.
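The core of that workflow fits in a few lines even without a dedicated tool. Here's a hedged sketch (SQLite in-memory, table names invented for illustration) of step 2: generate parent rows first, then children that only ever reference valid parent ids, so referential integrity holds by construction.

```python
import random
import sqlite3

random.seed(7)
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce FKs in SQLite
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    email TEXT NOT NULL
);
""")

# Parents first, children second -- no orphan rows possible.
company_ids = []
for i in range(10):
    cur = conn.execute("INSERT INTO companies (name) VALUES (?)", (f"co-{i}",))
    company_ids.append(cur.lastrowid)

for company_id in company_ids:
    for u in range(random.randint(1, 5)):
        conn.execute(
            "INSERT INTO users (company_id, email) VALUES (?, ?)",
            (company_id, f"user{u}@co-{company_id}.test"),
        )

orphans = conn.execute("""
    SELECT COUNT(*) FROM users u
    LEFT JOIN companies c ON c.id = u.company_id
    WHERE c.id IS NULL
""").fetchone()[0]
```

Because the generator owns the whole graph, "regenerate whenever the schema changes" means rerunning one script, not reconciling a pile of CSVs.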

This is the mental model behind SyntheholDB: describe the database you wish you had for testing, then generate it instead of hand‑coding it.


What this looks like in practice

Here’s a simple example.

Imagine you’re testing a B2B SaaS app. You might say:

“I need 200 companies, 1–25 users per company, a mix of free and paid plans, and at least 20 companies with more than 50 invoices each.”

With the traditional approach, you’d:

  • Create CSVs for companies, users, subscriptions, invoices.
  • Write scripts to import them.
  • Fix foreign keys when something doesn’t line up.
  • Iterate until the data “looks okay.”

With a synthetic test database generator, you:

  • Express that requirement once.
  • Let the tool generate all the tables and relationships.
  • Re‑run when you change your schema or want a different scenario.

The output becomes an asset: you can spin up identical worlds for local dev, QA, and demos, without anyone touching INSERT scripts.
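That plain-English requirement translates almost directly into a small generator. A sketch, again with invented names and a simplified shape (invoices as a count rather than full rows), showing how the constraints become assertions you can actually check:

```python
import random

random.seed(1)

PLANS = ["free", "pro", "enterprise"]

companies = []
for cid in range(200):                       # "200 companies"
    n_users = random.randint(1, 25)          # "1-25 users per company"
    plan = random.choices(PLANS, weights=[6, 3, 1])[0]  # mostly free
    # Guarantee the edge case instead of hoping randomness hits it:
    # the first 20 companies are invoice-heavy ("at least 20 with > 50").
    n_invoices = random.randint(51, 300) if cid < 20 else random.randint(0, 50)
    companies.append({
        "id": cid,
        "plan": plan,
        "users": [f"{cid}-{u}" for u in range(n_users)],
        "invoices": n_invoices,
    })

heavy = [c for c in companies if c["invoices"] > 50]
```

Notice that the "at least 20" clause is enforced, not sampled: requirements about rare configurations should be guarantees in the generator, or your tests only cover them by luck.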


How to start (even without a fancy tool)

Even if you don’t use SyntheholDB or any specific product, you can still move towards this pattern.

A few practical steps:

  1. Define your core entities and relationships explicitly

    Write down the tables and constraints that matter most for testing. This becomes your “test world” spec.

  2. Stop editing data directly in the DB

    Always go through a generator, script, or seeding process. No more manual tweaks in staging.

  3. Design edge-case scenarios as first-class citizens

    Don’t wait for production to surprise you. Decide up front which “weird” configurations your system must handle and encode them.

  4. Separate test data from real data in your mental model

    Production is for truth. Testing environments are for exploring possibilities.
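Steps 1 and 3 above can be combined into a single artifact: a declarative "test world" spec where the weird configurations are named entries, not tribal knowledge. A minimal sketch (the spec shape and `expand` helper are illustrative, not any tool's API):

```python
# Edge cases declared up front, so every regenerated database includes them.
SCENARIOS = [
    {"name": "zero_orders", "users": 1, "orders_per_user": 0},
    {"name": "heavy_invoices", "users": 3, "orders_per_user": 400},
    {"name": "overlapping_billing", "users": 2, "orders_per_user": 5,
     "billing_periods": [("2026-01-01", "2026-02-15"),
                         ("2026-02-01", "2026-03-01")]},
]

def expand(scenarios):
    """Turn the declarative spec into concrete rows a seeder can load."""
    rows = []
    for s in scenarios:
        for u in range(s["users"]):
            rows.append({
                "scenario": s["name"],
                "user": u,
                "orders": s["orders_per_user"],
            })
    return rows

rows = expand(SCENARIOS)
```

The spec doubles as documentation: a new teammate can read `SCENARIOS` and know exactly which configurations the system is expected to survive.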

Once you think in generators instead of snapshots, the value of synthetic relational test data becomes obvious.


Closing thought

If you’re still writing INSERT scripts by hand in 2026, it’s not because you enjoy it.

It’s because the alternative feels like “too much work right now.”

The truth is the opposite: the more your product grows, the more expensive hand‑crafted test data becomes.

Whether you roll your own generator or opt for a tool, it’s worth asking:

What would it look like if test databases were never a bottleneck again?
