Why Faker, Mockaroo, and SDV Aren't Enough — the synthetic-data maturity curve for fintech engineering teams

WealthSchema Staff | Synthetic data engineering | May 9, 2026 | 8 min read

Five teams in a fintech reach for synthetic data from five different directions: compliance needs records that satisfy an examiner; ML needs training data that won't trigger a privacy review; QA needs edge cases the supervisory engine actually fires on; sales engineering needs demo data that doesn't leak; platform needs load-test fixtures that don't melt staging.

Each team starts with whatever's nearest to hand — Faker, Mockaroo, a hand-rolled CSV, an open-source library like SDV. None of those tools are wrong. They sit at different points on a four-stage maturity curve, and most fintech teams don't notice the curve exists until the third or fourth iteration of the corpus has outgrown the third or fourth tool.

The map below names the four stages, what each one is good for, what it can't do, and the specific signals that say you've already outgrown the stage you're at.

The four stages of synthetic-data maturity

The progression looks like this:

Stage 1: Mock data. Fixed values, hand-curated. A CSV in tests/fixtures/. A seed_data.json checked into the repo.
Stage 2: Randomized data. Generated with templates. Faker and Mockaroo live here. Realistic-looking strings and numbers, no statistical relationships between fields, no household coherence, no longitudinal structure.
Stage 3: Schema-preserving synthesis. Generated by ML models trained on real production data. Tonic.ai, MOSTLY AI, Hazy, Synthesized, Gretel, SDV. Statistical fidelity to the source distribution; varying degrees of privacy guarantees.
Stage 4: Archetype-driven generation. Households, accounts, and transactions modeled from a domain ontology rather than a source dataset. Calibrated against published distributions (SCF, IRS SOI, Census), with explicit life-stage and behavioral coherence. WealthSchema is here.

Each stage solves problems the previous stage couldn't. Each stage also has costs the previous stage didn't impose. Most teams move up the curve when a specific failure forces the move — not when a roadmap planning exercise prompts it.

Stage 1: Mock data

What it looks like. A fixtures.json file with five users, two of whom have transactions. A golden_test_data.csv with thirty rows. A unit test that asserts user.balance == 1000.
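A minimal sketch of that pattern, for illustration only. The inline JSON string stands in for a checked-in fixtures.json, and the names FIXTURES_JSON, load_users, and test_balance are hypothetical.

```python
import json

# Hypothetical stand-in for a checked-in fixtures.json file.
FIXTURES_JSON = """
[
  {"id": 1, "name": "Test User A", "balance": 1000, "transactions": [{"amount": -25}]},
  {"id": 2, "name": "Test User B", "balance": 250, "transactions": []}
]
"""


def load_users(raw: str) -> dict:
    """Index fixture users by id."""
    return {user["id"]: user for user in json.loads(raw)}


def test_balance() -> None:
    users = load_users(FIXTURES_JSON)
    # The assertion is pinned to a literal stored in the fixture, so editing
    # the fixture for one test can silently break another.
    assert users[1]["balance"] == 1000


test_balance()
```

The coupling between the literal in the test and the literal in the file is what makes the fixture brittle once many tests share it.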

What it's good for. Unit tests. Smoke tests. The first six months of any product where the data model is still in flux and you need something you can iterate against.

What breaks first. The day someone needs to test a code path that requires data the fixtures don't have. A junior engineer adds a special case to seed_data.json to cover it. Six months later there are forty special cases. The "fixtures" file is now load-bearing for production behavior, no one knows which fields can change without breaking which tests, and the file is too brittle to update.

The first migration most teams make is from Stage 1 to Stage 2 — usually triggered by a ticket that says something like "we need to test 200 users, not 5."

Stage 2: Randomized data

What it looks like. A script that creates a Faker instance and calls fake.name(), fake.address(), fake.ssn(), and fake.random_int(min=10000, max=500000) in a loop. Or a Mockaroo schema with thirty fields, each generated from a list or a regex.

What it's good for. Generating volume. Filling staging databases. Demo environments where realism doesn't matter beyond "names look like names." Load testing where the shape of the data matters more than what's in it.

What it can't do.

The fundamental limit of randomized data is that fields don't know about each other. A 28-year-old has $4M in a Roth IRA. A retiree files Form 8615. A single filer claims a QBI deduction on W-2 income. A mortgage holder has a 95% LTV with $2M in liquid assets. No single value is impossible on its own; each becomes nonsensical given the rest of the record.
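The effect is easy to reproduce. The sketch below uses Python's standard random module in place of Faker so it runs without dependencies; the field names, value ranges, and flag thresholds are illustrative assumptions, not rules from any real engine.

```python
import random


def random_household(rng: random.Random) -> dict:
    """Stage 2 style: every field is drawn independently of the others."""
    return {
        "age": rng.randint(18, 90),
        "roth_ira_balance": rng.randint(0, 5_000_000),
        "liquid_assets": rng.randint(0, 5_000_000),
        "mortgage_ltv": round(rng.uniform(0.0, 0.97), 2),
    }


def incoherence_flags(hh: dict) -> list[str]:
    """Cross-field checks; the thresholds are illustrative assumptions only."""
    flags = []
    if hh["age"] < 30 and hh["roth_ira_balance"] > 1_000_000:
        flags.append("young_with_large_roth")
    if hh["mortgage_ltv"] > 0.90 and hh["liquid_assets"] > 1_000_000:
        flags.append("high_ltv_with_large_liquid_assets")
    return flags


rng = random.Random(42)
households = [random_household(rng) for _ in range(1_000)]
flagged = [hh for hh in households if incoherence_flags(hh)]
print(f"{len(flagged)} of {len(households)} records fail a cross-field check")
```

Every individual column would pass a range check. Only checks that look at two fields at once expose the problem.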

For most non-fintech use cases, this doesn't matter. For fintech, it's the whole problem. A planning algorithm that runs against incoherent households produces incoherent recommendations. A compliance test that runs against incoherent households doesn't actually exercise the rules — because the rules are calibrated against patterns that randomized data can't generate.

The other thing Stage 2 can't do is longitudinal coherence. A household's net worth at month 36 should bear some relationship to its net worth at month 1. Income should compound. Tax-deferred accounts should accumulate. Mortgage balances should amortize. Faker doesn't do time. Mockaroo doesn't do time.
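As a rough illustration of what "doing time" means for one field, the sketch below compares a mortgage balance series built from the standard fixed-rate amortization formula with a series of independent monthly draws. The loan terms and function names are assumptions chosen for the example.

```python
import random


def amortized_balances(principal: float, annual_rate: float,
                       months: int, n_snapshots: int) -> list[float]:
    """Month-end balances for a fixed-rate loan using the standard annuity payment."""
    r = annual_rate / 12
    payment = principal * r / (1 - (1 + r) ** -months)
    balances, balance = [], principal
    for _ in range(n_snapshots):
        balance = balance * (1 + r) - payment  # accrue interest, then pay
        balances.append(round(balance, 2))
    return balances


def is_amortizing(balances: list[float]) -> bool:
    """A coherent mortgage balance never rises from one month to the next."""
    return all(later <= earlier for earlier, later in zip(balances, balances[1:]))


coherent = amortized_balances(400_000, 0.06, months=360, n_snapshots=36)

# Stage 2 style: each month's balance is drawn with no memory of the last.
rng = random.Random(7)
independent = [rng.uniform(300_000, 400_000) for _ in range(36)]

print(is_amortizing(coherent))
print(is_amortizing(independent))
```

The amortized series passes the check. The independent draws fail it, because nothing ties month 2 to month 1.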

Stage 3: Schema-preserving synthesis

What it looks like. You point a tool at your production database. The tool trains a generative model on it (historically GANs and variational autoencoders, more recently transformer-based approaches). The tool emits synthetic records that preserve the joint distribution of the source.

What it's good for. Data masking for non-production environments. Privacy-compliant analytics on derivative datasets. ML training where you want to maintain statistical relationships without exposing the underlying records. Cross-org data sharing within a regulated entity.

This is a real, valuable, technically impressive class of tooling. Tonic.ai, MOSTLY AI, Gretel, Hazy, Synthesized, K2view's synthesis layer, and SDV all live here, with meaningful differences in their privacy guarantees, their generation approach, and the use cases they target.

What it presupposes that fintech teams often can't provide.

Schema-preserving synthesis requires a source schema. Specifically, it requires a source dataset large enough, well-labeled enough, and clean enough to train a generative model on.

This is where fintech teams hit the wall. The tax-loss harvesting team needs 500 households with concentrated stock positions, lot-level cost basis, cross-account wash-sale conflicts, and QSBS attestation chains — and they need them before they have customers. The Reg BI compliance team needs households that exhibit the exact fact patterns examiners cite — concentrated holdings, age 75+, recent inheritance, cognitive decline markers — and "exact fact patterns examiners cite" isn't a query you can run against your production database, because your production database doesn't have those flags.

Schema-preserving synthesis assumes you have the data and want a privacy-safe copy of it. Fintech teams frequently don't have the data and need a defensible substitute.

Schema-preserving synthesis is excellent at preserving fidelity and useless at engineering coverage. Fintech compliance and product validation depend almost entirely on coverage.

If your production database has 50,000 households and three of them have QSBS positions, a synthetic dataset of the same size will have approximately three QSBS positions, which is not enough to test a QSBS engine. The model can only sample from the distribution it was trained on. It cannot expand the distribution to cover edge cases the source didn't contain in volume.
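The arithmetic can be sketched in a few lines, under the simplifying assumption that the synthesizer reproduces the source rate faithfully. The target of 60 rare-class records is chosen only for illustration, and the function names are hypothetical.

```python
import math

SOURCE_HOUSEHOLDS = 50_000
SOURCE_QSBS = 3


def expected_rare_count(n_synthetic: int) -> float:
    """Expected rare-class records when sampling at the source rate."""
    return n_synthetic * SOURCE_QSBS / SOURCE_HOUSEHOLDS


def records_needed(target_count: int) -> int:
    """Synthetic records needed for the expected rare count to reach the target."""
    return math.ceil(target_count * SOURCE_HOUSEHOLDS / SOURCE_QSBS)


print(expected_rare_count(50_000))  # 3.0
print(records_needed(60))           # 1000000
```

Oversampling to a million records produces more QSBS rows in expectation, but they are still variations on the three the model saw.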

Stage 4: Archetype-driven generation

What it looks like. Instead of training a model on a source dataset, you specify the population from the top down. You define archetypes — "dual-income tech employee with equity," "single parent EITC-eligible," "founder with QSBS," "retiree managing IRMAA tiers" — each with a specified life stage, income range, asset mix, tax exposure, and behavioral profile. You calibrate each archetype against published distributions (the Federal Reserve's Survey of Consumer Finances, IRS Statistics of Income, Census ACS, BLS data) so the population statistics are credible. You enforce coherence rules: income → savings rate → asset mix → tax exposure → cash flow → eventual net worth at month 96.
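A minimal sketch of the idea, not WealthSchema's actual schema: the Archetype fields, the ranges, and the simplified derivation chain below are all illustrative assumptions. In particular, the ranges are placeholders and are not calibrated against the SCF or any other published source.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Archetype:
    """Hypothetical archetype spec; fields and ranges are illustrative."""
    name: str
    age_range: tuple[int, int]
    income_range: tuple[int, int]
    savings_rate_range: tuple[float, float]
    equity_share_range: tuple[float, float]


def generate_household(archetype: Archetype, rng: random.Random) -> dict:
    """Derive each field from the ones before it instead of drawing independently."""
    age = rng.randint(*archetype.age_range)
    income = rng.randint(*archetype.income_range)
    savings_rate = rng.uniform(*archetype.savings_rate_range)
    years_saving = max(age - 22, 0)                  # assumed career start at 22
    invested = income * savings_rate * years_saving  # ignores growth, for brevity
    equity_share = rng.uniform(*archetype.equity_share_range)
    return {
        "archetype": archetype.name,
        "age": age,
        "income": income,
        "savings_rate": round(savings_rate, 3),
        "invested_assets": round(invested),
        "equity_value": round(invested * equity_share),
    }


tech_employee = Archetype(
    name="dual-income tech employee with equity",
    age_range=(28, 45),
    income_range=(180_000, 450_000),
    savings_rate_range=(0.10, 0.30),
    equity_share_range=(0.40, 0.80),
)

rng = random.Random(2026)
print(generate_household(tech_employee, rng))
```

Because assets are computed from income, savings rate, and age, a record like the 28-year-old with $4M in a Roth IRA cannot come out of this generator unless the archetype says it should.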

WealthSchema's 71-archetype, 1,451-household corpus is built this way. So is the smaller end of what some boutique research shops produce for individual clients.

Archetypes: 71. Each typed, life-stage-aware, SCF-calibrated.
Households: 1,451. Master Corpus, every overlay populated.
Snapshots per household: 96 (32 months × 3 longitudinal chunks).
Bundles: themed packs plus a full-corpus master.

What this enables that the previous stages cannot.

Coverage by design. If you need 130 Reg BI suitability test households with concentrated holdings, age 75+, and cognitive markers, you specify them — you don't sample for them. If you need 60 founders satisfying the QSBS five-year holding period and the gross-asset test at issuance, you generate them. If you need 50 crypto-heavy households with DeFi LP positions and 1099-DA reconciliation gaps, you generate those.
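The difference between specifying and sampling can be sketched as follows. The 40% threshold for "concentrated", the value ranges, and the function names are assumptions made for the example; a real specification would come from the compliance team.

```python
import random


def reg_bi_test_household(rng: random.Random) -> dict:
    """Construct one household inside the requested fact pattern (illustrative ranges)."""
    portfolio = rng.randint(500_000, 5_000_000)
    top_share = rng.uniform(0.45, 0.90)  # drawn above the assumed 40% threshold
    return {
        "age": rng.randint(75, 95),
        "portfolio_value": portfolio,
        "top_holding_value": round(portfolio * top_share),
        "cognitive_marker": True,
    }


def matches_pattern(hh: dict) -> bool:
    """The same fact pattern written as a filter, as a sampling approach would use it."""
    concentrated = hh["top_holding_value"] / hh["portfolio_value"] >= 0.40
    return hh["age"] >= 75 and concentrated and hh["cognitive_marker"]


rng = random.Random(130)
corpus = [reg_bi_test_household(rng) for _ in range(130)]

# Every generated record is in the pattern; none are discarded.
assert all(matches_pattern(hh) for hh in corpus)
print(len(corpus))  # 130
```

A sampling approach would run matches_pattern over a large population and keep whatever survives. Here the requested count is the output count.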

Domain coherence by construction. The archetype isn't a sampled point on a high-dimensional manifold; it's a typed, documented, life-stage-aware household with internal logic that domain experts can audit. Every field has a derivation. Every relationship is explicit. Every overlay (tax law year, geography, conditional demographic populations) is documented.

Longitudinal structure. Households don't just exist at a point in time — they have 96 monthly snapshots, with income, expenses, asset values, and tax events that evolve coherently across the timeline.
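One way to picture this is a generator in which each month's snapshot is computed from the previous one. This is a sketch under stated assumptions: the raise and return rates are placeholders, the function name is hypothetical, and a real corpus would track many more fields.

```python
def monthly_snapshots(
    start_income: float,
    savings_rate: float,
    annual_raise: float = 0.03,   # assumed
    annual_return: float = 0.05,  # assumed
    months: int = 96,
) -> list[dict]:
    """Evolve one household month by month so each snapshot follows from the last."""
    snapshots, balance, income = [], 0.0, start_income
    for month in range(1, months + 1):
        if month > 1 and (month - 1) % 12 == 0:
            income *= 1 + annual_raise  # raise at each anniversary
        contribution = income / 12 * savings_rate
        balance = balance * (1 + annual_return / 12) + contribution
        snapshots.append({
            "month": month,
            "annual_income": round(income, 2),
            "balance": round(balance, 2),
        })
    return snapshots


timeline = monthly_snapshots(start_income=120_000, savings_rate=0.15)

# Split into three 32-month chunks, mirroring a 32 x 3 layout.
chunks = [timeline[i:i + 32] for i in range(0, len(timeline), 32)]
print(len(timeline), [len(c) for c in chunks])  # 96 [32, 32, 32]
```

Because the state carries forward, month 96 is a consequence of months 1 through 95 rather than an independent draw.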

Most importantly: defensibility under examination. When an SEC or FINRA examiner asks how the test corpus was built, "we sampled from a model trained on production data" is an awkward answer if the production data is the thing under review. "We generated it from a documented archetype specification calibrated against the SCF" is a much better answer, and one that examiners generally accept rather than challenge.

What Stage 4 can't do. It can't replace your production data. It's not a privacy-safe copy of your customers; it has no relationship to your customers. For use cases where the goal is "let our analytics team work on a privacy-safe copy of our real customer base," Stage 3 is the right answer, not Stage 4. The two stages solve adjacent problems, not the same problem.

How to tell which stage you're at

A few diagnostic questions, in roughly the order teams encounter them:

Diagnostic — which stage are you operating at?

Can a developer add a new test case without modifying a fixtures file? (No → Stage 1)
When you generate 1,000 records, do they have internally consistent income / asset / tax relationships? (No → Stage 2)
When a domain expert audits a record, can you explain why each field has the value it does? (No → Stage 2 or Stage 3 with a generative-model black box)
Can you generate 100 records that match a specific compliance fact pattern (e.g., Reg BI Care Obligation triggers)? (No → Stage 3)
Can your test corpus exercise a tax-law change that takes effect next year? (No → Stage 2 or 3)
If an examiner asked how the test corpus was constructed, would the answer hold up? (No → time for Stage 4)
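For teams that want to run the checklist in a review, it can be encoded as a small lookup. This is only a sketch: the question wording is abbreviated from the list above, and the names DIAGNOSTICS and first_gap are hypothetical.

```python
# Each entry: (question, stage indicated when the answer is "no").
DIAGNOSTICS = [
    ("New test case without editing a fixtures file?", "Stage 1"),
    ("Internally consistent income / asset / tax relationships?", "Stage 2"),
    ("Can every field value be explained to a domain expert?", "Stage 2 or 3"),
    ("100 records matching a specific compliance fact pattern?", "Stage 3"),
    ("Corpus can exercise next year's tax-law change?", "Stage 2 or 3"),
    ("Construction would hold up with an examiner?", "time for Stage 4"),
]


def first_gap(answers: list[bool]) -> str:
    """Return the stage signalled by the first 'no', walking the questions in order."""
    for (question, stage), answer in zip(DIAGNOSTICS, answers):
        if not answer:
            return f"{stage}: {question}"
    return "No gaps flagged"


print(first_gap([True, True, True, False, True, True]))
```

Walking the questions in order matters, since an earlier "no" usually explains the later ones.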

The last question tends to drive more migrations to Stage 4 than any other.

Where each tool actually fits

We publish head-to-head comparisons for each of the major synthetic-data tools, but here's the rough mapping to the maturity curve:

Tool	Stage	Best fit
Faker, Mockaroo, fixtures	1-2	Volume + smoke tests. Wrong tool the moment domain coherence matters.
SDV (open source)	3	When you have a source dataset and want a transparent statistical clone.
Tonic.ai, Synthesized, Delphix, K2view, Privitar	3 (enterprise)	Schema-preserving synthesis with masking, provisioning, and platform features.
MOSTLY AI, Gretel, Hazy, Howso	3 (ML-first)	Privacy-preserving generation with formal privacy guarantees.
WealthSchema (and vertical specialists)	4	Archetype-driven, domain-calibrated, coverage-by-design.

The right answer for many fintech teams is "Stage 3 and Stage 4." A schema-preserving tool for masking production data in non-production environments; an archetype-driven corpus for compliance, planning-engine, and ML-validation work that production data can't safely or completely cover.

The migration that matters

Three triggers move teams up the curve, every time: a Reg B / SR 11-7 / Care-Obligation test where randomized data fails the examiner; an ML training run where the production system doesn't carry the rare class at sufficient density; a sales demo that has to look real without leaking a real customer. Each trigger surfaces the same underlying gap — the test, the model, or the demo needs coverage the previous stage couldn't engineer.

When the trigger hits, the wrong question is "which Stage 2 or Stage 3 tool should we evaluate next." The right question is whether the data problem at hand is best served by sampling from a distribution you have or constructing a population you specify. Stages 2 and 3 do the first. Stage 4 does the second. The two answer different questions, and a team that picks the wrong one re-runs the migration eighteen months later.

Key takeaways

Synthetic data is not one product category — it's a four-stage curve from hand-rolled fixtures to archetype-driven generation, and each stage solves problems the others can't.
Faker / Mockaroo break the moment fields need to know about each other (which is most of fintech).
Schema-preserving synthesis (Tonic, MOSTLY AI, Gretel, SDV) preserves the distribution you have; it cannot engineer coverage of the distribution you don't.
Archetype-driven generation produces compliance-defensible, life-stage-coherent, longitudinally structured corpora, at the cost of having no relationship to any one firm's customers.
Most mature fintech stacks use Stage 3 and Stage 4 together: masked production for non-prod environments, archetype-driven for validation and ML coverage.
