Synthetic financial data, explained for engineering leaders

WealthSchema Staff | Synthetic data, R&D | May 8, 2026 | 5 min read

Engineering leaders ask us a version of the same question every week. "We need realistic test data for the new credit engine / robo-advisor / lending model. The compliance team won't let us use a copy of production. What's the right shape of synthetic to look at?"

This article is the answer we wish we had had three years ago: a definition compliance can sign off on, the four architectures every production vendor uses (and the bug class each ships outside its band), and the eight questions we now put in front of every prospective buyer before the procurement conversation starts.

What synthetic financial data actually is

Synthetic financial data is data that was generated, not collected. It describes households, balance sheets, transactions, lots, statements, claims, and policies that look and behave like the real ones your engine has to handle, but that have no causal link to any real person, account, or institution. The records are internally consistent — assets minus liabilities equals net worth, gross income minus deductions equals taxable income, every dollar moved has a corresponding ledger entry — but the entities described do not exist in any registry, custodial system, payroll database, or property record.
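The accounting identities named above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical flat `Household` record; the field names are illustrative and not taken from any vendor schema. Amounts are stored as integer cents so that equality is exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Household:
    # Hypothetical record layout; all amounts are integer cents.
    assets: int
    liabilities: int
    net_worth: int
    gross_income: int
    deductions: int
    taxable_income: int


def broken_identities(h: Household) -> list[str]:
    """Return the names of the accounting identities this record violates."""
    broken = []
    if h.assets - h.liabilities != h.net_worth:
        broken.append("net_worth")
    if h.gross_income - h.deductions != h.taxable_income:
        broken.append("taxable_income")
    return broken


consistent = Household(50_000_000, 20_000_000, 30_000_000,
                       12_000_000, 2_000_000, 10_000_000)
# Same record, but net worth is overstated by $10,000.
inconsistent = Household(50_000_000, 20_000_000, 31_000_000,
                         12_000_000, 2_000_000, 10_000_000)

print(broken_identities(consistent))
print(broken_identities(inconsistent))
```

A real schema would carry many more identities (ledger entries per dollar moved, for example); the point is only that each one is a testable equation, not a judgment call.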

The crucial property is non-derivability. A correctly built synthetic dataset cannot be reverse-engineered to identify a real person, even by an attacker who has access to auxiliary data. Anonymized real data fails this test the moment a re-identification attack succeeds; synthetic data passes it by construction.

What synthetic data is not

Buyers regularly mistake three things for synthetic data, and each fails in production for a different reason.

Anonymized data is not synthetic. Stripping names off a CSV still leaves the joint distribution of every other field intact. Re-identification is a function of how much joint distribution you preserve, and financial datasets have to preserve a lot of it to be useful.
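One way to see the problem is to count how many records are unique on a handful of non-name fields. The sketch below uses made-up rows, and the column names (`zip`, `birth_year`, `balance_band`) are hypothetical.

```python
from collections import Counter

# Made-up "anonymized" rows: names removed, other fields left intact.
rows = [
    {"zip": "10001", "birth_year": 1980, "balance_band": "250k-500k"},
    {"zip": "10001", "birth_year": 1980, "balance_band": "250k-500k"},
    {"zip": "10001", "birth_year": 1975, "balance_band": "1m+"},
    {"zip": "94105", "birth_year": 1990, "balance_band": "50k-100k"},
]


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


one_field = unique_fraction(rows, ["zip"])
three_fields = unique_fraction(rows, ["zip", "birth_year", "balance_band"])
print(one_field, three_fields)
```

Adding a field can only keep or raise the unique fraction, and a record that is unique on these fields is one an attacker with auxiliary data can match.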

Mocked data is not synthetic. Faker libraries and Math.random() produce plausible-looking strings (a 9-digit SSN-shaped value, a 16-digit credit-card-shaped value) without preserving the relationships that make them useful for testing. A mocked household has a $4M brokerage balance with a $30K salary and a 24-year-old head of household — three values that are individually plausible and jointly nonsensical.
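The household above passes every per-field range check and still fails a cross-field one. The sketch below illustrates that gap; the `max_multiple` rule is a hypothetical heuristic invented for this example, not an underwriting or vendor rule, and the range limits are likewise illustrative.

```python
def individually_plausible(age, salary, brokerage):
    # Per-field range checks, the kind a mock generator satisfies trivially.
    return (18 <= age <= 100
            and 0 <= salary <= 5_000_000
            and 0 <= brokerage <= 50_000_000)


def jointly_plausible(age, salary, brokerage, max_multiple=3.0):
    # Hypothetical heuristic (an assumption for illustration only):
    # brokerage wealth should not exceed `max_multiple` times the salary
    # earned over the adult working years so far.
    working_years = max(age - 18, 1)
    return brokerage <= max_multiple * salary * working_years


mocked = {"age": 24, "salary": 30_000, "brokerage": 4_000_000}

print(individually_plausible(**mocked))
print(jointly_plausible(**mocked))
```

A real generator would model the conditional distribution rather than apply a hard cut-off, since inheritances and windfalls do exist; the cut-off here only makes the contradiction visible.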

Aggregated data is not synthetic. A dataset of "average balances by ZIP code" is privacy-preserving but useless for unit-level testing. Engines run on records, not aggregates.

Data type	Privacy	Joint-distribution fidelity	Suitable for
Anonymized real	Weak: re-identifiable	High: it's the original	Discouraged for wealth use cases
Mocked	Strong: no real person	None: fields are independent	Smoke tests, schema validation
Aggregated	Strong: k-anonymous bins	Marginal only	Population-level analytics
Synthetic (good)	Strong: no joinable identifier	High: preserved by construction	Backtesting, model training, audits

Why fintech is harder than the canonical synthetic-data examples

The synthetic-data literature draws most of its examples from healthcare and computer vision. Both are easier than fintech, for reasons that matter when you are evaluating vendors.

A medical record has perhaps a hundred fields with relatively local dependencies — a diagnosis code constrains a small set of medications, a lab result lives in a defined range, an admission date precedes a discharge date. A wealth profile has thousands of fields with global dependencies that span years. A 401(k) contribution in February constrains a tax filing in April, which constrains an estimated payment in June, which interacts with a Roth conversion window in November, which depends on an RMD calculation that itself depends on a beneficiary structure set up in 2014. Every long-range dependency is a chance for the generator to produce a household that looks fine in any single field and is internally inconsistent in a way that breaks downstream logic.

The other thing that makes finance harder is the cost of getting it wrong. A synthetic patient with a contradictory diagnosis breaks one screen of an EHR demo. A synthetic household with a contradictory tax-lot record can crash a wash-sale engine, propagate to a fee benchmark report, and end up in a Reg B audit trail. Fidelity is not a nice-to-have for fintech synthetic data — it is the product.

The eight evaluation questions

When buyers ask for an evaluation framework, this is the framework. We score every vendor (including ourselves) against each question on a 1-to-3 scale; a total below 18 out of a possible 24 means the dataset is not production-grade for wealth-tech.

The buyer's evaluation checklist

1. Is the joint distribution defended? Show me ten field pairs and the population correlations they preserve.
2. Is the lot-level structure present, or only aggregate positions? Aggregates fail TLH, fail RMD math, fail Reg BI suitability scoring.
3. Does it handle multi-account households correctly — taxable, IRA, 401(k), HSA, joint trust, spouse's account, all linked?
4. Are sensitive demographic fields (race, religion, sexual orientation) absent by default and only added under explicit consent for use cases that require them?
5. What is the longitudinal granularity — annual, quarterly, monthly? Monthly is the floor for any cash-flow-aware use case.
6. Does the dataset include realistic edge cases: UHNW carryforwards, multi-state filers, single-state ZIP-level sales tax, NIIT triggers, IRMAA tier transitions?
7. Is the validation strict-fail? A vendor that ships records with 'one warning is OK' is shipping bugs.
8. Is the generation methodology versioned, reproducible, and inspectable? Black-box pipelines fail SR 11-7 and equivalent model-risk reviews.

The first three questions kill more vendors than the rest combined. Most synthetic-data products were built for general-purpose use cases — fraud detection demos, dashboard mockups, customer-support training — and never had to defend joint-distribution fidelity for tax-aware retirement planning or lot-level wash-sale tracking. They look impressive at the field-summary level and fall over the moment you query a cross-field invariant.
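The scoring rule for the checklist (eight questions, 1 to 3 each, a total below 18 fails) can be written down directly. The sketch below does that; the question keys are shorthand labels invented for this example, and the vendor scores are made up.

```python
QUESTIONS = (
    "joint_distribution",
    "lot_level_structure",
    "multi_account_households",
    "sensitive_fields_opt_in",
    "longitudinal_granularity",
    "edge_cases",
    "strict_fail_validation",
    "reproducible_methodology",
)
PASS_THRESHOLD = 18  # from the rubric: a total below 18 is not production-grade


def rubric_total(scores: dict[str, int]) -> int:
    """Sum the eight 1-to-3 scores, rejecting missing or out-of-range entries."""
    if set(scores) != set(QUESTIONS):
        raise ValueError("scores must cover exactly the eight questions")
    if any(s not in (1, 2, 3) for s in scores.values()):
        raise ValueError("each score must be 1, 2, or 3")
    return sum(scores.values())


def is_production_grade(scores: dict[str, int]) -> bool:
    return rubric_total(scores) >= PASS_THRESHOLD


# Made-up vendors: A scores 2 everywhere, B improves on the first two questions.
vendor_a = dict.fromkeys(QUESTIONS, 2)
vendor_b = {**vendor_a, "joint_distribution": 3, "lot_level_structure": 3}

print(rubric_total(vendor_a), is_production_grade(vendor_a))
print(rubric_total(vendor_b), is_production_grade(vendor_b))
```

Rejecting incomplete score sheets is a design choice in this sketch: an unanswered question should not silently count as zero or be skipped.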

Where synthetic still trails real data

We are explicit with prospects about the gaps that remain. Synthetic data does not yet match real production data on three fronts:

Long-tail behavioral patterns. A retiree who pays cash for everything, a young saver who keeps their emergency fund in a CD ladder, an immigrant household that wires money internationally on the 15th of every month — these patterns exist in real data and are underrepresented in synthetic generation pipelines because the LLMs and rules engines that produce synthetic households have weaker priors on rare behaviors.
Adversarial signal. Real fraud, real tax-evasion patterns, real Ponzi-scheme victims — these have signal that you cannot reliably synthesize without the real examples to train on. Fraud-detection teams should pair synthetic for the bulk of their pipeline with curated real adversarial samples for the detection edge.
Genuinely novel outliers. A 28-year-old with a $40M crypto windfall and a Section 1244 loss carryforward against a former S-corp that is itself a partner in a real-estate fund. These exist. Synthetic generators trained on archetype distributions produce fewer of them than real-world tail draws would.

What good looks like in 2026

The state of the art has moved fast. A working production-grade synthetic financial dataset in 2026 has six properties that were aspirational in 2023 and table-stakes today.

Property 1: Archetype-driven generation. Households are produced inside named segments with explicit population targets, not as undifferentiated draws from a model.
Property 2: Lot-level and account-linked. Tax-lot acquisition history, multi-account linking, basis adjustments: every field a real engine would query.
Property 3: Monthly longitudinal. 60+ months of history and 36+ months of projection, validated for within-month identity and cross-month continuity.
Property 4: Strict-fail validation. Any warning fails the household. No partial credit, no soft warnings shipped to buyers.
Property 5: Conditional sensitive fields. Race, religion, and gender beyond M/F are absent by default and appear only as explicit overlays for the use cases that require them.
Property 6: Reproducible pipeline. Versioned prompts, versioned validators, versioned random seeds. The dataset can be regenerated bit-for-bit from its manifest.
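Bit-for-bit regeneration is something a buyer can test rather than take on trust. The sketch below shows the idea with a toy generator; the manifest keys (`generator_version`, `seed`, `n_households`) and the record fields are assumptions made for illustration, not a real manifest format.

```python
import hashlib
import json
import random


def generate(manifest: dict) -> list[dict]:
    """Toy generator: all randomness is derived from the manifest seed."""
    rng = random.Random(manifest["seed"])  # local RNG, no hidden global state
    return [
        {"household_id": i, "balance_cents": rng.randrange(0, 10_000_000)}
        for i in range(manifest["n_households"])
    ]


def fingerprint(records: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the records."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest; a real one would also pin prompts and validators.
manifest = {"generator_version": "0.0.0-example", "seed": 20260508, "n_households": 100}

first = fingerprint(generate(manifest))
second = fingerprint(generate(manifest))
other = fingerprint(generate({**manifest, "seed": 1}))

print(first == second)  # same manifest, same bytes
print(first == other)   # different seed, different dataset
```

Comparing hashes from two independent runs is the whole test. A real manifest would also need to pin the interpreter and library versions, since random number streams are not guaranteed to be identical across releases.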

A vendor missing any one of these properties is shipping a 2022-vintage product. The cost difference between 2022-vintage and 2026-vintage synthetic data is roughly 30% at corpus scale; the quality difference is the rest of the story.

A short decision tree

The framework we walk every buyer through fits on a single page.

Formula

The fidelity floor

evaluate(dataset) = min(joint_fidelity, edge_case_coverage, lot_resolution, longitudinal_depth, validation_strictness)

joint_fidelity: Population-level correlation preservation across at least 10 cross-field pairs
edge_case_coverage: Presence of UHNW, multi-state, NIIT, IRMAA, RMD, wash-sale edge cases
lot_resolution: Lot-level (good) vs position-level (insufficient for tax engines)
longitudinal_depth: Months of history × months of projection × snapshot frequency
validation_strictness: Strict-fail (1.0) vs soft-warn (degrades quality, score < 1.0)

The minimum function is deliberate: a dataset is only as good as its weakest evaluation dimension. A vendor that scores high on four dimensions and low on one is shipping a dataset that fails the use cases that depend on the weak dimension — and you don't know which ones until you find out the hard way in production.
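The fidelity floor is a one-line computation once each dimension has a score. The sketch below assumes every dimension has already been normalized to the range 0 to 1; the article does not specify that normalization, and the vendor scores are made up.

```python
DIMENSIONS = (
    "joint_fidelity",
    "edge_case_coverage",
    "lot_resolution",
    "longitudinal_depth",
    "validation_strictness",
)


def evaluate(scores: dict[str, float]) -> tuple[float, str]:
    """Return the floor score and the dimension that sets it."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return scores[weakest], weakest


# Made-up vendor: strong on four dimensions, position-level lots only.
vendor = {
    "joint_fidelity": 0.9,
    "edge_case_coverage": 0.85,
    "lot_resolution": 0.3,
    "longitudinal_depth": 0.8,
    "validation_strictness": 1.0,
}

floor, weakest = evaluate(vendor)
print(floor, weakest)
```

Returning the weakest dimension alongside the score tells you which use cases are at risk, which an average across the five dimensions would hide.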

The right way to use this framework is not to score one vendor; it is to score three. The relative scores will tell you more than any individual data sheet.

Key takeaways

Synthetic financial data is generated, not collected — and the property that matters is non-derivability from any real entity.
Anonymized, mocked, and aggregated data are not synthetic. Each fails for a different reason.
Fintech is harder than the canonical synthetic-data examples because dependency graphs span years and a single broken cross-field invariant can crash an entire downstream pipeline.
Eight evaluation questions separate production-grade datasets from showroom-grade. The first three (joint fidelity, lot-level, multi-account) eliminate most of the market.
Pair synthetic with a small curated real-data slice for adversarial signal and long-tail behaviors. The right split is closer to 90/10 than to 100/0.

Frequently asked questions

Is synthetic financial data legally classified as personal data?

No, when correctly produced. GLBA covers nonpublic personal information about real consumers. GDPR Article 4(1) defines personal data as information relating to an identified or identifiable natural person. CCPA covers personal information about California residents. A correctly synthesized record has no real person to identify, so it falls outside all three regimes. Counsel should still confirm in writing that the dataset cannot be linked back to real records; a one-page DPIA is the standard artifact.

How is synthetic financial data different from synthetic patient data or synthetic image data?

The dependency graph is much denser and spans much longer time horizons. Synthetic patient data has perhaps 100 fields with mostly local dependencies; synthetic financial data has thousands of fields and dependencies that span 5+ years. This makes the fidelity engineering meaningfully harder, and explains why most cross-domain synthetic-data vendors trail finance-specialist vendors on fidelity benchmarks.

Should we generate our own synthetic data internally or buy a corpus?

If your engineering team has a multi-year roadmap that depends on a single archetype distribution and you have synthetic-data engineering on staff, build. If you need broad coverage across many archetypes, validation, and edge cases without a dedicated team, buy. The crossover is usually around 200–300 archetypes — below that, internal builds amortize; above it, a vendor's specialized validation pipeline is hard to beat.

How do we validate synthetic data for our specific use case?

Start with the eight-question rubric in this article and add the use-case-specific invariants you care about. For TLH: lot-level resolution and wash-sale handling. For retirement: RMD math, IRMAA brackets, Roth conversion windows. For lending: fair-lending demographic distribution control. The validation suite the vendor ships is a starting point; your suite is what proves the dataset works for your engine.
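A buyer-side suite can be as simple as a list of invariant functions run in strict-fail mode. The sketch below shows that shape with two TLH-style checks; the record layout (`position_qty`, `lots`, `qty`, `acquired`) and both checks are hypothetical examples, not a vendor's validation suite.

```python
from typing import Callable, Optional

# A check returns None when it passes, or a message describing the problem.
Check = Callable[[dict], Optional[str]]


def lots_sum_to_position(h: dict) -> Optional[str]:
    # Hypothetical invariant: lot quantities must add up to the position.
    if sum(lot["qty"] for lot in h["lots"]) != h["position_qty"]:
        return "lot quantities do not sum to position"
    return None


def lot_dates_present(h: dict) -> Optional[str]:
    # Hypothetical invariant: every lot needs an acquisition date.
    if any(not lot.get("acquired") for lot in h["lots"]):
        return "lot missing acquisition date"
    return None


def validate_strict(h: dict, checks: list[Check]) -> tuple[bool, list[str]]:
    """Strict-fail: a single finding of any severity rejects the household."""
    findings = [msg for check in checks if (msg := check(h)) is not None]
    return (not findings, findings)


CHECKS = [lots_sum_to_position, lot_dates_present]

good = {"position_qty": 150, "lots": [
    {"qty": 100, "acquired": "2021-03-01"},
    {"qty": 50, "acquired": "2023-07-15"},
]}
bad = {"position_qty": 150, "lots": [
    {"qty": 100, "acquired": "2021-03-01"},
    {"qty": 40, "acquired": ""},
]}

print(validate_strict(good, CHECKS))
print(validate_strict(bad, CHECKS))
```

Collecting every finding before rejecting, rather than stopping at the first, keeps the strict-fail outcome while giving the vendor a complete defect list per household.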