Engineering leaders ask us a version of the same question every week. "We need realistic test data for the new credit engine / robo-advisor / lending model. The compliance team won't let us use a copy of production. What's the right shape of synthetic to look at?"
This article is the answer we wish we had three years ago: a definition compliance can sign off on, the four architectures every production vendor uses (and the bug class each ships outside its band), and the eight questions we now put in front of every prospective buyer before the procurement conversation starts.
What synthetic financial data actually is
Synthetic financial data is data that was generated, not collected. It describes households, balance sheets, transactions, lots, statements, claims, and policies that look and behave like the real ones your engine has to handle, but that have no causal link to any real person, account, or institution. The records are internally consistent — assets minus liabilities equals net worth, gross income minus deductions equals taxable income, every dollar moved has a corresponding ledger entry — but the entities described do not exist in any registry, custodial system, payroll database, or property record.
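The two accounting identities named above can be checked mechanically. The following is a minimal sketch, not a description of any vendor's validator: the `Household` record and its field names are hypothetical, and amounts are assumed to be stored as integer cents so that equality checks are exact.

```python
from dataclasses import dataclass


@dataclass
class Household:
    # Hypothetical field names; all amounts are integer cents.
    assets: int
    liabilities: int
    net_worth: int
    gross_income: int
    deductions: int
    taxable_income: int


def identity_violations(h: Household) -> list[str]:
    """Return the name of every accounting identity the record breaks."""
    violations = []
    if h.assets - h.liabilities != h.net_worth:
        violations.append("net_worth")
    if h.gross_income - h.deductions != h.taxable_income:
        violations.append("taxable_income")
    return violations


consistent = Household(500_000_00, 200_000_00, 300_000_00,
                       120_000_00, 30_000_00, 90_000_00)
broken = Household(500_000_00, 200_000_00, 350_000_00,
                   120_000_00, 30_000_00, 90_000_00)

print(identity_violations(consistent))  # []
print(identity_violations(broken))      # ['net_worth']
```

A real corpus would carry many more identities than these two; the point is only that each one is a cheap, exact check.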
The crucial property is non-derivability. A correctly built synthetic dataset cannot be reverse-engineered to identify a real person, even by an attacker who has access to auxiliary data. Anonymized real data fails this test the moment a re-identification attack succeeds; synthetic data passes it by construction.
What synthetic data is not
Buyers regularly mistake three things for synthetic data, and each fails in production for a different reason.
Anonymized data is not synthetic. Stripping names off a CSV still leaves the joint distribution of every other field intact. Re-identification is a function of how much joint distribution you preserve, and financial datasets have to preserve a lot of it to be useful.
Mocked data is not synthetic. Faker libraries and Math.random() produce plausible-looking strings (a 9-digit SSN-shaped value, a 16-digit credit-card-shaped value) without preserving the relationships that make them useful for testing. A mocked household has a $4M brokerage balance with a $30K salary and a 24-year-old head of household — three values that are individually plausible and jointly nonsensical.
Aggregated data is not synthetic. A dataset of "average balances by ZIP code" is privacy-preserving but useless for unit-level testing. Engines run on records, not aggregates.
| Data type | Privacy | Joint-distribution fidelity | Suitable for |
|---|---|---|---|
| Anonymized real | Weak — re-identifiable | High — it's the original | Discouraged for wealth use cases |
| Mocked | Strong — no real person | None — fields are independent | Smoke tests, schema validation |
| Aggregated | Strong — k-anonymous bins | Marginal only | Population-level analytics |
| Synthetic (good) | Strong — no joinable identifier | High — preserved by construction | Backtesting, model training, audits |
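The "fields are independent" failure in the mocked row can be illustrated with a toy population. This sketch assumes an invented relationship (brokerage balance grows with salary plus noise) purely for demonstration. It then shuffles one column, which keeps each field's marginal distribution intact while destroying the relationship between the two.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)

# Toy population (an assumption for illustration, not real data):
# balance is roughly twice salary, plus noise.
salaries = [rng.uniform(30_000, 300_000) for _ in range(5_000)]
balances = [2 * s + rng.gauss(0, 50_000) for s in salaries]

# "Mocked" version: the same balance values, reassigned at random.
# Every individual value is still plausible; the pairing is not.
mocked_balances = balances[:]
rng.shuffle(mocked_balances)

joint_r = pearson(salaries, balances)
mocked_r = pearson(salaries, mocked_balances)
print(f"joint correlation:  {joint_r:.3f}")
print(f"mocked correlation: {mocked_r:.3f}")
```

A field-summary report (means, ranges, histograms) is identical for both versions, which is why per-field summaries cannot distinguish mocked data from synthetic data.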
Why fintech is harder than the canonical synthetic-data examples
The synthetic-data literature draws most of its examples from healthcare and computer vision. Both are easier than fintech, for reasons that matter when you are evaluating vendors.
A medical record has perhaps a hundred fields with relatively local dependencies — a diagnosis code constrains a small set of medications, a lab result lives in a defined range, an admission date precedes a discharge date. A wealth profile has thousands of fields with global dependencies that span years. A 401(k) contribution in February constrains a tax filing in April, which constrains an estimated payment in June, which interacts with a Roth conversion window in November, which depends on an RMD calculation that itself depends on a beneficiary structure set up in 2014. Every long-range dependency is a chance for the generator to produce a household that looks fine in any single field and is internally inconsistent in a way that breaks downstream logic.
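One of the simplest long-range checks is a reconciliation between monthly records and an annual figure. The sketch below assumes a hypothetical layout in which 401(k) contributions are keyed by `YYYY-MM` and the tax record reports one annual total, both in integer cents. It is an illustration of the dependency class, not a tax calculation.

```python
def contributions_reconcile(monthly: dict[str, int],
                            reported_annual: int,
                            year: int) -> bool:
    """True when the monthly contributions for `year` sum to the annual figure."""
    prefix = f"{year}-"
    total = sum(cents for month, cents in monthly.items()
                if month.startswith(prefix))
    return total == reported_annual


# Twelve equal contributions of $1,500.00 (hypothetical values, in cents).
monthly_2024 = {f"2024-{m:02d}": 1_500_00 for m in range(1, 13)}
print(contributions_reconcile(monthly_2024, 18_000_00, 2024))  # True

# One month drifts. February still looks plausible in isolation,
# but the household no longer agrees with its own annual record.
drifted = dict(monthly_2024)
drifted["2024-02"] = 1_750_00
print(contributions_reconcile(drifted, 18_000_00, 2024))  # False
```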
The other thing that makes finance harder is the cost of getting it wrong. A synthetic patient with a contradictory diagnosis breaks one screen of an EHR demo. A synthetic household with a contradictory tax-lot record can crash a wash-sale engine, propagate to a fee benchmark report, and end up in a Reg B audit trail. Fidelity is not a nice-to-have for fintech synthetic data — it is the product.
The eight evaluation questions
When buyers ask for an evaluation framework, this is the framework. We score every vendor (including ourselves) against each question on a 1-to-3 scale; a total below 18 out of a possible 24 means the dataset is not production-grade for wealth-tech.
The buyer's evaluation checklist
1. Is the joint distribution defended? Show me ten field pairs and the population correlations they preserve.
2. Is the lot-level structure present, or only aggregate positions? Aggregates fail TLH, fail RMD math, fail Reg BI suitability scoring.
3. Does it handle multi-account households correctly: taxable, IRA, 401(k), HSA, joint trust, spouse's account, all linked?
4. Are sensitive demographic fields (race, religion, sexual orientation) absent by default and only added under explicit consent for use cases that require them?
5. What is the longitudinal granularity: annual, quarterly, monthly? Monthly is the floor for any cash-flow-aware use case.
6. Does the dataset include realistic edge cases: UHNW carryforwards, multi-state filers, single-state ZIP-level sales tax, NIIT triggers, IRMAA tier transitions?
7. Is the validation strict-fail? A vendor that ships records with 'one warning is OK' is shipping bugs.
8. Is the generation methodology versioned, reproducible, and inspectable? Black-box pipelines fail SR 11-7 and equivalent model-risk reviews.
The first three questions kill more vendors than the rest combined. Most synthetic-data products were built for general-purpose use cases — fraud detection demos, dashboard mockups, customer-support training — and never had to defend joint-distribution fidelity for tax-aware retirement planning or lot-level wash-sale tracking. They look impressive at the field-summary level and fall over the moment you query a cross-field invariant.
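The scoring rule described above (eight questions, 1 to 3 each, a total below 18 fails) is simple enough to write down. This is a sketch of that arithmetic only. The short question keys are our own hypothetical labels for the eight questions, and the example scores are invented.

```python
QUESTIONS = (
    "joint_distribution",
    "lot_level",
    "multi_account",
    "sensitive_fields",
    "longitudinal_granularity",
    "edge_cases",
    "strict_fail_validation",
    "reproducible_methodology",
)
PASS_THRESHOLD = 18  # a total below 18 is not production-grade


def score_vendor(scores: dict[str, int]) -> tuple[int, bool]:
    """Return (total, passes) for one vendor's answers to the eight questions."""
    missing = set(QUESTIONS) - set(scores)
    if missing:
        raise ValueError(f"unscored questions: {sorted(missing)}")
    for name in QUESTIONS:
        if scores[name] not in (1, 2, 3):
            raise ValueError(f"{name} must be scored 1, 2, or 3")
    total = sum(scores[name] for name in QUESTIONS)
    return total, total >= PASS_THRESHOLD


# Invented example vendors.
middling = {name: 2 for name in QUESTIONS}
strong_core = {**middling, "joint_distribution": 3, "lot_level": 3, "multi_account": 3}

print(score_vendor(middling))     # (16, False)
print(score_vendor(strong_core))  # (19, True)
```

Note that a vendor scoring 2 on everything lands at 16 and fails, so passing requires a 3 on at least two questions.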
Where synthetic still trails real data
We are explicit with prospects about the gaps that remain. Synthetic data does not yet match real production data on three fronts:
- Long-tail behavioral patterns. A retiree who pays cash for everything, a young saver who keeps their emergency fund in a CD ladder, an immigrant household that wires money internationally on the 15th of every month — these patterns exist in real data and are underrepresented in synthetic generation pipelines because the LLMs and rules engines that produce synthetic households have weaker priors on rare behaviors.
- Adversarial signal. Real fraud, real tax-evasion patterns, real Ponzi-scheme victims — these have signal that you cannot reliably synthesize without the real examples to train on. Fraud-detection teams should pair synthetic for the bulk of their pipeline with curated real adversarial samples for the detection edge.
- Genuinely novel outliers. A 28-year-old with a $40M crypto windfall and a Section 1244 loss carryforward against a former S-corp that is itself a partner in a real-estate fund. These exist. Synthetic generators trained on archetype distributions produce fewer of them than real-world tail draws would.
What good looks like in 2026
The state of the art has moved fast. A working production-grade synthetic financial dataset in 2026 has six properties that were aspirational in 2023 and are table stakes today.
- Property 1: Archetype-driven generation. Households are produced inside named segments with explicit population targets, not as undifferentiated draws from a model.
- Property 2: Lot-level and account-linked. Tax-lot acquisition history, multi-account linking, basis adjustments: every field a real engine would query.
- Property 3: Monthly longitudinal. 60+ months of history and 36+ months of projection, validated for within-month identity and cross-month continuity.
- Property 4: Strict-fail validation. Any warning fails the household. No partial credit, no soft warnings shipped to buyers.
- Property 5: Conditional sensitive fields. Race, religion, gender beyond M/F absent by default; appear only as explicit overlays for the use cases that require them.
- Property 6: Reproducible pipeline. Versioned prompts, versioned validators, versioned random seeds. The dataset can be regenerated bit-for-bit from its manifest.
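Properties 3 and 4 combine naturally: validate each month's internal identity, validate continuity between months, and reject the household on the first warning. The sketch below assumes a hypothetical `MonthSnapshot` shape for a single cash account, with amounts in integer cents. A real validator would cover far more than one balance.

```python
from dataclasses import dataclass


@dataclass
class MonthSnapshot:
    # Hypothetical shape for one account-month; amounts in integer cents.
    month: str
    opening_balance: int
    inflows: int
    outflows: int
    closing_balance: int


def validate_history(history: list[MonthSnapshot]) -> list[str]:
    """Collect every within-month and cross-month warning for a household."""
    warnings = []
    for i, snap in enumerate(history):
        # Within-month identity: opening + inflows - outflows = closing.
        if snap.opening_balance + snap.inflows - snap.outflows != snap.closing_balance:
            warnings.append(f"{snap.month}: within-month identity broken")
        # Cross-month continuity: this opening must equal the prior close.
        if i > 0 and snap.opening_balance != history[i - 1].closing_balance:
            warnings.append(f"{snap.month}: opening balance does not match prior close")
    return warnings


def accept_household(history: list[MonthSnapshot]) -> bool:
    """Strict-fail: a single warning rejects the whole household."""
    return not validate_history(history)


good = [
    MonthSnapshot("2025-01", 1_000_00, 500_00, 200_00, 1_300_00),
    MonthSnapshot("2025-02", 1_300_00, 500_00, 300_00, 1_500_00),
]
# February is internally consistent but does not continue from January.
bad = [
    MonthSnapshot("2025-01", 1_000_00, 500_00, 200_00, 1_300_00),
    MonthSnapshot("2025-02", 1_250_00, 500_00, 300_00, 1_450_00),
]

print(accept_household(good))  # True
print(validate_history(bad))
```

The `bad` example is the interesting case: every month passes its own check, and only the cross-month comparison catches the break.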
A vendor missing any one of these properties is shipping a 2022-vintage product. The cost difference between 2022-vintage and 2026-vintage synthetic data is roughly 30% at corpus scale; the quality difference is the rest of the story.
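The bit-for-bit claim in Property 6 is testable: regenerate from the manifest and compare digests. The sketch below uses a toy, purely rule-based generator and a hypothetical manifest layout to show the shape of that test. It says nothing about how to make an LLM-backed pipeline deterministic, which is a harder problem.

```python
import hashlib
import json
import random


def generate_corpus(seed: int, n: int) -> list[dict]:
    """Toy generator: all randomness flows from one explicitly seeded RNG."""
    rng = random.Random(seed)
    corpus = []
    for i in range(n):
        assets = rng.randrange(0, 5_000_000)
        liabilities = rng.randrange(0, 1_000_000)
        corpus.append({
            "id": i,
            "assets": assets,
            "liabilities": liabilities,
            "net_worth": assets - liabilities,
        })
    return corpus


def corpus_digest(corpus: list[dict]) -> str:
    """SHA-256 over a canonical JSON serialization of the corpus."""
    payload = json.dumps(corpus, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical manifest: everything needed to regenerate the corpus.
manifest = {"generator_version": "0.1.0", "seed": 42, "households": 100}

first = corpus_digest(generate_corpus(manifest["seed"], manifest["households"]))
second = corpus_digest(generate_corpus(manifest["seed"], manifest["households"]))
other = corpus_digest(generate_corpus(manifest["seed"] + 1, manifest["households"]))

print(first == second)  # True
print(first == other)   # False
```

Recording the digest in the manifest turns "reproducible" from a claim into something a buyer can verify on their own machine.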
A short decision tree
The framework we walk every buyer through fits on a single page.
evaluate(dataset) = min(joint_fidelity, edge_case_coverage, lot_resolution, longitudinal_depth, validation_strictness)
- joint_fidelity = Population-level correlation preservation across at least 10 cross-field pairs
- edge_case_coverage = Presence of UHNW, multi-state, NIIT, IRMAA, RMD, wash-sale edge cases
- lot_resolution = Lot-level (good) vs position-level (insufficient for tax engines)
- longitudinal_depth = Months of history × months of projection × snapshot frequency
- validation_strictness = Strict-fail (1.0) vs soft-warn (degrades quality, score < 1.0)
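Taking the minimum means a dataset is only as good as its weakest dimension; strength elsewhere does not compensate. The sketch below implements that rule under one assumption the formula leaves open: that each component has already been normalized to the range 0 to 1. The vendor scores are invented.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DatasetScores:
    # Assumption: every component is pre-normalized to [0.0, 1.0].
    joint_fidelity: float
    edge_case_coverage: float
    lot_resolution: float
    longitudinal_depth: float
    validation_strictness: float


def evaluate(scores: DatasetScores) -> tuple[float, str]:
    """Return the overall score and the name of the weakest component."""
    values = {f.name: getattr(scores, f.name) for f in fields(scores)}
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    weakest = min(values, key=values.get)
    return values[weakest], weakest


# Invented scores for two hypothetical vendors.
vendor_a = DatasetScores(0.9, 0.8, 1.0, 0.7, 1.0)
vendor_b = DatasetScores(0.95, 0.9, 0.3, 0.9, 0.6)

print(evaluate(vendor_a))  # (0.7, 'longitudinal_depth')
print(evaluate(vendor_b))  # (0.3, 'lot_resolution')
```

Returning the name of the weakest component alongside the score makes side-by-side vendor comparisons more useful, because it shows where each dataset breaks first.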
The right way to use this framework is not to score one vendor; it is to score three. The relative scores will tell you more than any individual data sheet.
Key takeaways
- Synthetic financial data is generated, not collected — and the property that matters is non-derivability from any real entity.
- Anonymized, mocked, and aggregated data are not synthetic. Each fails for a different reason.
- Fintech is harder than the canonical synthetic-data examples because dependency graphs span years and a single broken cross-field invariant can crash an entire downstream pipeline.
- Eight evaluation questions separate production-grade datasets from showroom-grade. The first three (joint fidelity, lot-level, multi-account) eliminate most of the market.
- Pair synthetic with a small curated real-data slice for adversarial signal and long-tail behaviors. The right split is closer to 90/10 than to 100/0.