Five quality dimensions every synthetic financial dataset must pass

WealthSchema Staff | Synthetic data, R&D | May 8, 2026 | 7 min read

The vendor data sheets all read the same. "Statistically faithful." "Privacy-preserving by design." "Production-ready." None of those phrases survive a serious procurement review, and none of them help a fintech buyer decide whether the corpus on the shared drive is going to back a model that ships next quarter or quietly poison the backtest.

This article is the rubric we hand procurement teams that have been burned before. Five dimensions, two test queries each, and the failure modes — population-correlation scores at 0.95 paired with broken cross-field invariants, k-anonymity claims at k=5, edge-case frequencies at zero — that disqualify a dataset before any contract is signed.

Why a five-dimension rubric

Most synthetic-data quality frameworks in the academic literature use one or two metrics — typically a privacy bound and a population-level correlation score. Those metrics are the floor, not the ceiling. A dataset can score 0.95 on every population correlation in a Pearson matrix and still be unusable for the use case you actually need it for, because the dependencies that matter to your engine are not the population correlations the score measures.

The five-dimension rubric below is the result of three years of post-mortems on synthetic-data deployments that went sideways. Each dimension corresponds to a class of bug that population-level metrics did not catch.

Dimensions: 5. Each addresses a distinct failure class.
Test queries: 10. Two per dimension, runnable in any SQL client.
Pass threshold: ≥ 4 / 5. Below this, the dataset is unsuitable for production use cases.
Time to evaluate: ~1 day. Once the test queries are written; reusable across vendors.

Dimension 1: structural integrity

Does the data satisfy the identities a working engine would assume?

The first thing every wealth-tech engine does on receiving a household record is sanity-check it. Net worth = assets − liabilities. Gross income − deductions = AGI. Year-over-year income change, after life-event adjustments, stays within band. If those identities don't hold to the dollar, the engine either rejects the record (best case) or proceeds and produces output that is silently wrong (worst case).

Formula

The four core identities

net_worth = Σ assets − Σ liabilities
agi = gross_income − above_line_deductions
taxable_income = agi − (greater(standard_deduction, itemized) + qbi_deduction)
cash_flow = inflows − outflows = Δ(cash_position)

Σ assets: all accounts, all asset types, including illiquid (real estate, private equity, restricted stock)
Σ liabilities: every debt instrument, including HELOC drawn portions, margin balances, deferred tax
qbi_deduction: Section 199A; this is the identity that breaks most often in K-1 households
Δ(cash_position): month-over-month change in checking + savings cash balances

If a vendor's dataset fails any of these identities for more than 0.1% of records, ship it back. Vendors at 5–8% identity failure may claim 'data is internally consistent.' It is not.

Test query 1. For every household, compute assets − liabilities and compare to the reported net_worth field. Histogram the absolute error. A correct dataset has 99.9%+ of records within ±$1; the remainder should be explainable by floating-point rounding on long account chains. Anything wider is a bug. Related: edge cases financial test data corpus.
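A minimal sketch of this check, written in Python with the standard-library `sqlite3` module so the SQL stays portable to other clients. The table and column names (`households`, `accounts`, `kind`, `balance`) are assumptions for illustration, not a vendor schema, and the three sample households are made up.

```python
import sqlite3

# Hypothetical schema: one row per household, one row per account.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE households (household_id TEXT PRIMARY KEY, net_worth REAL);
CREATE TABLE accounts (household_id TEXT, kind TEXT, balance REAL);
INSERT INTO households VALUES ('H1', 150000), ('H2', 90000), ('H3', 40000);
INSERT INTO accounts VALUES
  ('H1', 'asset', 400000), ('H1', 'liability', 250000),
  ('H2', 'asset', 100000), ('H2', 'liability', 10000.40),
  ('H3', 'asset', 65000),  ('H3', 'liability', 20000);
""")

# Assets count positive, liabilities negative; households with no accounts sum to 0.
IDENTITY_ERROR_SQL = """
SELECT h.household_id,
       ABS(h.net_worth - COALESCE(SUM(CASE a.kind WHEN 'asset' THEN a.balance
                                                  ELSE -a.balance END), 0)) AS abs_error
FROM households h
LEFT JOIN accounts a ON a.household_id = h.household_id
GROUP BY h.household_id, h.net_worth
ORDER BY h.household_id
"""

def net_worth_errors(connection):
    """Return {household_id: absolute net-worth identity error in dollars}."""
    return dict(connection.execute(IDENTITY_ERROR_SQL).fetchall())

def share_within(errors, tolerance=1.0):
    """Fraction of households whose identity error is within the tolerance."""
    return sum(e <= tolerance for e in errors.values()) / len(errors)

errors = net_worth_errors(conn)
within_one_dollar = share_within(errors)
```

On this toy sample two of three households reconcile within $1, so it would fail the 99.9% bar; a real run would compare `within_one_dollar` against 0.999 over the full corpus.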

Test query 2. For every multi-month longitudinal record, compute cash_position[m] − cash_position[m−1] and compare to inflows[m] − outflows[m]. The error margin should be ±5% or less for routine months, with explicit larger excursions only for documented one-time events (real-estate transactions, RSU vests, inheritance receipts).
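The reconciliation can be sketched in a few lines of Python. The text does not say what the ±5% is measured against, so this sketch assumes the error is taken relative to the month's gross flow (inflows plus outflows); the field names and the `one_time_event` label are hypothetical.

```python
def cash_flow_breaks(months, tolerance=0.05):
    """Return indexes of months where the change in cash does not match net flow.

    `months` is a chronological list of dicts with hypothetical keys:
    cash_position, inflows, outflows, and an optional one_time_event label.
    """
    breaks = []
    for m in range(1, len(months)):
        cur, prev = months[m], months[m - 1]
        delta = cur["cash_position"] - prev["cash_position"]
        net_flow = cur["inflows"] - cur["outflows"]
        # Assumption: error is relative to gross monthly flow; floor at $1 to avoid /0.
        scale = max(abs(cur["inflows"]) + abs(cur["outflows"]), 1.0)
        rel_error = abs(delta - net_flow) / scale
        # Documented one-time events are allowed to exceed the routine tolerance.
        if rel_error > tolerance and not cur.get("one_time_event"):
            breaks.append(m)
    return breaks

# Made-up household: month 2 gains $2,400 of cash on $400 of net flow.
sample_months = [
    {"cash_position": 10_000, "inflows": 8_000, "outflows": 7_500},
    {"cash_position": 10_600, "inflows": 8_000, "outflows": 7_400},
    {"cash_position": 13_000, "inflows": 8_000, "outflows": 7_600},
    {"cash_position": 63_000, "inflows": 8_000, "outflows": 7_500,
     "one_time_event": "inheritance"},
]
breaks = cash_flow_breaks(sample_months)
```

Here only month 2 is reported; month 3 moves far more cash but carries an event label, which matches the carve-out described above.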

Dimension 2: distributional fidelity

Does the joint distribution of fields match the population the dataset claims to model?

Single-field histograms are easy. Joint distributions across pairs of fields are harder. Joint distributions across three or more fields are where most synthetic datasets fall apart.

Test query 3. Histogram (age_band × net_worth_band × state) and compare against a public reference (FRB SCF, IRS SOI). The KL divergence should be under 0.15 for a well-tuned dataset; over 0.4 means the dataset has the wrong demographic shape and any model trained on it will inherit that shape.
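A rough sketch of the divergence calculation in plain Python. Several choices here are assumptions the text does not fix: the direction is KL(synthetic || reference), the unit is nats, and a small additive smoothing term handles empty cells. The reference probabilities below are invented for the example and are not SCF or SOI figures.

```python
import math
from collections import Counter

def kl_divergence(synthetic_cells, reference_probs, smoothing=1e-6):
    """KL(synthetic || reference) in nats over joint histogram cells.

    synthetic_cells: iterable of (age_band, net_worth_band, state) tuples.
    reference_probs: {cell: probability} from a public reference table.
    """
    counts = Counter(synthetic_cells)
    total = sum(counts.values())
    cells = set(counts) | set(reference_probs)
    # Additive smoothing keeps empty cells from producing log(0) or division by zero.
    p = {c: counts.get(c, 0) / total + smoothing for c in cells}
    q = {c: reference_probs.get(c, 0.0) + smoothing for c in cells}
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return sum(
        (p[c] / p_norm) * math.log((p[c] / p_norm) / (q[c] / q_norm)) for c in cells
    )

A = ("35-44", "250k-1m", "CA")
B = ("35-44", "250k-1m", "TX")
C = ("65+", "1m+", "FL")
reference = {A: 0.5, B: 0.3, C: 0.2}   # illustrative numbers only

kl_matching = kl_divergence([A] * 5 + [B] * 3 + [C] * 2, reference)
kl_skewed = kl_divergence([A] * 9 + [B] * 1, reference)
```

The matching sample scores essentially zero, while the skewed sample lands a little above 0.4, on the wrong side of the upper threshold quoted in the text.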

Test query 4. Compute the conditional distribution P(account_type | age × income). A 65-year-old with a $200K income should have a 70%+ probability of a 401(k) or IRA. A 24-year-old with a $40K income should have a 30%+ probability of having no retirement account. Datasets that get conditional distributions wrong have not been calibrated against household survey data, full stop.
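As an illustration, the conditional share can be computed per (age, income) cell like this. The band labels, the `account_types` field, and the `"401k"` / `"ira"` labels are hypothetical; a real corpus would need its own mapping.

```python
from collections import Counter

RETIREMENT_TYPES = {"401k", "ira"}  # hypothetical account-type labels

def retirement_share_by_cell(households):
    """Return {(age_band, income_band): P(has a 401(k) or IRA)}."""
    totals = Counter()
    with_retirement = Counter()
    for h in households:
        cell = (h["age_band"], h["income_band"])
        totals[cell] += 1
        if RETIREMENT_TYPES & set(h["account_types"]):
            with_retirement[cell] += 1
    return {cell: with_retirement[cell] / n for cell, n in totals.items()}

# Made-up records for the two cells discussed in the text.
sample_households = [
    {"age_band": "65+", "income_band": "200k", "account_types": ["401k", "brokerage"]},
    {"age_band": "65+", "income_band": "200k", "account_types": ["ira"]},
    {"age_band": "65+", "income_band": "200k", "account_types": ["checking"]},
    {"age_band": "65+", "income_band": "200k", "account_types": ["ira", "401k"]},
    {"age_band": "18-24", "income_band": "40k", "account_types": ["checking"]},
    {"age_band": "18-24", "income_band": "40k", "account_types": ["401k"]},
    {"age_band": "18-24", "income_band": "40k", "account_types": ["savings"]},
]
shares = retirement_share_by_cell(sample_households)
older_high_income = shares[("65+", "200k")]
young_no_retirement = 1 - shares[("18-24", "40k")]
```

Both toy cells clear the bars named above (70%+ with a retirement account for the older cell, 30%+ without one for the younger cell); cells with very few records should be read with caution.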

Dimension 3: edge-case coverage

Does the dataset include the edge cases your engine has to handle?

Almost every catastrophic failure of a wealth-tech engine in production is on an edge case the test corpus didn't cover. Multi-state filers with apportionment rules. UHNW households with $1M+ capital-loss carryforwards. ZIP-level sales-tax lookups that must handle the five states with no statewide sales tax (Alaska, Delaware, Montana, New Hampshire, Oregon) and the local-tax surprises of Alabama and Louisiana. NIIT triggers at MAGI > $250K MFJ. IRMAA bracket transitions on a one-time event year.

Test query 5. Count households where state-of-residence at end of year ≠ state-of-residence at start of year, broken down by reason (move, retire, college). Should be 2–4% of any general population dataset. If it's < 0.5%, the dataset is missing the move-year filer scenario entirely.
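A short sketch of the mover count, assuming each household record carries hypothetical `state_start`, `state_end`, and `move_reason` fields. The verdict bands are the ones quoted in the text; anything between them is left for manual review.

```python
from collections import Counter

def mover_breakdown(households):
    """Return (share of households that changed state, counts by reason)."""
    movers = [h for h in households if h["state_start"] != h["state_end"]]
    reasons = Counter(h.get("move_reason") or "unknown" for h in movers)
    return len(movers) / len(households), dict(reasons)

def mover_verdict(share):
    """Bands from the text: under 0.5% is missing, 2-4% is expected."""
    if share < 0.005:
        return "move-year scenario missing"
    if 0.02 <= share <= 0.04:
        return "within expected band"
    return "review manually"

# Made-up sample: 97 stayers plus 3 movers.
sample_population = [{"state_start": "TX", "state_end": "TX"} for _ in range(97)]
sample_population += [
    {"state_start": "NY", "state_end": "FL", "move_reason": "retire"},
    {"state_start": "CA", "state_end": "WA", "move_reason": "move"},
    {"state_start": "OH", "state_end": "MA", "move_reason": "college"},
]
share, by_reason = mover_breakdown(sample_population)
```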

Test query 6. Count households with at least one of: capital loss carryforward > $500K; QBI deduction with W-2 wage limitation binding; foreign tax credit > $0; AMT exposure; QSBS holding. These are not unusual in a representative wealth-tech corpus: they should collectively appear in 8–15% of records. Below 5% means the long tail is not represented. Related: charitable remainder trust CLAT modeling, RMD age 73 SECURE 2.0, and UPIA trust accounting fiduciary. See crypto DeFi tax engine and insurance illustration engine edge cases for parallel coverage problems.
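One possible shape for this count in Python, with one predicate per listed edge case. All field names are hypothetical, and whether a wage limitation is "binding" is assumed to be precomputed as a boolean flag on the record.

```python
# Each predicate mirrors one item in the query; field names are hypothetical.
EDGE_CASE_PREDICATES = {
    "large_loss_carryforward": lambda h: h.get("capital_loss_carryforward", 0) > 500_000,
    "qbi_wage_limit_binding": lambda h: bool(h.get("qbi_wage_limit_binding")),
    "foreign_tax_credit": lambda h: h.get("foreign_tax_credit", 0) > 0,
    "amt_exposure": lambda h: bool(h.get("amt_exposure")),
    "qsbs_holding": lambda h: bool(h.get("qsbs_holding")),
}

def edge_case_coverage(households):
    """Return (share with at least one edge case, per-case household counts)."""
    per_case = {name: 0 for name in EDGE_CASE_PREDICATES}
    any_hit = 0
    for h in households:
        hits = [name for name, test in EDGE_CASE_PREDICATES.items() if test(h)]
        for name in hits:
            per_case[name] += 1
        any_hit += bool(hits)  # a household counts once, however many cases it hits
    return any_hit / len(households), per_case

# Made-up sample: 18 plain households plus 2 with edge cases.
sample_corpus = [{} for _ in range(18)]
sample_corpus += [
    {"capital_loss_carryforward": 750_000, "amt_exposure": True},
    {"foreign_tax_credit": 1_200},
]
coverage, per_case = edge_case_coverage(sample_corpus)
```

Reporting the per-case counts alongside the combined share helps spot a corpus that reaches 8–15% overall by over-representing a single edge case.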

Dimension 4: longitudinal continuity

Do the temporal records hold together across snapshots?

Annual data hides cash-flow seasonality. Monthly data exposes it — and exposes the fabrications that single-call generators introduce when they have to produce 50+ snapshots per household.

Failure mode 1: Seam discontinuities. Month 32 ending balance ≠ month 33 starting balance. Caused by chunked generation pipelines that don't enforce the seam invariant.
Failure mode 2: Magic money. Cash position increases without a corresponding inflow line. The most common LLM-pipeline bug.
Failure mode 3: Identity drift. Net worth identity holds at month 1 but drifts past month 24. Means the validator only checked endpoints.
Failure mode 4: Rate-of-change anomalies. Account balances changing at 3-sigma+ rates with no event flag. Real households rarely move 30% of net worth in a month; synthetic ones do, depressingly often.

Test query 7. For every month boundary, compute the absolute difference between closing_balance[m] and opening_balance[m+1] for every account. Should be exactly zero (or within FP epsilon) for non-transfer accounts. A vendor that ships seam mismatches is shipping a dataset where time-series logic will break in non-deterministic ways downstream.
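The seam check is simple enough to sketch directly. This version assumes each account is a chronological list of monthly records with hypothetical `opening_balance` and `closing_balance` fields, treats half a cent as the floating-point tolerance, and does not filter out transfer accounts, which a real run would need to do first.

```python
import math

def seam_mismatches(accounts, abs_tol=0.005):
    """Return (account_id, month_index, gap) for every broken month boundary.

    gap = closing_balance[m] - opening_balance[m + 1]; abs_tol is an assumed
    stand-in for "FP epsilon" at cent precision.
    """
    mismatches = []
    for account_id, months in accounts.items():
        for m in range(len(months) - 1):
            closing = months[m]["closing_balance"]
            opening_next = months[m + 1]["opening_balance"]
            if not math.isclose(closing, opening_next, rel_tol=0.0, abs_tol=abs_tol):
                mismatches.append((account_id, m, closing - opening_next))
    return mismatches

# Made-up data: the brokerage account gains $250 between month 0 and month 1.
sample_accounts = {
    "checking-1": [
        {"opening_balance": 1000.00, "closing_balance": 1250.10},
        {"opening_balance": 1250.10, "closing_balance": 900.00},
        {"opening_balance": 900.00, "closing_balance": 950.00},
    ],
    "brokerage-1": [
        {"opening_balance": 50000.0, "closing_balance": 51000.0},
        {"opening_balance": 51250.0, "closing_balance": 51300.0},
    ],
}
mismatches = seam_mismatches(sample_accounts)
```

An empty result is the passing condition; any entry at all is worth sending back to the vendor with the account and month attached.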

Test query 8. Histogram month-over-month percent change in net worth. The vast majority of values should fall in ±5%. Excursions beyond ±25% should correlate with explicit life events in the household record. Unexplained 25%+ excursions are generator hallucinations, not realistic outliers.
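To make the excursion rule concrete, here is a small sketch that flags month-over-month moves beyond ±25% with no event recorded for that month. The input shape (a list of monthly net-worth values plus a set of month indexes that carry a life event) is an assumption about how the household record might be flattened.

```python
def unexplained_excursions(net_worth_by_month, event_months, threshold=0.25):
    """Return (month_index, fractional_change) for large moves with no event flag."""
    flagged = []
    for m in range(1, len(net_worth_by_month)):
        prev = net_worth_by_month[m - 1]
        if prev == 0:
            continue  # percent change is undefined from a zero base
        change = (net_worth_by_month[m] - prev) / abs(prev)
        if abs(change) > threshold and m not in event_months:
            flagged.append((m, round(change, 4)))
    return flagged

# Made-up series: month 2 jumps on a documented RSU vest, month 4 drops with no event.
series = [500_000, 510_000, 700_000, 690_000, 480_000]
flagged = unexplained_excursions(series, event_months={2})
```

Month 2 rises by about 37% but is explained by its event flag; month 4 falls by about 30% with nothing recorded, so it is the only month returned.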

Dimension 5: provenance and reproducibility

Can the dataset be regenerated, audited, and traced?

This dimension is what separates a research artifact from a production-grade product. A synthetic dataset that cannot be reproduced from a versioned manifest fails any serious model-risk review (SR 11-7, OCC 2011-12, and their equivalents in the EU and UK). A synthetic dataset whose generation pipeline cannot be inspected fails any algorithmic-fairness audit. See SR 11-7 model risk management for the regulator-facing checklist.

	Reproducible	Inspectable	Audit-ready
Versioned prompts + seeds	Yes — bit-for-bit	Yes	Yes
Versioned validators	Yes	Yes	Yes
Black-box generation API	Conditional — only if the API is versioned	No	Risky
Untracked manual edits	No	No	Disqualifying

Test query 9. Ask the vendor for the manifest of the corpus you're considering. A real manifest has: model version, prompt version hashes per stage, validator version, random seed, archetype distribution, generation date. If the vendor cannot produce it within 24 hours, the corpus was probably hand-tweaked at some point and the reproducibility story is not real.
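Once a manifest arrives, a completeness check takes a few lines. The six required fields come from the list above; the key names and the sample values are hypothetical, since the text does not define a manifest format.

```python
# Key names are hypothetical; the six required items come from the text.
REQUIRED_MANIFEST_FIELDS = (
    "model_version",
    "prompt_hashes",          # one hash per generation stage
    "validator_version",
    "random_seed",
    "archetype_distribution",
    "generation_date",
)

def missing_manifest_fields(manifest):
    """Return required fields that are absent or empty (a seed of 0 is valid)."""
    missing = []
    for field in REQUIRED_MANIFEST_FIELDS:
        value = manifest.get(field)
        is_empty = isinstance(value, (str, dict, list)) and len(value) == 0
        if value is None or is_empty:
            missing.append(field)
    return missing

# Placeholder values for illustration only.
complete_manifest = {
    "model_version": "example-model-v1",
    "prompt_hashes": {"stage_1": "sha256:aaa", "stage_2": "sha256:bbb"},
    "validator_version": "1.4.0",
    "random_seed": 0,
    "archetype_distribution": {"retiree": 0.3, "early_career": 0.7},
    "generation_date": "2026-01-15",
}
incomplete_manifest = {k: v for k, v in complete_manifest.items() if k != "random_seed"}
incomplete_manifest["prompt_hashes"] = {}

gaps = missing_manifest_fields(incomplete_manifest)
```

A non-empty `gaps` list does not prove the corpus was hand-edited, but it does give procurement a concrete list to send back to the vendor.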

Test query 10. Pick a single household from the corpus. Ask the vendor: "Show me the prompt, the model output, and the validator output for this record." A vendor that has the artifacts can produce them in minutes. A vendor that has to "go check" is admitting they don't keep them.

Putting the rubric together

The five dimensions are not equally weighted. Structural integrity (Dimension 1) is a hard gate: fail it and nothing else matters, because the dataset will produce wrong results before any model touches it. The other four are soft gates with weighting that depends on use case. See privacy utility fidelity tradeoffs for the upstream design decision.

Formula

Suitability score

suitability = is_pass(structural) × (0.30 × distributional + 0.25 × edge_case + 0.25 × longitudinal + 0.20 × provenance)

is_pass(structural): 1 if structural identities hold across 99.9%+ of records, 0 otherwise
distributional: 0–1, where 1.0 = KL divergence < 0.15 against public reference data
edge_case: 0–1, where 1.0 = at least 8 of 10 named edge cases present at population-realistic frequency
longitudinal: 0–1, where 1.0 = all four longitudinal failure modes absent
provenance: 0–1, where 1.0 = full manifest, versioned prompts, on-demand artifacts

Example

A vendor that scores 1.0 on structural, 0.85 distributional, 0.6 edge-case, 0.9 longitudinal, 0.7 provenance:
1.0 × (0.30×0.85 + 0.25×0.6 + 0.25×0.9 + 0.20×0.7) = 0.77
Above 0.75 is production-grade for general use; below 0.6 fails procurement; the 0.6–0.75 band is conditionally usable for the use cases that match the strong dimensions.
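The scoring formula and the three bands translate directly into code. This sketch uses the default weights from the formula above; the function and key names are illustrative, and teams that reweight by use case would swap in their own `WEIGHTS`.

```python
# Default soft-gate weights from the suitability formula.
WEIGHTS = {
    "distributional": 0.30,
    "edge_case": 0.25,
    "longitudinal": 0.25,
    "provenance": 0.20,
}

def suitability(structural_pass_rate, scores):
    """Hard structural gate multiplied by the weighted soft-gate score."""
    gate = 1 if structural_pass_rate >= 0.999 else 0
    return gate * sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

def verdict(score):
    """Map a suitability score onto the three bands described in the text."""
    if score > 0.75:
        return "production-grade"
    if score < 0.6:
        return "fails procurement"
    return "conditionally usable"

# The worked example: structural passes, soft gates as listed.
example_scores = {
    "distributional": 0.85,
    "edge_case": 0.6,
    "longitudinal": 0.9,
    "provenance": 0.7,
}
example_score = suitability(structural_pass_rate=1.0, scores=example_scores)
example_verdict = verdict(example_score)
gated_score = suitability(structural_pass_rate=0.95, scores=example_scores)
```

The example reproduces the 0.77 above, and the same soft-gate scores collapse to 0 when the structural pass rate is 95%, which is the point of making Dimension 1 a multiplier.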

What buyers do with this in practice

Procurement teams can use this rubric three ways. First, as a vendor scorecard during evaluation: score every candidate, hand the spreadsheet to the architecture review board. Second, as an internal QA artifact once a vendor is selected: re-run the ten queries on every refresh of the corpus and track the scores over time. Third, as the basis for a written sign-off when the dataset enters production, which model-risk and audit teams universally appreciate. See how synthetic data shortens QA cycles for the downstream benefit.

The rubric is not perfect. Use cases that depend on adversarial signal (fraud detection, suspicious-activity monitoring) need an additional dimension we haven't covered here. See 12 transaction archetypes for fintech testing and training fraud detection with synthetic. Use cases that depend on demographic-conditional fairness need an explicit fair-lending audit dimension. Adapt accordingly.

What the rubric is not is optional. A team that adopts synthetic financial data without a formal evaluation framework is going to discover the failure modes the framework would have caught — except they'll discover them in production, six months in, after the model has already been signed off. Related: 8 synthetic data mistakes to avoid and how synthetic financial data is generated.

Key takeaways

Vendor data sheets do not survive procurement review. A five-dimension rubric does. See [what is synthetic financial data](/articles/synthetic-financial-data-primer) for the upstream primer.
Structural integrity is a hard gate. Distributional fidelity, edge-case coverage, longitudinal continuity, and provenance are weighted soft gates.
Pairwise correlation scores are insufficient — three-way joint distributions are where most synthetic datasets fail.
Edge cases are the most common point of catastrophic failure in production. Score the dataset against the edge cases your engine actually has to handle. See [10 edge cases your test corpus must include](/articles/10-edge-cases-wealth-test-corpus) for the inventory.
If the vendor cannot produce a per-record prompt + output + validator artifact, the reproducibility story is not real.

Frequently asked questions

Can we run this rubric ourselves or do we need a synthetic-data specialist?

The first four dimensions are runnable by any engineering team comfortable with SQL. The provenance dimension requires the vendor's cooperation — it's a procurement artifact, not a query. Related: [GLBA compliance fintech](/articles/glba-safeguards-rule-implementation-guide).

How does this rubric apply to non-wealth fintech use cases: payments, lending, fraud?

Dimensions 1, 2, 4, and 5 transfer cleanly. Dimension 3 (edge cases) needs replacement: instead of UHNW carryforwards and IRMAA brackets, the relevant edge cases are decline reason codes for lending, chargeback patterns for payments, and adversarial signal for fraud. Detailed in [AML transaction monitoring typologies](/articles/aml-transaction-monitoring-engine-design). The structure of the rubric stays the same; the inventory of edge cases changes per use case. Companion pieces: [robo-advisor synthetic household testing](/articles/building-robo-advisor-synthetic-households) and [mortgage synthetic data edge cases](/articles/stress-testing-mortgage-origination-engine).

What's a reasonable cost for a vendor that passes the rubric at 0.85+?

Fully-validated production-grade synthetic financial corpora run $5K–$50K depending on archetype coverage and refresh frequency. Vendors charging meaningfully less are usually skipping one of the dimensions — most often validation strictness or edge-case coverage. Vendors charging meaningfully more are usually pricing on enterprise procurement processes rather than data quality.