The vendor data sheets all read the same. "Statistically faithful." "Privacy-preserving by design." "Production-ready." None of those phrases survive a serious procurement review, and none of them help a fintech buyer decide whether the corpus on the shared drive is going to back a model that ships next quarter or quietly poison the backtest.
This article is the rubric we hand procurement teams that have been burned before. Five dimensions, two test queries each, and the failure modes — population-correlation scores at 0.95 paired with broken cross-field invariants, k-anonymity claims at k=5, edge-case frequencies at zero — that disqualify a dataset before any contract is signed.
Why a five-dimension rubric
Most synthetic-data quality frameworks in the academic literature use one or two metrics — typically a privacy bound and a population-level correlation score. Those metrics are the floor, not the ceiling. A dataset can score 0.95 on every population correlation in a Pearson matrix and still be unusable for the use case you actually need it for, because the dependencies that matter to your engine are not the population correlations the score measures.
The five-dimension rubric below is the result of three years of post-mortems on synthetic-data deployments that went sideways. Each dimension corresponds to a class of bug that population-level metrics did not catch.
Dimension 1: structural integrity
Does the data satisfy the identities a working engine would assume?
The first thing every wealth-tech engine does on receiving a household record is sanity-check it. Net worth = assets − liabilities. Gross income − deductions = AGI. Income year-over-year ± life-event adjustments = within-band. If those identities don't hold to the dollar, the engine either rejects the record (best case) or proceeds and produces output that is silently wrong (worst case).
net_worth = Σ assets − Σ liabilities
agi = gross_income − above_line_deductions
taxable_income = agi − (greater(standard_deduction, itemized) + qbi_deduction)
cash_flow = inflows − outflows = Δ(cash_position)
- Σ assets = All accounts, all asset types, including illiquid (real estate, private equity, restricted stock)
- Σ liabilities = Every debt instrument, including HELOC drawn portions, margin balances, deferred tax
- qbi_deduction = Section 199A; this is the identity that breaks most often in K-1 households
- Δ(cash_position) = Month-over-month change in checking + savings cash balances
Test query 1. For every household, compute assets − liabilities and compare to the reported net_worth field. Histogram the absolute error. A correct dataset has 99.9%+ of records within ±$1; the rest are explainable by floating-point rounding on long account chains. Anything wider is a bug.
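A minimal sketch of test query 1, assuming each household is a plain dict with hypothetical `assets`, `liabilities`, and `net_worth` keys (the real corpus schema will differ). `Decimal` is used here so the check itself does not introduce rounding noise; that is a design choice of this sketch, not something the rubric prescribes.

```python
from decimal import Decimal

def net_worth_errors(households):
    """Absolute error between (sum of assets - sum of liabilities) and the reported net_worth."""
    errors = []
    for h in households:
        computed = sum(h["assets"]) - sum(h["liabilities"])
        errors.append(abs(computed - h["net_worth"]))
    return errors

def share_within_tolerance(errors, tolerance=Decimal("1")):
    """Fraction of records whose identity error is within the tolerance (one dollar by default)."""
    if not errors:
        return 0.0
    return sum(1 for e in errors if e <= tolerance) / len(errors)

# Illustrative records only; field names are assumptions.
sample_households = [
    {"assets": [Decimal("250000"), Decimal("40000")],
     "liabilities": [Decimal("180000")],
     "net_worth": Decimal("110000")},
    {"assets": [Decimal("90000")],
     "liabilities": [Decimal("15000.50")],
     "net_worth": Decimal("74999.50")},
    {"assets": [Decimal("500000")],
     "liabilities": [Decimal("100000")],
     "net_worth": Decimal("412345")},  # deliberately broken identity
]

identity_errors = net_worth_errors(sample_households)
identity_share = share_within_tolerance(identity_errors)
print(f"{identity_share:.1%} of records within tolerance")
```

On a real corpus, the share would be compared against the 99.9% bar above, and the records outside tolerance pulled for manual inspection.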
Test query 2. For every multi-month longitudinal record, compute cash_position[m] − cash_position[m−1] and compare to inflows[m] − outflows[m]. The error margin should be ±5% or less for routine months, with explicit larger excursions only for documented one-time events (real-estate transactions, RSU vests, inheritance receipts).
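Test query 2 could be sketched along the following lines. The field names (`cash_position`, `inflows`, `outflows`, `event`) are hypothetical, and the sketch assumes the ±5% margin is measured relative to the month's net flow, which the rubric does not spell out.

```python
def cash_flow_breaks(months, rel_tol=0.05):
    """Indices of months where the change in cash disagrees with inflows - outflows
    by more than rel_tol and no one-time event is documented for that month."""
    breaks = []
    for m in range(1, len(months)):
        delta = months[m]["cash_position"] - months[m - 1]["cash_position"]
        net_flow = months[m]["inflows"] - months[m]["outflows"]
        scale = max(abs(net_flow), 1.0)  # avoid dividing by zero on flat months
        rel_err = abs(delta - net_flow) / scale
        if rel_err > rel_tol and not months[m].get("event"):
            breaks.append(m)
    return breaks

# Illustrative longitudinal record for one household.
sample_months = [
    {"cash_position": 10_000.0, "inflows": 0.0, "outflows": 0.0},
    {"cash_position": 11_000.0, "inflows": 6_000.0, "outflows": 5_000.0},  # reconciles
    {"cash_position": 15_000.0, "inflows": 6_000.0, "outflows": 5_000.0},  # unexplained cash
    {"cash_position": 95_000.0, "inflows": 6_000.0, "outflows": 5_000.0,
     "event": "inheritance"},  # documented one-time event, so not flagged
]

flagged_months = cash_flow_breaks(sample_months)
print(flagged_months)
```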
Dimension 2: distributional fidelity
Does the joint distribution of fields match the population the dataset claims to model?
Single-field histograms are easy. Joint distributions across pairs of fields are harder. Joint distributions across three or more fields are where most synthetic datasets fall apart.
Test query 3. Histogram (age_band × net_worth_band × state) and compare against a public reference (FRB SCF, IRS SOI). The KL divergence should be under 0.15 for a well-tuned dataset; over 0.4 means the dataset has the wrong demographic shape and any model trained on it will inherit that shape.
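A rough sketch of test query 3 follows. It assumes the reference distribution has already been reduced to cell probabilities over the same bands, computes KL(synthetic || reference) in nats, and floors empty reference cells at a small epsilon. The direction, the log base, and the flooring are all assumptions of this sketch; the rubric does not specify them.

```python
import math
from collections import Counter

def joint_distribution(records, keys):
    """Empirical joint distribution over the given fields, as {cell: probability}."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}

def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q) in nats; cells missing from q are floored at epsilon."""
    return sum(
        pv * math.log(pv / max(q.get(cell, 0.0), epsilon))
        for cell, pv in p.items()
        if pv > 0
    )

KEYS = ("age_band", "net_worth_band", "state")

# Illustrative data only; a real reference would be derived from SCF or SOI tables.
synthetic_records = (
    [{"age_band": "35-44", "net_worth_band": "100k-500k", "state": "TX"}] * 3
    + [{"age_band": "65+", "net_worth_band": "1m+", "state": "FL"}]
)
reference_dist = {
    ("35-44", "100k-500k", "TX"): 0.5,
    ("65+", "1m+", "FL"): 0.5,
}

synthetic_dist = joint_distribution(synthetic_records, KEYS)
kl_score = kl_divergence(synthetic_dist, reference_dist)
print(f"KL divergence: {kl_score:.4f}")
```

With many sparse cells, the epsilon floor can dominate the score, so the banding granularity deserves as much scrutiny as the threshold itself.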
Test query 4. Compute the conditional distribution P(account_type | age × income). A 65-year-old with a $200K income should have a 70%+ probability of a 401(k) or IRA. A 24-year-old with a $40K income should have a 30%+ probability of having no retirement account. Datasets that get conditional distributions wrong have not been calibrated against household survey data, full stop.
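One way to sketch test query 4 is a plain frequency estimate of the conditional probability. The record layout, the age and income bands, and the account-type labels below are hypothetical stand-ins for whatever the corpus actually uses.

```python
def conditional_share(records, condition, outcome):
    """Estimate P(outcome | condition) as a frequency; None if no record matches the condition."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None
    return sum(1 for r in matching if outcome(r)) / len(matching)

RETIREMENT_ACCOUNTS = {"401k", "ira"}  # assumed labels

def near_65_high_income(r):
    # Assumed banding around "65-year-old with a $200K income".
    return 60 <= r["age"] <= 69 and 150_000 <= r["income"] <= 250_000

def has_retirement_account(r):
    return bool(r["account_types"] & RETIREMENT_ACCOUNTS)

# Illustrative records only.
sample_records = [
    {"age": 65, "income": 200_000, "account_types": {"401k", "brokerage"}},
    {"age": 66, "income": 210_000, "account_types": {"ira"}},
    {"age": 64, "income": 195_000, "account_types": {"checking"}},
    {"age": 24, "income": 40_000, "account_types": {"checking"}},
]

retirement_share = conditional_share(
    sample_records, near_65_high_income, has_retirement_account
)
print(f"P(retirement account | ~65, ~$200K) = {retirement_share:.2f}")
```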
Dimension 3: edge-case coverage
Does the dataset include the edge cases your engine has to handle?
Almost every catastrophic failure of a wealth-tech engine in production is on an edge case the test corpus didn't cover. Multi-state filers with apportionment rules. UHNW households with $1M+ capital-loss carryforwards. ZIP-level sales tax within a single state, including the states with no statewide sales tax (Alaska, Delaware, Montana, New Hampshire, Oregon) and the local-tax surprises of Alabama and Louisiana. NIIT triggers at MAGI > $250K MFJ. IRMAA bracket transitions on a one-time event year.
Test query 5. Count households where state-of-residence at end of year ≠ state-of-residence at start of year, broken down by reason (move, retire, college). Should be 2–4% of any general population dataset. If it's < 0.5%, the dataset is missing the move-year filer scenario entirely.
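Test query 5 reduces to a count and a breakdown. The sketch below assumes hypothetical `state_start`, `state_end`, and `move_reason` fields on each household record.

```python
from collections import Counter

def mover_breakdown(households):
    """Share of households whose state of residence changed during the year,
    plus a count of movers by stated reason."""
    movers = [h for h in households if h["state_start"] != h["state_end"]]
    share = len(movers) / len(households) if households else 0.0
    reasons = Counter(h.get("move_reason", "unknown") for h in movers)
    return share, reasons

# Illustrative population: 97 stayers, 3 movers.
sample_population = (
    [{"state_start": "CA", "state_end": "CA"}] * 97
    + [
        {"state_start": "CA", "state_end": "TX", "move_reason": "move"},
        {"state_start": "NY", "state_end": "FL", "move_reason": "retire"},
        {"state_start": "IL", "state_end": "MA"},  # no reason recorded
    ]
)

mover_share, mover_reasons = mover_breakdown(sample_population)
print(f"{mover_share:.1%} movers, by reason: {dict(mover_reasons)}")
```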
Test query 6. Count households with at least one of: capital loss carryforward > $500K; QBI deduction with W-2 wage limitation binding; foreign tax credit > $0; AMT exposure; QSBS holding. These are not unusual in a representative wealth-tech corpus: they should collectively appear in 8–15% of records. Below 5% means the long tail is not represented. See crypto DeFi tax engine and insurance illustration engine edge cases for parallel coverage problems.
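A sketch of test query 6, assuming the corpus exposes one field per edge case. Every field name in `EDGE_CASE_CHECKS` is hypothetical; in practice several of these flags would have to be derived from the underlying tax fields.

```python
# Hypothetical field names; each check returns True when the edge case is present.
EDGE_CASE_CHECKS = {
    "large_loss_carryforward": lambda h: h.get("capital_loss_carryforward", 0) > 500_000,
    "qbi_wage_limit_binding": lambda h: bool(h.get("qbi_wage_limit_binding", False)),
    "foreign_tax_credit": lambda h: h.get("foreign_tax_credit", 0) > 0,
    "amt_exposure": lambda h: bool(h.get("amt_exposure", False)),
    "qsbs_holding": lambda h: bool(h.get("qsbs_holding", False)),
}

def edge_case_coverage(households):
    """Per-case counts plus the share of households with at least one edge case."""
    per_case = {
        name: sum(1 for h in households if check(h))
        for name, check in EDGE_CASE_CHECKS.items()
    }
    with_any = sum(
        1 for h in households
        if any(check(h) for check in EDGE_CASE_CHECKS.values())
    )
    any_share = with_any / len(households) if households else 0.0
    return per_case, any_share

# Illustrative population of 20 households, 2 with edge cases.
sample_corpus = [{} for _ in range(18)] + [
    {"capital_loss_carryforward": 750_000, "amt_exposure": True},
    {"foreign_tax_credit": 1_200},
]

case_counts, any_edge_share = edge_case_coverage(sample_corpus)
print(case_counts, f"{any_edge_share:.0%}")
```

Counting per case as well as collectively matters: a corpus can hit the collective range on the strength of one over-represented case while the others sit at zero.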
Dimension 4: longitudinal continuity
Do the temporal records hold together across snapshots?
Annual data hides cash-flow seasonality. Monthly data exposes it — and exposes the fabrications that single-call generators introduce when they have to produce 50+ snapshots per household.
- Failure mode 1: Seam discontinuities. Month 32 ending balance ≠ month 33 starting balance. Caused by chunked generation pipelines that don't enforce the seam invariant.
- Failure mode 2: Magic money. Cash position increases without a corresponding inflow line. The most common LLM-pipeline bug.
- Failure mode 3: Identity drift. Net worth identity holds at month 1 but drifts past month 24. Means the validator only checked endpoints.
- Failure mode 4: Rate-of-change anomalies. Account balances changing at 3-sigma+ rates with no event flag. Real households rarely move 30% of net worth in a month; synthetic ones do, depressingly often.
Test query 7. For every month boundary, compute the absolute difference between closing_balance[m] and opening_balance[m+1] for every account. Should be exactly zero (or within FP epsilon) for non-transfer accounts. A vendor that ships seam mismatches is shipping a dataset where time-series logic will break in non-deterministic ways downstream.
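The seam check in test query 7 can be sketched as follows, assuming each account is a list of monthly dicts with hypothetical `opening_balance` and `closing_balance` fields. The half-cent epsilon is an assumption standing in for "FP epsilon", and month indices are zero-based.

```python
def seam_mismatches(accounts, epsilon=0.005):
    """Return (account_id, month_index, gap) for every month boundary where
    closing_balance[m] and opening_balance[m + 1] differ by more than epsilon."""
    mismatches = []
    for account_id, months in accounts.items():
        for m in range(len(months) - 1):
            gap = abs(months[m]["closing_balance"] - months[m + 1]["opening_balance"])
            if gap > epsilon:
                mismatches.append((account_id, m, gap))
    return mismatches

# Illustrative accounts; the second has a broken seam between months 1 and 2.
sample_accounts = {
    "checking-001": [
        {"opening_balance": 1_000.00, "closing_balance": 1_250.00},
        {"opening_balance": 1_250.00, "closing_balance": 900.00},
        {"opening_balance": 900.00, "closing_balance": 1_100.00},
    ],
    "brokerage-002": [
        {"opening_balance": 50_000.00, "closing_balance": 51_000.00},
        {"opening_balance": 51_000.00, "closing_balance": 52_500.00},
        {"opening_balance": 53_000.00, "closing_balance": 53_200.00},
    ],
}

seam_breaks = seam_mismatches(sample_accounts)
print(seam_breaks)
```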
Test query 8. Histogram month-over-month percent change in net worth. The vast majority of values should fall in ±5%. Excursions beyond ±25% should correlate with explicit life events in the household record. Unexplained 25%+ excursions are generator hallucinations, not realistic outliers.
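A sketch of the excursion check in test query 8. It assumes a per-household list of monthly net-worth values and a hypothetical `events` mapping from month index to a documented life event; months following a zero net worth are skipped because the percent change is undefined there.

```python
def unexplained_excursions(net_worth, events, threshold=0.25):
    """Return (month_index, pct_change) for month-over-month net-worth moves
    beyond the threshold that have no documented life event."""
    flagged = []
    for m in range(1, len(net_worth)):
        prev = net_worth[m - 1]
        if prev == 0:
            continue  # percent change undefined
        pct = (net_worth[m] - prev) / abs(prev)
        if abs(pct) > threshold and m not in events:
            flagged.append((m, pct))
    return flagged

# Illustrative series: month 2 jumps with no event, month 4 jumps with one.
sample_net_worth = [100_000.0, 102_000.0, 140_000.0, 141_000.0, 190_000.0]
sample_events = {4: "inheritance"}

excursions = unexplained_excursions(sample_net_worth, sample_events)
print([(m, round(pct, 3)) for m, pct in excursions])
```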
Dimension 5: provenance and reproducibility
Can the dataset be regenerated, audited, and traced?
This dimension is what separates a research artifact from a production-grade product. A synthetic dataset that cannot be reproduced from a versioned manifest fails any serious model-risk review (SR 11-7, OCC 2011-12, and their equivalents in the EU and UK). A synthetic dataset whose generation pipeline cannot be inspected fails any algorithmic-fairness audit. See SR 11-7 model risk management for the regulator-facing checklist.
| | Reproducible | Inspectable | Audit-ready |
|---|---|---|---|
| Versioned prompts + seeds | Yes — bit-for-bit | Yes | Yes |
| Versioned validators | Yes | Yes | Yes |
| Black-box generation API | Conditional — only if the API is versioned | No | Risky |
| Untracked manual edits | No | No | Disqualifying |
Test query 9. Ask the vendor for the manifest of the corpus you're considering. A real manifest has: model version, prompt version hashes per stage, validator version, random seed, archetype distribution, generation date. If the vendor cannot produce it within 24 hours, the corpus was probably hand-tweaked at some point and the reproducibility story is not real.
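Once the vendor does send a manifest, a completeness check is easy to automate. The sketch below assumes the manifest arrives as a dict (for example, parsed from JSON); the key names are hypothetical renderings of the six items listed above, not a vendor schema.

```python
# Hypothetical key names for the six manifest items named in test query 9.
REQUIRED_MANIFEST_FIELDS = (
    "model_version",
    "prompt_version_hashes",
    "validator_version",
    "random_seed",
    "archetype_distribution",
    "generation_date",
)

def missing_manifest_fields(manifest):
    """Return required fields that are absent or empty. A seed of 0 counts as present."""
    missing = []
    for field in REQUIRED_MANIFEST_FIELDS:
        value = manifest.get(field)
        if value is None or value == "" or value == {} or value == []:
            missing.append(field)
    return missing

# Illustrative manifest with two gaps.
sample_manifest = {
    "model_version": "gen-model-2.3",
    "prompt_version_hashes": {},          # empty, so treated as missing
    "validator_version": "1.8.0",
    "random_seed": 0,
    "generation_date": "2025-01-15",
}

manifest_gaps = missing_manifest_fields(sample_manifest)
print(manifest_gaps)
```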
Test query 10. Pick a single household from the corpus. Ask the vendor: "Show me the prompt, the model output, and the validator output for this record." A vendor that has the artifacts can produce them in minutes. A vendor that has to "go check" is admitting they don't keep them.
Putting the rubric together
The five dimensions are not equally weighted. Structural integrity (Dimension 1) is a hard gate: fail it and nothing else matters, because the dataset will produce wrong results before any model touches it. The other four are soft gates whose weighting depends on the use case. See privacy utility fidelity tradeoffs for the upstream design decision.
suitability = is_pass(structural) × (0.30 × distributional + 0.25 × edge_case + 0.25 × longitudinal + 0.20 × provenance)
- is_pass(structural) = 1 if structural identities hold across 99.9%+ of records, 0 otherwise
- distributional = 0–1, where 1.0 = KL divergence < 0.15 against public reference data
- edge_case = 0–1, where 1.0 = at least 8 of 10 named edge cases present at population-realistic frequency
- longitudinal = 0–1, where 1.0 = all four longitudinal failure modes absent
- provenance = 0–1, where 1.0 = full manifest, versioned prompts, on-demand artifacts
A vendor that scores 1.0 on structural, 0.85 distributional, 0.6 edge-case, 0.9 longitudinal, 0.7 provenance:
1.0 × (0.30×0.85 + 0.25×0.6 + 0.25×0.9 + 0.20×0.7) = 0.77
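The scoring arithmetic is small enough to put in a scorecard script. The sketch below uses the weights from the formula and the score bands this section assigns; the function and label names are illustrative, and the handling of a score of exactly 0.75 (placed in the conditional band) is an assumption.

```python
WEIGHTS = {
    "distributional": 0.30,
    "edge_case": 0.25,
    "longitudinal": 0.25,
    "provenance": 0.20,
}

def suitability(structural_pass, scores):
    """Weighted soft-gate score, zeroed out when the structural hard gate fails."""
    if not structural_pass:
        return 0.0
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def band(score):
    """Map a suitability score to the procurement bands used in this section."""
    if score > 0.75:
        return "production-grade"
    if score < 0.6:
        return "fails procurement"
    return "conditionally usable"

# The worked example from the text.
vendor_scores = {
    "distributional": 0.85,
    "edge_case": 0.6,
    "longitudinal": 0.9,
    "provenance": 0.7,
}

vendor_suitability = suitability(True, vendor_scores)
print(f"{vendor_suitability:.2f} -> {band(vendor_suitability)}")
```

Because the structural gate multiplies the whole sum, a vendor with perfect soft-gate scores and a failed identity check still scores zero, which is the intended behavior.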
Above 0.75 is production-grade for general use; below 0.6 fails procurement; the 0.6–0.75 band is conditionally usable for the use cases that match the strong dimensions.
What buyers do with this in practice
The procurement teams we've handed this rubric to use it three ways. First, as a vendor scorecard during evaluation: score every candidate and hand the spreadsheet to the architecture review board. Second, as an internal QA artifact once a vendor is selected: re-run the ten queries on every refresh of the corpus and track the scores over time. Third, as the basis for a written sign-off when the dataset enters production, which model-risk and audit teams universally appreciate. See how synthetic data shortens QA cycles for the downstream benefit.
The rubric is not perfect. Use cases that depend on adversarial signal (fraud detection, suspicious-activity monitoring) need an additional dimension we haven't covered here. See 12 transaction archetypes for fintech testing and training fraud detection with synthetic. Use cases that depend on demographic-conditional fairness need an explicit fair-lending audit dimension. Adapt accordingly.
What the rubric is not is optional. A team that adopts synthetic financial data without a formal evaluation framework is going to discover the failure modes the framework would have caught, except they'll discover them in production, six months in, after the model has already been signed off.
Key takeaways
- Vendor data sheets do not survive procurement review. A five-dimension rubric does. See [what is synthetic financial data](/articles/synthetic-financial-data-primer) for the upstream primer.
- Structural integrity is a hard gate. Distributional fidelity, edge-case coverage, longitudinal continuity, and provenance are weighted soft gates.
- Pairwise correlation scores are insufficient — three-way joint distributions are where most synthetic datasets fail.
- Edge cases are the most common point of catastrophic failure in production. Score the dataset against the edge cases your engine actually has to handle. See [10 edge cases your test corpus must include](/articles/10-edge-cases-wealth-test-corpus) for the inventory.
- If the vendor cannot produce a per-record prompt + output + validator artifact, the reproducibility story is not real.