Pre-Launch Synthetic-Data Fidelity QA Checklist
Synthetic data is only as useful as its fidelity to the real-world distribution your feature will encounter in production. A corpus that passes a smoke test can still mis-train models, produce demo screenshots that look more realistic than the data behind them, and clear engineering gates only to fail when real customers arrive. This checklist is the gate to run before you commit a feature to production against synthetic test data. It is built from the failure modes we've seen most often in pre-launch reviews.
Arithmetic & accounting invariants
- Net-worth identity holds for every household
For every household, total assets minus total liabilities equals reported net worth within $1. Drift here means the corpus has additive errors that compound across features.
assert(abs(assets.total - liabilities.total - net_worth) < 1.0)
- Account-level sums tie to portfolio totals
The sum of lots in each account ties to the account balance, and the sum of accounts ties to the household portfolio total. A one-cent discrepancy, for example from rounding values that passed through scientific notation, is a finding.
assert(sum(lots[].current_value) == account.balance)
- Cash-flow identity over the longitudinal window
Net change in net worth across the 96-month window equals cumulative income − cumulative expenses + cumulative investment returns − cumulative taxes. Anything else is a generation bug.
assert(snapshot[95].nw - snapshot[0].nw == sum(income) - sum(expense) + sum(returns) - sum(tax))
- DTI and LTV ratios reconcile to line items
Debt-to-income and loan-to-value ratios are recomputable from the underlying line items. Pre-computed ratios that don't match recomputed ones indicate a stale snapshot.
assert(abs(dti_ratio - debt_payments_annual / income_annual) < tolerance)
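The four invariants above can be sketched as small Python predicates. This is a minimal illustration, not the checklist's reference implementation: the `Account` and `Household` classes, every field name, and the DTI tolerance of 0.0001 are assumptions made for the example. Money is held as `Decimal` so that cent-level comparisons are exact rather than subject to binary floating-point drift.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Account:  # hypothetical record shape
    balance: Decimal
    lot_values: list = field(default_factory=list)  # current value of each lot


@dataclass
class Household:  # hypothetical record shape
    total_assets: Decimal
    total_liabilities: Decimal
    net_worth: Decimal
    accounts: list = field(default_factory=list)


def net_worth_ok(h, tol=Decimal("1.00")):
    """Assets minus liabilities equals reported net worth within `tol` dollars."""
    return abs(h.total_assets - h.total_liabilities - h.net_worth) < tol


def lots_tie_out(account):
    """Lot values sum to the account balance exactly, to the cent."""
    return sum(account.lot_values, Decimal("0")) == account.balance


def cash_flow_ok(nw_first, nw_last, income, expenses, returns, taxes):
    """Change in net worth equals income - expenses + returns - taxes."""
    zero = Decimal("0")
    expected = (sum(income, zero) - sum(expenses, zero)
                + sum(returns, zero) - sum(taxes, zero))
    return nw_last - nw_first == expected


def dti_ok(reported, debt_payments_annual, income_annual, tol=Decimal("0.0001")):
    """Reported DTI matches the ratio recomputed from line items within `tol`."""
    if income_annual == 0:
        return False  # ratio is undefined; treat it as a finding
    return abs(reported - debt_payments_annual / income_annual) <= tol
```

Running each predicate over every household and collecting the failures, rather than stopping at the first one, makes it easier to tell a systematic generation bug from a one-off.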
Demographic & income plausibility
- Age × income joint distribution
Plot age vs. income for the corpus and overlay the BLS / ACS reference distribution. Synthetic corpora that flatten the age-income relationship produce features that work well for a 35-year-old and silently fail for a 65-year-old.
plot_age_income(corpus, overlay=BLS_2024)
- State distribution reflects target market
If your buyer is national, the state distribution should reflect ACS population weights (CA, TX, FL, NY each ~6-12%). If the buyer is regional, the state mix should reflect the regional target.
state_dist == target_market_dist within tolerance
- Occupation × industry × income reasonableness
Occupations should match industries (no nuclear-engineer / hospitality-industry pairings), and income should be consistent with the median for that occupation in the BLS Occupational Employment and Wage Statistics (OEWS, formerly OES) data.
validate(occupation, industry, income) against BLS_OES
- Marital status × dependent count × age
Single 25-year-olds rarely have 4 dependents. Married 60-year-olds rarely have toddler dependents. Joint plausibility checks catch generation bugs that field-level checks miss.
validate_demographic_plausibility()
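Two of the checks in this section lend themselves to a short sketch: comparing the state mix to a target within a tolerance, and flagging jointly implausible households. The function names, the dict-based record shape, and the age and dependent cut-offs below are all assumptions chosen to mirror the examples in the text. They are not calibrated thresholds, and because the text says "rarely", the output is best read as warnings to review rather than hard failures.

```python
from collections import Counter


def max_share_gap(observed_states, target_shares):
    """Largest absolute gap between observed and target state shares.

    `observed_states` is one state code per household (must be non-empty);
    `target_shares` maps state code to its expected fraction of the corpus.
    """
    counts = Counter(observed_states)
    total = len(observed_states)
    states = set(counts) | set(target_shares)
    return max(abs(counts[s] / total - target_shares.get(s, 0.0)) for s in states)


def plausibility_flags(household):
    """Return joint-plausibility warnings for one household record (a dict)."""
    flags = []
    age = household["age"]
    status = household["marital_status"]
    dependent_ages = household["dependent_ages"]
    # Illustrative cut-offs only; tune them against reference data.
    if status == "single" and age <= 25 and len(dependent_ages) >= 4:
        flags.append("young single filer with 4+ dependents")
    if status == "married" and age >= 60 and any(a <= 3 for a in dependent_ages):
        flags.append("married, age 60+, with toddler dependent")
    return flags
```

A corpus would pass the state check when `max_share_gap(...)` is at or below whatever tolerance the team has agreed on. The rate of flagged households can then be compared with the rate expected in the reference population.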
Longitudinal coherence (96-month invariants)
- Within-year cash-flow seasonality present
Real cash flow is seasonal: bonus receipts (Q1, Q4), tax payments (Q2), holiday spending (Q4), summer expenses (Q3). Flat monthly cash flow is a telltale sign that quarterly or annual figures were generated first and then divided evenly across months.
validate(cf_monthly, expected_seasonality_curve)
- Asset-allocation drift over time
Across 96 months, asset allocation should drift with market movement and life-event reallocations. Static allocation across all 96 snapshots is a generation bug.
validate(asset_allocation_drift_variance > min_threshold)
- Major-event continuity
Job changes, relocations, marriages, births, retirements should propagate through subsequent snapshots — not appear and revert. Most longitudinal generation bugs reveal themselves here.
validate(life_events_persist_forward())
- RMD start triggers correctly at age 73
For households crossing age 73 in the longitudinal window, RMDs must start in the correct year (SECURE 2.0) and amounts must reconcile with the Uniform Lifetime Table.
validate(rmd_start_age == 73 && rmd_amount == ult_lookup(age, balance))
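The longitudinal checks can be approximated with a few helpers, sketched here under stated assumptions. `yearly_cv` measures how much monthly cash flow varies within each year (a value near zero suggests evenly divided annual figures). `find_reverts` looks for a field that changes and then snaps back within a few snapshots. `expected_rmd` divides the prior year-end balance by a distribution period. All three names, the `max_gap` default, and the input shapes are hypothetical. The distribution-period table is deliberately passed in by the caller so that it can be loaded from the published Uniform Lifetime Table rather than hard-coded here.

```python
import statistics


def yearly_cv(monthly_cash_flow):
    """Coefficient of variation of cash flow within each full 12-month block."""
    cvs = []
    for start in range(0, len(monthly_cash_flow) - 11, 12):
        year = monthly_cash_flow[start:start + 12]
        mean = statistics.fmean(year)
        spread = statistics.pstdev(year)
        cvs.append(spread / abs(mean) if mean else float("inf"))
    return cvs


def find_reverts(values, max_gap=3):
    """Indices where a field changes, then returns to its previous value
    within `max_gap` later snapshots (an "appear and revert" event)."""
    reverts = []
    for i in range(1, len(values)):
        changed = values[i] != values[i - 1]
        if changed and values[i - 1] in values[i + 1:i + 1 + max_gap]:
            reverts.append(i)
    return reverts


def expected_rmd(prior_year_end_balance, age, distribution_periods):
    """RMD = prior year-end balance / distribution period for `age`.

    `distribution_periods` maps age to the period from the Uniform
    Lifetime Table; the caller supplies it.
    """
    return prior_year_end_balance / distribution_periods[age]
```

For a 96-month series, `yearly_cv` returns eight values. A household whose state of residence reads TX for five months, CA for one, and TX again would be reported by `find_reverts` at the index of the CA snapshot.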
Scenario coverage for the feature under test
- Edge-case archetypes are present, not just the modal one
Verify the corpus contains the edge cases your feature must handle — ITIN filers, multi-state residents, military with combat-pay exclusions, K-1 recipients with complex pass-through. Modal-only corpora produce features that crash on the long tail.
verify(coverage_of, edge_case_archetypes[])
- Volume per archetype is statistically meaningful
If you're claiming the feature works for archetype X, the corpus should have at least 30 households of archetype X (a common rule-of-thumb floor for reporting a mean). A single example of an archetype is demo data, not test data.
assert(min(archetype_counts) >= 30)
- Distress / negative scenarios represented
Real customer populations include distressed households — bankruptcy, default, divorce, job-loss-then-recovery. A corpus with no distress scenarios produces models that fail under stress.
verify(distress_scenario_count >= target_threshold)
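Presence and volume can be checked in one pass, as in this sketch. It assumes each household carries a single archetype label, which is a simplification: a real corpus may tag a household with several. The function name and the label strings used below are illustrative only.

```python
from collections import Counter


def coverage_gaps(household_archetypes, required_archetypes, min_count=30):
    """Map each under-represented required archetype to its observed count.

    An archetype that is missing entirely shows up with a count of 0.
    """
    counts = Counter(household_archetypes)
    return {name: counts[name]
            for name in required_archetypes
            if counts[name] < min_count}
```

For example, a corpus with 100 `"w2"` households, 30 `"k1"`, and one `"itin"`, checked against the required list `["w2", "k1", "itin", "military"]`, yields `{"itin": 1, "military": 0}`. An empty result means every required archetype meets the floor. Distress scenarios can be checked the same way by adding them to the required list with their own threshold.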
Privacy & determinism gates
- Re-identification risk under presumed-known-fields attack
For each presumed-known-fields scenario (5-digit ZIP + age + income band, or LinkedIn-equivalent profile), no synthetic record should be uniquely identifiable. Sweeney-test the corpus.
validate(no_unique_records(presumed_known_fields))
- Determinism: same seed → same corpus
Regenerate the corpus from the same seed and confirm bit-identical output. Non-determinism breaks regression testing and prevents reproducible algorithm benchmarks.
assert(generate(seed=N) == generate(seed=N))
- No accidental real-name overlap
Spot-check a sample of synthetic names against a known-real-person list (publicly known executives, public-company employees). Overlap suggests the generator pulled from a real-name source.
validate(corpus_names ∩ real_name_blocklist == ∅)
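The three gates in this section can be sketched as follows. `unique_records` finds records that are the only ones with a given combination of presumed-known fields. `corpus_digest` hashes a canonical JSON serialization so that two generation runs can be compared cheaply. `name_overlap` intersects names with a blocklist after lowercasing. Everything here is an assumption for illustration: the function names, the field names, and especially `toy_generate`, which merely stands in for a real generator. Comparing digests of a JSON dump is also a proxy for bit-identical output, not a byte-level comparison of the generated files.

```python
import hashlib
import json
import random
from collections import Counter


def unique_records(records, quasi_identifiers):
    """Records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[f] for f in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [r for r, k in zip(records, keys) if counts[k] == 1]


def corpus_digest(records):
    """SHA-256 of a canonical JSON dump, for comparing two generation runs."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def toy_generate(seed, n=5):
    """Stand-in generator: all randomness flows from one seeded RNG."""
    rng = random.Random(seed)
    return [{"zip": f"{rng.randrange(100000):05d}", "age": rng.randint(18, 90)}
            for _ in range(n)]


def name_overlap(corpus_names, real_name_blocklist):
    """Lowercased names that appear in both the corpus and the blocklist."""
    return ({n.lower() for n in corpus_names}
            & {n.lower() for n in real_name_blocklist})
```

The determinism gate then reduces to `corpus_digest(toy_generate(seed)) == corpus_digest(toy_generate(seed))`. The design point the toy generator illustrates is that every random draw comes from a single seeded `random.Random` instance rather than from module-level or time-based state.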
Key takeaways
- Arithmetic invariants are the baseline — net-worth identity, account-level sums, cash-flow identity. A corpus that fails these is unfit regardless of how rich the surface looks.
- Plausibility checks must be joint, not field-level. Field-level checks miss the 'nuclear engineer in hospitality' joint-distribution bugs that destroy ML model utility.
- Edge-case coverage trumps volume. A 1,000-household corpus that contains only W-2 salaried, single-state residents produces features that crash on K-1 + multi-state customers in week one of production.
- Determinism (same seed → same corpus) is what makes regression testing real. Non-deterministic synthetic data is a developer-experience tax that compounds.