Methodology

How Synthetic Households Are Generated, Validated, and Refreshed

Published May 7, 2026

Every household in the WealthSchema catalog passes through the same generation pipeline: archetype assignment, overlay application, longitudinal projection, consistency validation, and LLM-as-judge quality gating. The pipeline is deterministic given a seed, calibrated against authoritative external sources, and refreshed periodically to track regulatory changes. The output is a structured JSON profile per household, with no rendered documents and no LLM-authored narrative prose. This document walks through each stage in detail for three audiences: compliance teams evaluating the data for examiner use, engineering teams integrating the data into their pipelines, and academic researchers citing the corpus in published work.

Archetype-driven generation

Every household begins as an archetype assignment. The 71 archetypes in the catalog (documented at /archetypes) are the foundational personas — each archetype defines age range, wealth tier, household structure, and the canonical financial patterns that distinguish it from other archetypes. A new graduate tech worker has student loans, RSU vesting, and beginning 401k participation; a widowed HNW spouse has consolidated estate structures, recent post-mortem account retitling, and Medicare-tier exposure. The archetype carries the structural skeleton; later stages flesh out the specifics.

Archetype assignment is bundle-aware. Each bundle (B01-B30) maps to a curated set of archetypes whose financial patterns exercise the bundle's specific use case. The B01 Reg BI Suitability bundle weights its 130 households toward the senior cohorts (RL-01, RL-02, H-04) because that's where Reg BI examination focus actually sits. The B16 Equity Compensation bundle weights toward the tech-employee and senior-executive archetypes (A-06, P-01, P-03) because those are the equity-comp-heavy populations.
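As a rough illustration of bundle-aware assignment, the sketch below draws archetypes for a bundle from a weighted table using a seeded sampler. The archetype codes for B01 come from the text above, but the numeric weights, the extra A-06 entry, and the names BUNDLE_ARCHETYPE_WEIGHTS and assign_archetypes are assumptions made for the example; the real weighting scheme is not published here.

```python
import random

# Hypothetical weights. The text names the favoured B01 archetypes but gives
# no numbers, so these values (and the A-06 entry) are placeholders.
BUNDLE_ARCHETYPE_WEIGHTS = {
    "B01": {"RL-01": 3.0, "RL-02": 3.0, "H-04": 3.0, "A-06": 1.0},
}


def assign_archetypes(bundle_id: str, n_households: int, seed: int) -> list[str]:
    """Draw one archetype per household, reproducibly for a given seed."""
    weights = BUNDLE_ARCHETYPE_WEIGHTS[bundle_id]
    names = sorted(weights)        # fixed ordering keeps the draws stable
    rng = random.Random(seed)      # local RNG, no reliance on global state
    return rng.choices(names, weights=[weights[n] for n in names], k=n_households)


assignments = assign_archetypes("B01", 130, seed=42)
```

Calling assign_archetypes again with the same bundle and seed returns the same list, which is the property the pipeline relies on.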

The net effect is that the corpus over-indexes on the mass-affluent, affluent, and HNW tiers — and under-indexes on the mass-market tier where wealth-tech engineering work largely isn't happening. We are explicit about this. A corpus calibrated to match the population would spend most of its records on households whose primary financial questions are checking-account overdrafts and credit-card balances; that corpus would be the wrong product for the firms building tax-loss-harvesting algorithms, Reg BI suitability monitors, and equity-comp tax engines.

Legend: corpus (1,451 households); US households (SCF 2022, approximate).

Share of households at each net-worth tier in the corpus (green) vs. the US population per the Federal Reserve Survey of Consumer Finances 2022 (grey, faint). We oversample the tiers above $100k because that's where wealth-tech operates: direct-indexing platforms, equity-comp tax engines, Reg BI suitability monitors, and concentration-risk models all target households above the $100k break. A full distributional match against SCF would be the wrong product: the test data would be dominated by households whose primary financial pattern is checking-account overdraft, and the engines our buyers are building would never get exercised. SCF reference values are rounded; tier breaks ($100k / $1M / $5M / $30M) mirror the corpus's wealth-tier schema.
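The four tier breaks in the caption can be turned into a lookup in a few lines. This is a sketch only: the break values are from the caption, while the tier labels, their mapping to bands, and the choice to treat each break as the inclusive lower bound of the next tier are assumptions.

```python
from bisect import bisect_right

# Break values taken from the caption above.
TIER_BREAKS = [100_000, 1_000_000, 5_000_000, 30_000_000]

# Assumed labels: the article names tiers but does not say which label
# belongs to which band, and "UHNW" is a placeholder for the top band.
TIER_LABELS = ["mass-market", "mass-affluent", "affluent", "HNW", "UHNW"]


def wealth_tier(net_worth: float) -> str:
    """Return the label of the band that contains net_worth."""
    return TIER_LABELS[bisect_right(TIER_BREAKS, net_worth)]
```

Under these assumed labels, a household with $1.18M net worth lands in the third band.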

Overlay application

The archetype skeleton runs through up to six bundle-specific overlay stages, each adding the structured fields that make the household useful for its bundle's use case. A household assigned to B16 Equity Compensation gets the structured grant data (grant_type, vesting_schedule, vested_to_date), AMT exposure calculation, and 83(b) election history. The same household, also tagged for B22 Multi-State Tax, gets the source-state-allocation history and convenience-of-employer flag.

Overlay application is deterministic — given the same archetype and the same seed, the same overlay output is produced. Overlays are composable; a household carrying overlays from multiple bundles gets the union of structured fields, with cross-bundle reconciliation handled by the consistency validator (next stage).
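One way to get both properties, determinism and composability, is to give each overlay its own seeded random stream and merge the returned fields. The sketch below assumes that design; the function names, the "grants" wrapper, the convenience_of_employer key, and all field values are hypothetical, and only grant_type, vesting_schedule, and vested_to_date are taken from the text.

```python
import random
from typing import Callable

# An overlay takes the record so far plus a seeded RNG and returns new fields.
Overlay = Callable[[dict, random.Random], dict]


def equity_comp_overlay(record: dict, rng: random.Random) -> dict:
    # Field names inside the grant come from the text; values are placeholders.
    return {"grants": [{
        "grant_type": rng.choice(["RSU", "ISO", "NSO"]),
        "vesting_schedule": "48-month, 12-month cliff",
        "vested_to_date": rng.randint(0, 4000),
    }]}


def multi_state_overlay(record: dict, rng: random.Random) -> dict:
    # Hypothetical key for the convenience-of-employer flag.
    return {"convenience_of_employer": rng.random() < 0.5}


OVERLAYS: dict[str, Overlay] = {"B16": equity_comp_overlay, "B22": multi_state_overlay}


def apply_overlays(skeleton: dict, bundles: list[str], seed: int) -> dict:
    """Return the skeleton plus the union of fields from each bundle's overlay."""
    record = dict(skeleton)
    for bundle_id in sorted(bundles):  # fixed order, independent of caller's list
        # One RNG per (seed, bundle): adding a bundle never shifts another's draws.
        rng = random.Random(f"{seed}:{bundle_id}")
        fields = OVERLAYS[bundle_id](record, rng)
        clash = fields.keys() & record.keys()
        if clash:
            raise ValueError(f"overlay {bundle_id} would overwrite {sorted(clash)}")
        record.update(fields)
    return record
```

Seeding per bundle means a household tagged for B16 alone carries the same grant data as the same household tagged for both B16 and B22. Reconciling values across bundles is left to the validator, as the text describes.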

Longitudinal projection

Every household gets 96 monthly snapshots — eight years of longitudinal data covering income, expenses, balances, behavioral metrics, and life events. Longitudinal generation is not noise on top of a static balance sheet; it's a structural projection that respects archetype-specific income volatility, expense seasonality, and life-event-trigger probabilities.

Gig workers (F-02 archetype) have high month-to-month income volatility (~40% coefficient of variation), while salaried W-2 workers (F-01) have stable monthly income (~5% CV). Royalty-income artists (AR-01) follow seasonal payment cycles. Pre-retirees (R-01 through R-03) show catch-up-contribution patterns and benefit-decision-window events. The longitudinal contract is the same across all 1,451 corpus households: 96 sequential monthly snapshots indexed against month_0 (relative, not calendar-anchored).
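To make the coefficient-of-variation targets concrete, here is a minimal sketch that produces 96 relative-indexed income values at a chosen CV. It assumes a lognormal income distribution, which the article does not specify, and it ignores seasonality and life events entirely. The names monthly_income and sample_cv and the "month"/"income" keys are hypothetical.

```python
import math
import random
import statistics

N_MONTHS = 96  # eight years of monthly snapshots, per the text

# Target CVs quoted in the text for the two archetypes.
ARCHETYPE_INCOME_CV = {"F-01": 0.05, "F-02": 0.40}


def monthly_income(archetype: str, mean_income: float, seed: int) -> list[dict]:
    """Seeded lognormal draws whose population mean and CV match the targets."""
    cv = ARCHETYPE_INCOME_CV[archetype]
    sigma = math.sqrt(math.log(1.0 + cv ** 2))   # from CV^2 = exp(sigma^2) - 1
    mu = math.log(mean_income) - sigma ** 2 / 2  # so that E[X] = mean_income
    rng = random.Random(seed)
    return [
        {"month": f"month_{i}", "income": round(rng.lognormvariate(mu, sigma), 2)}
        for i in range(N_MONTHS)
    ]


def sample_cv(series: list[dict]) -> float:
    """Empirical coefficient of variation of one income series."""
    values = [s["income"] for s in series]
    return statistics.pstdev(values) / statistics.fmean(values)


gig = monthly_income("F-02", 6000.0, seed=1)
salaried = monthly_income("F-01", 6000.0, seed=1)
```

With only 96 draws the empirical CV will wander around the target rather than hit it exactly, so any calibration check on a single household should allow for sampling noise.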

Chart panels: net worth $1.18M; net income per month $9,547; cash balance $258K.

One representative household's 96-month trajectory — A-01-seed-1 (Young Family — First Home), the same record used in our longitudinal-snapshots-design article. Solid line = 60 months historical; dashed = 36 months projection. Every household in the corpus carries a series like this for each of ~20 monthly metrics, computed deterministically from the seeded sampler. We surface one record on this page rather than aggregate statistics across all 1,451 because the structural shape — the in-month volatility, the seasonality, the projection regime — is the methodology decision; the cross-corpus distributions appear in the per-bundle calibration reports.

Structured JSON is the only deliverable

WealthSchema ships structured JSON profiles — not rendered documents and not LLM-authored narrative prose. Every income source, account, transaction, beneficiary, grant, tax field, and longitudinal snapshot is present in the schema as typed JSON, ready for buyers to ingest directly or to render into their own document templates if a document representation is needed downstream.

The earlier generation pipeline did include a document-rendering stage that produced bank-statement, W-2, 1099, brokerage-statement, insurance-declaration, and 1040 PDFs alongside the JSON. That stage was permanently removed after an audit found the rendered artifacts inconsistent with the canonical JSON in non-trivial ways: closing balances 2.9×–22.4× off the underlying account totals, 1040s omitting spousal W-2 wages on MFJ households, and SSA-1099 amounts not matching the income-source schedule. Rather than ship a larger but internally inconsistent product, we kept the JSON profile, which is the single source of truth, and removed everything that disagreed with it. A future release may reintroduce documents under a 'correct-by-construction' architecture that renders them as pure projections of the JSON; until then, the JSON is the deliverable.

Consistency validation

Every household passes through a strict consistency validator before being admitted to the corpus. The validator checks cross-field consistency at multiple scales: within a single month (income flows reconcile with balance changes), within a single household (account balances sum to net worth, beneficiary designations align with will/trust intent, equity-comp grants reconcile with current price), and across longitudinal snapshots (monthly transitions are mathematically continuous, life events propagate forward correctly).
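Two of the checks named above can be sketched in a few lines: account balances summing to net worth, and month-to-month continuity of a cash balance. This is a simplified illustration, not the validator itself. The record layout (accounts, net_worth, snapshots, cash_balance, net_income), the rule that liabilities are stored as negative balances, and the one-cent tolerance are all assumptions.

```python
def check_household(household: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of human-readable consistency errors (empty means pass)."""
    errors = []

    # Household scale: balances (liabilities negative) should sum to net worth.
    total = sum(account["balance"] for account in household["accounts"])
    if abs(total - household["net_worth"]) > tolerance:
        errors.append(f"net_worth {household['net_worth']} != account total {total}")

    # Longitudinal scale: each month's cash should equal the previous month's
    # cash plus this month's net income.
    snapshots = household["snapshots"]
    for i, (prev, cur) in enumerate(zip(snapshots, snapshots[1:]), start=1):
        expected = prev["cash_balance"] + cur["net_income"]
        if abs(cur["cash_balance"] - expected) > tolerance:
            errors.append(f"month_{i}: cash {cur['cash_balance']} != expected {expected}")

    return errors
```

Returning a list of messages rather than a boolean makes it easy to log why a household was rejected before it is regenerated.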

A household failing any validator check is rejected and re-generated with a different seed. The validator's strictness is one of the things that distinguishes WealthSchema from less-curated synthetic-data offerings — the validator catches the subtle inconsistencies that make naive synthetic data misleading in unexpected ways.
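The reject-and-regenerate rule can be read as a bounded retry loop. The sketch below assumes that the replacement seed is derived by incrementing a base seed and that there is an attempt cap; neither detail is stated in the text, and generate_until_valid is a hypothetical name.

```python
from typing import Callable


def generate_until_valid(
    generate: Callable[[int], dict],
    validate: Callable[[dict], list],
    base_seed: int,
    max_attempts: int = 10,
) -> tuple[dict, int]:
    """Try successive seeds until the validator reports no errors."""
    for attempt in range(max_attempts):
        seed = base_seed + attempt       # assumed seed-derivation rule
        household = generate(seed)
        if not validate(household):      # empty error list means admitted
            return household, seed
    raise RuntimeError(f"no valid household within {max_attempts} attempts")
```

Returning the seed that finally passed keeps the admitted household reproducible even though earlier seeds were rejected.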

LLM-as-judge quality gating

Validated households then pass through an LLM-as-judge quality gate. A specialized prompt against Claude Haiku 4.5 evaluates each household for residual quality issues that the deterministic validator might miss — geographic-realism issues (e.g. bank-statement transactions referencing merchants in a different city than the household's residence), narrative-coherence issues (e.g. an early-career professional with implausible asset accumulation), and other plausibility checks.

The judge issues a verdict per household: ship, regenerate, or quarantine. Households marked 'regenerate' are re-run with a new seed; households 'quarantined' are excluded from the published corpus. A circuit breaker halts the run if quarantine rate exceeds 25% with at least 5 quarantined households — a signal that the underlying generator has a systematic issue requiring engineer attention.
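The circuit-breaker rule is compact enough to write down directly. The thresholds (more than 25%, at least 5 quarantined) are from the text; the function name, the verdict strings, and the choice to compute the rate over all households judged so far in the run are assumptions.

```python
QUARANTINE_RATE_LIMIT = 0.25  # halt when the rate strictly exceeds this
MIN_QUARANTINED = 5           # and at least this many households are quarantined


def should_halt(verdicts: list[str]) -> bool:
    """Decide whether the run should stop, given verdicts issued so far."""
    if not verdicts:
        return False
    quarantined = verdicts.count("quarantine")
    rate = quarantined / len(verdicts)
    return quarantined >= MIN_QUARANTINED and rate > QUARANTINE_RATE_LIMIT
```

The minimum-count condition keeps the breaker from tripping early in a run, when one or two quarantines out of a handful of households would already exceed 25%.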

Refreshes

We re-run the full generation pipeline periodically against current-year tax tables (federal and state), updated regulatory guidance (FINRA interpretive notices, SEC releases, IRS revenue rulings, CMS IRMAA bracket publications), and any new archetype variants added during the year. Each refresh produces a new corpus version; existing versions are not edited in place.

Refresh cadence and scope for buyers are still being defined — we are not currently charging for refreshes. Off-cycle refreshes for major regulatory changes are issued when warranted. Reach out if regulatory currency is central to your evaluation.

Privacy contract: the conditional demographic overlay

Race / ethnicity and religion fields are NOT in the default household record. They appear only as a conditional overlay for the two bundles where the planning question makes the data relevant: B08 (ESG Values Alignment Auditor) and B26 (Faith-Based & International Households). For households in any other bundle, these fields are absent.

This contract is enforced at the generator level — the overlay stages for these bundles populate the demographic fields; for any other bundle, the fields don't exist on the output record. This is a deliberate privacy choice rooted in the recognition that synthetic data, even when fully synthetic, can normalize the use of demographic-overlay data in contexts where the data isn't relevant. The conditional overlay limits the data's exposure to the use cases that genuinely need it.
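A buyer who wants to confirm the contract on ingested records could run a check along these lines. It is a sketch under assumptions: the key names race_ethnicity, religion, and bundles are hypothetical, since the article does not give the schema's field names; only the two bundle codes come from the text.

```python
# The two bundles for which the demographic overlay is permitted, per the text.
DEMOGRAPHIC_BUNDLES = frozenset({"B08", "B26"})

# Hypothetical key names for the overlay fields.
DEMOGRAPHIC_FIELDS = ("race_ethnicity", "religion")


def violates_privacy_contract(record: dict) -> bool:
    """True if demographic keys exist on a record outside the permitted bundles."""
    has_demographics = any(field in record for field in DEMOGRAPHIC_FIELDS)
    eligible = bool(DEMOGRAPHIC_BUNDLES & set(record.get("bundles", [])))
    return has_demographics and not eligible
```

The check tests for key presence rather than for non-null values, matching the statement that the fields do not exist at all on other bundles' records.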

Calibration sources

Calibration sources cited across the catalog include FINRA (rule-based suitability and elder-abuse standards), FinCEN (AML typologies and SAR-trigger guidance), IRS (tax tables, RMD tables, kiddie-tax thresholds, QSBS rules), SEC (Reg BI Care Obligation, accreditation pathways, investment-adviser custody rules), CMS (IRMAA brackets, Medicare premium structures), CFPB (consumer financial-protection guidance, predatory-lending taxonomies, fair-lending interpretation), NAIC (insurance illustration model regulations, suitability in annuity transactions), Cerulli (wealth-tier benchmarks, fee-schedule distributions), MSCI (ESG taxonomy), Schwab Charitable / Fidelity Charitable / NPT (DAF industry data), Cambridge Associates / Preqin (private-equity vintage performance), Williams Group (intergenerational wealth-transfer research), Northern Trust (HNW family planning frameworks), USCIS (visa-status taxonomy and processing data), FDIC (unbanked / underbanked household surveys), World Bank (bilateral remittance flow data), Carta (private-company option-grant benchmarks), and Glassdoor / SEC DEF 14A filings (executive compensation distributions).

Citation depth is one of the things that distinguishes WealthSchema for AEO (Answer Engine Optimization) — when an LLM is asked about a wealth-management topic, citation-rich content is preferentially surfaced.

FAQ

Is the LLM-as-judge step deterministic?

The judge prompt is deterministic in structure, but the LLM's verdict is non-deterministic in detail (different runs may produce slightly different rationale text). The verdict-level outcome (ship / regenerate / quarantine) is highly reproducible across runs. The Methodology PDF documents the judge prompt structure for compliance teams who need to demonstrate the methodology to examiners.

What models are used in the pipeline?

Generation uses Claude Sonnet 4.6 for routine households and Claude Opus 4.7 for complex or escalation cases. The LLM-as-judge stage uses Claude Haiku 4.5 (the smaller, faster model is appropriate for the focused quality-gate task). Validation is deterministic code, not LLM-driven. The choice of models is documented for reproducibility.

How is calibration verified?

Aggregate distributional statistics from the generated corpus are compared against published benchmark sources for each calibration dimension. Wealth tier distribution against Cerulli; banking access against FDIC; remittance corridors against World Bank; equity-comp grants against Carta and SEC DEF 14A.
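The article does not say which statistic is used for these comparisons. As one plausible option, the sketch below computes the total variation distance between observed category shares in a corpus and a benchmark distribution. The function name is hypothetical, and the numbers in the usage lines are toy values, not figures from FDIC or any other source.

```python
def total_variation_distance(
    observed_counts: dict[str, int],
    benchmark_shares: dict[str, float],
) -> float:
    """Half the L1 distance between observed shares and benchmark shares."""
    total = sum(observed_counts.values())
    if total == 0:
        raise ValueError("observed_counts is empty")
    categories = observed_counts.keys() | benchmark_shares.keys()
    return 0.5 * sum(
        abs(observed_counts.get(c, 0) / total - benchmark_shares.get(c, 0.0))
        for c in categories
    )


# Toy inputs for illustration only; these are not published benchmark values.
corpus_counts = {"banked": 90, "underbanked": 7, "unbanked": 3}
benchmark = {"banked": 0.80, "underbanked": 0.15, "unbanked": 0.05}
distance = total_variation_distance(corpus_counts, benchmark)
```

A distance of 0 means the shares match exactly and 1 means they do not overlap at all. For dimensions where the corpus deliberately departs from the population, such as wealth tier, a large distance is expected and the comparison target has to be chosen accordingly.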

Does the corpus include rendered documents (W-2 PDFs, 1040 forms, bank statements, etc.)?

No. Synthetic Wealth Data Sets currently ships structured JSON only: every field a document would contain is present in the schema, but no rendered PDFs are produced. An earlier pipeline did render documents, but the artifacts diverged from the canonical JSON (closing balances off by multiples of the underlying account totals, MFJ 1040s missing spousal wages, SSA-1099 amounts not matching the income schedule). We removed the document stage rather than ship inconsistent data. A future release may reintroduce documents under a strict 'correct-by-construction' architecture in which renderers project from the JSON without independent math; until then, buyers who need documents typically render their own templates against the structured fields.

How often is the consistency validator updated?

On every corpus refresh, plus off-cycle when a structural issue is identified. The validator is essentially a regression-test suite for the generator — when an issue surfaces in a generated household, a validator check is added to catch the issue prospectively. This means the validator strictness increases over time, which is a good property: refreshed corpora are progressively more reliable than older versions.

What's the path for use cases that don't fit any standard bundle?

Custom engagement. The same generator runs against custom calibration requirements; the output is delivered as a corpus structured to your specific use case. Reach out via the Contact link on the WealthSchema site to discuss custom engagement. Most custom engagements deliver in 2-4 weeks depending on the calibration work required.

Is the methodology peer-reviewed or independently validated?

The methodology is not peer-reviewed in the academic sense. The calibration sources are independently authoritative (FINRA, IRS, etc.). WealthSchema is open to engaging with academic researchers who want to publish on the methodology — citation-friendly use of the corpus is explicitly permitted under the data license.