Synthetic Wealth Data
Synthetic wealth data is artificially generated financial data that mimics the structure, distributions, and edge cases of real household portfolios (accounts, holdings, transactions, tax lots, beneficiaries) without copying or anonymizing any real individual's records. Properly built, it carries zero re-identification risk while preserving the joint-distribution realism that testing your engine actually requires.
The category exists because the two obvious alternatives, anonymized real data and naively generated mock data, both fail. Anonymized real data leaks: the combination of 5-digit ZIP code, date of birth, and gender uniquely identifies 87% of the US population (Sweeney 2000), and wealth datasets carry far more re-identification surface than that. Naive mock data passes type checks but has no joint-distribution realism: every household has the same archetypal structure, with no edge cases and no correlated risk.
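The re-identification claim can be made concrete with a small uniqueness check. The sketch below is only illustrative: the field names (`zip`, `dob`, `gender`) and the four toy rows are invented for this example and describe no real person. It measures what fraction of records are the only ones carrying their quasi-identifier combination.

```python
from collections import Counter

# Hypothetical column names for the three quasi-identifiers discussed above.
QUASI_IDENTIFIERS = ("zip", "dob", "gender")


def unique_fraction(records, keys=QUASI_IDENTIFIERS):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented toy rows: two share a combination, two are unique.
toy_rows = [
    {"zip": "00001", "dob": "1980-01-01", "gender": "F"},
    {"zip": "00001", "dob": "1980-01-01", "gender": "F"},
    {"zip": "00001", "dob": "1975-06-15", "gender": "M"},
    {"zip": "00002", "dob": "1980-01-01", "gender": "F"},
]
print(unique_fraction(toy_rows))  # 0.5
```

Any record counted as unique here could in principle be matched against an outside dataset that holds the same three fields, which is why removing names alone does not anonymize a wealth dataset.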
Production-grade synthetic wealth data sits in the middle: schema-correct, statistically calibrated against real demographic and financial benchmarks, with the rare-but-important edge cases (UHNW carryforward losses, spouse-account wash-sale triggers, pre-IPO equity comp) explicitly modeled. The generation process typically combines three parts: an archetype-driven structure (population targets per demographic and wealth segment), constrained model output (LLMs filling the schema within the archetype's bounds), and a strict validation gate (any inconsistency fails the household).
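A minimal sketch of that archetype, proposal, and gate loop might look like the following. Everything here is an assumption made for illustration: the `Archetype` fields, the numeric bounds for the affluent tier, and the flat household dictionary are hypothetical, and a seeded random sampler stands in for the LLM step so the example stays runnable.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Archetype:
    # Hypothetical archetype: the bounds every household in the segment must respect.
    archetype_id: str
    wealth_tier: str
    age_range: tuple[int, int]
    net_worth_range: tuple[float, float]
    account_count_range: tuple[int, int]


def propose_household(arch: Archetype, rng: random.Random) -> dict:
    """Stand-in for the constrained model step; a real pipeline would call an LLM here."""
    lo_age, hi_age = arch.age_range
    net_worth = round(rng.uniform(*arch.net_worth_range), 2)
    liabilities = round(rng.uniform(0, 0.5 * net_worth), 2)
    return {
        "archetype_id": arch.archetype_id,
        "wealth_tier": arch.wealth_tier,
        # Deliberately allowed to drift outside the bounds, as model output can.
        "primary_age": rng.randint(lo_age - 5, hi_age + 5),
        "net_worth": net_worth,
        "liabilities": liabilities,
        "assets": round(net_worth + liabilities, 2),
        "account_count": rng.randint(*arch.account_count_range),
    }


def validate(household: dict, arch: Archetype) -> list[str]:
    """Strict gate: return every violation; an empty list means the household passes."""
    errors = []
    if not arch.age_range[0] <= household["primary_age"] <= arch.age_range[1]:
        errors.append("primary_age outside archetype bounds")
    if not arch.net_worth_range[0] <= household["net_worth"] <= arch.net_worth_range[1]:
        errors.append("net_worth outside archetype bounds")
    if abs(household["assets"] - household["liabilities"] - household["net_worth"]) > 0.01:
        errors.append("net_worth != assets - liabilities")
    if not arch.account_count_range[0] <= household["account_count"] <= arch.account_count_range[1]:
        errors.append("account_count outside archetype bounds")
    return errors


def generate(arch: Archetype, n: int, seed: int = 0, max_attempts: int = 10_000) -> list[dict]:
    """Propose households until n of them pass the gate; failures are discarded whole."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_attempts):
        if len(accepted) == n:
            break
        candidate = propose_household(arch, rng)
        if not validate(candidate, arch):
            accepted.append(candidate)
    if len(accepted) < n:
        raise RuntimeError("validation gate rejected too many candidates")
    return accepted


# Bounds below are illustrative assumptions, not published tier definitions.
affluent = Archetype("P-03", "affluent", (35, 55), (500_000.0, 2_000_000.0), (5, 8))
households = generate(affluent, n=20, seed=7)
```

The design choice worth noting is that the gate rejects a failing household outright instead of patching it, so nothing that violates the archetype's bounds or the accounting identity reaches the output set.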
The distinguishing feature versus generic synthetic-data tools (Mockaroo, Faker, Tonic.ai) is domain depth: a useful synthetic wealth dataset has to model lot-level cost basis, multi-account wash-sale aggregation, RMD timing, IRMAA tier transitions, equity-comp vesting schedules, and trust-distribution rules, each of which is a separate engine inside a real wealth-tech platform.
Three design constraints separate audit-grade synthetic wealth data from everything else. First, zero leakage by construction: no field can join back to a real person via any identifier or quasi-identifier combination. Second, explicit demographic distribution control: race, religion, and other protected-class fields are never present in the default schema and appear only as conditional overlays for use cases that require them. Third, edge-case completeness: the dataset must include the failure modes your engine has to handle (IRA wash-sale cross-triggers, AMT cliffs, multi-state tax filers), not just the happy path.
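The second constraint lends itself to an automated guard. The sketch below assumes a hypothetical set of protected field names (`race` and `religion` come from the text; `ethnicity` is added only for illustration) and a hypothetical `allowed_overlays` argument standing in for the conditional overlays. It walks a household record and reports any protected key that has not been explicitly enabled.

```python
# Hypothetical field names; the actual excluded set would come from the schema owner.
PROTECTED_FIELDS = frozenset({"race", "ethnicity", "religion"})


def find_protected_fields(record, allowed_overlays=frozenset(), path=""):
    """Recursively collect paths of protected-class keys not enabled via an overlay."""
    hits = []
    if isinstance(record, dict):
        for key, value in record.items():
            child = f"{path}.{key}" if path else key
            if key in PROTECTED_FIELDS and key not in allowed_overlays:
                hits.append(child)
            hits.extend(find_protected_fields(value, allowed_overlays, child))
    elif isinstance(record, list):
        for i, item in enumerate(record):
            hits.extend(find_protected_fields(item, allowed_overlays, f"{path}[{i}]"))
    return hits


household = {
    "household_id": "H-2026-04-7821",
    "members": [{"role": "primary", "age": 47, "religion": "placeholder"}],
}
print(find_protected_fields(household))                                   # ['members[0].religion']
print(find_protected_fields(household, allowed_overlays={"religion"}))    # []
```

Running a check like this inside the validation gate keeps the default schema free of protected-class fields without relying on reviewers to spot them by eye.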
Common pitfalls
- Using LLM output without an archetype constraint, producing households whose joint distributions drift away from any real population segment.
- Skipping cross-field validation — net worth that doesn't equal assets minus liabilities, FICO inconsistent with payment history, age inconsistent with life events.
- Treating 'synthetic' as license to skip privacy review — the metadata about your generation pipeline can itself be sensitive.
- Over-fitting on the archetype: every household in a segment becomes statistically identical, which kills any test that depends on diversity within a segment.
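The last pitfall in the list can be caught with a simple dispersion check per segment. This is a rough sketch under stated assumptions: it uses the coefficient of variation as the diversity measure, the `min_cv` threshold of 0.10 is an arbitrary placeholder, and the archetype id `P-07` and all net-worth figures are invented for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def low_diversity_segments(households, field, min_cv=0.10):
    """Return archetype ids whose coefficient of variation for `field` falls below min_cv."""
    by_segment = defaultdict(list)
    for h in households:
        by_segment[h["archetype_id"]].append(h[field])
    flagged = []
    for segment, values in by_segment.items():
        avg = mean(values)
        # Population standard deviation relative to the mean; 0.0 if the mean is zero.
        cv = pstdev(values) / abs(avg) if avg else 0.0
        if cv < min_cv:
            flagged.append(segment)
    return sorted(flagged)


# Invented sample: P-03 households are nearly identical, P-07 households vary widely.
sample = [
    {"archetype_id": "P-03", "net_worth": 900_000},
    {"archetype_id": "P-03", "net_worth": 905_000},
    {"archetype_id": "P-03", "net_worth": 910_000},
    {"archetype_id": "P-07", "net_worth": 400_000},
    {"archetype_id": "P-07", "net_worth": 1_200_000},
    {"archetype_id": "P-07", "net_worth": 2_500_000},
]
print(low_diversity_segments(sample, "net_worth"))  # ['P-03']
```

A single-field check is only a starting point; a segment can vary in net worth and still be uniform in account mix or household composition, so the same test is worth repeating over several fields.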
Examples
Schema-shape example showing the typical envelope of a synthetic household. The full WealthSynth schema has ~30 universal sections plus per-bundle overlays.
{
  "household_id": "H-2026-04-7821",
  "archetype_id": "P-03",
  "wealth_tier": "affluent",
  "members": [
    { "role": "primary", "age": 47, "occupation": "engineering_manager" },
    { "role": "spouse", "age": 45, "occupation": "designer" },
    { "role": "dependent", "age": 12 }
  ],
  "accounts": [/* 5–8 accounts, lot-level */],
  "longitudinal": "/longitudinal/H-2026-04-7821.json",
  "validation_passed": true
}