Term · Fully fabricated household financial records

Synthetic Wealth Data

Published May 7, 2026

Definition

Synthetic wealth data is artificially generated financial data that mimics the structure, distributions, and edge cases of real household portfolios (accounts, holdings, transactions, tax lots, beneficiaries) without copying or anonymizing any real individual's records. Properly built, it carries zero re-identification risk while preserving the joint-distribution realism that testing your engine actually requires.

The category exists because the two obvious alternatives, anonymized real data and naively generated mock data, both fail. Anonymized real data leaks: the combination of ZIP code, full date of birth, and gender uniquely identifies 87% of US individuals (Sweeney 2000), and wealth datasets carry far more re-identification surface than that. Naive mock data passes type checks but has no joint-distribution realism: every household has the same archetypal structure, with no edge cases and no correlated risk.
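The leakage argument can be made concrete by counting how many records are alone in their quasi-identifier combination. The following is a minimal sketch on invented toy rows; the field names `zip`, `dob`, and `gender` are illustrative assumptions, not keys from the dataset's schema.

```python
from collections import Counter

# Toy rows, invented for illustration only (not drawn from any real or synthetic dataset).
records = [
    {"zip": "02139", "dob": "1978-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1978-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1981-11-02", "gender": "M"},
    {"zip": "94110", "dob": "1990-07-21", "gender": "F"},
    {"zip": "94110", "dob": "1965-01-30", "gender": "F"},
]


def unique_fraction(rows, quasi_identifiers):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    keys = [tuple(row[field] for field in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Adding date of birth to the key makes more rows unique, hence easier to re-identify.
with_dob = unique_fraction(records, ["zip", "dob", "gender"])
without_dob = unique_fraction(records, ["zip", "gender"])
print(with_dob, without_dob)
```

Every additional field in a wealth record (income, holdings, household composition) can only split these groups further, which is why the unique fraction tends to rise as the schema gets richer.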

Production-grade synthetic wealth data sits in the middle: schema-correct, statistically distributed against real demographic and financial benchmarks, with the rare-but-important edge cases (UHNW carryforward losses, spouse-account wash-sale triggers, pre-IPO equity comp) explicitly modeled. The generation process typically combines an archetype-driven structure (population targets per demographic and wealth segment) with constrained model output (LLMs filling the schema within the archetype's bounds) and a strict validation gate (any inconsistency fails the household).
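The three stages described above (archetype bounds, constrained model output, strict validation gate) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline: the `Archetype` class, its bounds, and all function names are hypothetical, and a seeded random generator stands in for the LLM step.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Archetype:
    """Hypothetical archetype: the bounds every household in the segment must respect."""
    archetype_id: str
    age_range: tuple
    income_range: tuple
    max_members: int


def propose_household(archetype, rng):
    """Stand-in for the constrained model step (an LLM in the text).

    The seeded RNG deliberately samples slightly outside the bounds so that
    the validation gate has something to reject.
    """
    lo_age, hi_age = archetype.age_range
    lo_inc, hi_inc = archetype.income_range
    return {
        "archetype_id": archetype.archetype_id,
        "primary_age": rng.randint(lo_age - 5, hi_age + 5),
        "gross_income_annual": rng.randint(int(lo_inc * 0.8), int(hi_inc * 1.2)),
        "member_count": rng.randint(1, archetype.max_members + 1),
    }


def validate(household, archetype):
    """Strict gate: list every violated bound; an empty list means the household passes."""
    problems = []
    lo_age, hi_age = archetype.age_range
    if not lo_age <= household["primary_age"] <= hi_age:
        problems.append("primary_age outside archetype bounds")
    lo_inc, hi_inc = archetype.income_range
    if not lo_inc <= household["gross_income_annual"] <= hi_inc:
        problems.append("gross_income_annual outside archetype bounds")
    if not 1 <= household["member_count"] <= archetype.max_members:
        problems.append("member_count outside archetype bounds")
    return problems


def generate(archetype, seed, max_attempts=200):
    """Propose candidates until one passes the gate; failed candidates are discarded whole."""
    rng = random.Random(seed)
    for _ in range(max_attempts):
        candidate = propose_household(archetype, rng)
        if not validate(candidate, archetype):
            return candidate
    raise RuntimeError("no candidate passed validation")


# Illustrative bounds only; real population targets would come from benchmarks.
young_family = Archetype("A-01", age_range=(28, 38), income_range=(80_000, 160_000), max_members=5)
household = generate(young_family, seed=1)
```

Seeding the generator keeps the run reproducible, and rejecting a failed candidate outright (rather than patching it) mirrors the "any inconsistency fails the household" rule.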

The distinguishing feature versus generic synthetic-data tools (Mockaroo, Faker, Tonic.ai) is domain depth: a useful wealth synthetic dataset has to model lot-level cost basis, multi-account wash-sale aggregation, RMD timing, IRMAA tier transitions, equity-comp vesting schedules, and trust-distribution rules — every one of which is a separate engine inside a real wealth-tech platform.

Re-identification rate by location granularity

Percentage of the US population uniquely identifiable by {location, full date of birth, gender}, as reported by each cited study. Even the most conservative estimate (county granularity, 18%) leaves ~60M people uniquely identifiable from three fields alone, and wealth datasets retain all three plus income, holdings, and household composition.
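The ~60M figure follows from simple arithmetic. The check below assumes the 2020 US Census resident population as the base; that base is an assumption made here, since the text does not say which population count it used.

```python
# Assumption: 2020 US Census resident population (50 states plus DC).
US_POPULATION = 331_449_281
# County-granularity share quoted in the caption above.
COUNTY_LEVEL_UNIQUE_SHARE = 0.18

uniquely_identifiable = US_POPULATION * COUNTY_LEVEL_UNIQUE_SHARE
print(f"{uniquely_identifiable / 1e6:.1f} million people")
```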

What a household record actually contains

household: object
├─ _meta: object · schema version, seed, archetype id, generation method
├─ persona: object · demographics, geography, generation cohort, story
├─ members: list[1–5] · per-member: role, age, employment, salary, SS projection
├─ cash_flow: object
│  ├─ income_sources: list · W-2, self-emp, pension, RMD, SS, K-1, etc.
│  ├─ expenses_monthly: object[23] · housing, healthcare, childcare, eldercare, …
│  ├─ gross_income_annual: int
│  └─ net_income_annual: float
├─ assets: object
│  ├─ checking_accounts: list
│  ├─ savings_accounts: list
│  ├─ brokerage_accounts: list · lot-level basis, holding period, wash-sale flags
│  ├─ retirement_accounts: list · 401k / IRA / Roth / 403b / SEP, per-owner
│  ├─ hsa_accounts: list
│  ├─ education_accounts: list · 529 / Coverdell, per-beneficiary
│  ├─ real_estate: list
│  └─ alternative_investments: list · private equity, real assets, crypto
├─ liabilities: object · mortgage, credit cards, student loans, auto
├─ tax_profile: object · filing status, AGI, AMT, carryforwards, IRMAA
├─ insurance: object · life, disability, LTC, P&C, umbrella
├─ credit_profile: object · FICO, utilization, DTI, payment history
├─ estate_profile: object · will, trusts, beneficiaries, gift history
├─ behavioral_profile: object · risk tolerance, advisor engagement, financial literacy
├─ goals: list · retirement, education, home, philanthropy, with target dates
├─ life_events: list · marriage, divorce, inheritance, job change, exit, death
├─ stress_scenarios: list · job loss, market drawdown, health event, sequence risk
├─ longitudinal: object
│  ├─ monthly: list[96] · 60 historical + 36 projected, ~20 metrics each
│  ├─ summary: object · start/end NW, CAGR, savings rate, peak-trough
│  └─ methodology: object · sampler version, deterministic seed, regen reason
├─ macro_environment: object · inflation, market returns, rates regime per generation period
├─ advisor_context: object · channel, fee schedule, AUM band, recommendation history
├─ derived_ratios: object · savings rate, DTI, liquidity ratio, concentration, etc.
└─ bundle_tags: list · which Wealth Data Sets this household belongs to

The actual top-level shape of a household record (sampled from A-01-seed-1, Young Family — First Home). Every household carries ~24 top-level sections plus 96 longitudinal monthly snapshots; bundle-specific overlays add further nested sections. Field names match the JSON keys exactly.
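A consumer of these records might start with a structural check before running any engine against them. The sketch below covers only the sections listed in the tree above and the two size constraints it states (1 to 5 members, 96 monthly snapshots); the function name and the toy record are hypothetical, and a real record may carry sections not shown in the tree.

```python
# Top-level keys as they appear in the tree above.
REQUIRED_SECTIONS = [
    "_meta", "persona", "members", "cash_flow", "assets", "liabilities",
    "tax_profile", "insurance", "credit_profile", "estate_profile",
    "behavioral_profile", "goals", "life_events", "stress_scenarios",
    "longitudinal", "macro_environment", "advisor_context",
    "derived_ratios", "bundle_tags",
]
HISTORICAL_MONTHS, PROJECTED_MONTHS = 60, 36  # 96 monthly snapshots in total


def shape_problems(household):
    """Return a list of structural problems; an empty list means the shape looks right."""
    problems = [f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in household]
    members = household.get("members", [])
    if not 1 <= len(members) <= 5:
        problems.append(f"members has {len(members)} entries, expected 1 to 5")
    monthly = household.get("longitudinal", {}).get("monthly", [])
    expected = HISTORICAL_MONTHS + PROJECTED_MONTHS
    if len(monthly) != expected:
        problems.append(f"longitudinal.monthly has {len(monthly)} entries, expected {expected}")
    return problems


# Toy record with empty sections, built only to exercise the check.
record = {name: {} for name in REQUIRED_SECTIONS}
record["members"] = [{"role": "primary"}]
record["longitudinal"] = {"monthly": [{} for _ in range(96)]}
print(shape_problems(record))
```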

Why this matters for synthetic data

Three design constraints separate audit-grade synthetic wealth data from everything else. First: zero leakage by construction — no field can join back to a real person via any identifier or quasi-identifier combination. Second: explicit demographic distribution control — race, religion, and other protected-class fields are never present in the default schema (they appear only as conditional overlays for use cases that require them). Third: edge-case completeness — the dataset must include the failure modes your engine actually depends on (IRA wash-sale cross-triggers, AMT cliffs, multi-state tax filers), not just the happy path.
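Two of these constraints lend themselves to automated checks: that protected-class fields are absent from the default schema, and that the required edge cases actually occur somewhere in the dataset. The following is a sketch under assumptions: the `edge_case_tags` key and the tag names are hypothetical labels invented for illustration, not part of the documented schema.

```python
PROTECTED_KEYS = {"race", "religion"}
# Hypothetical tag names for the edge cases named in the text.
REQUIRED_EDGE_CASES = {"ira_wash_sale_cross_trigger", "amt_cliff", "multi_state_filer"}


def protected_key_paths(node, path="household"):
    """Walk nested dicts and lists; return the path of every protected-class key found."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in PROTECTED_KEYS:
                found.append(child)
            found.extend(protected_key_paths(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(protected_key_paths(item, f"{path}[{index}]"))
    return found


def missing_edge_cases(households):
    """Return the required edge cases that no household in the dataset is tagged with."""
    seen = set()
    for household in households:
        seen.update(household.get("edge_case_tags", []))
    return REQUIRED_EDGE_CASES - seen


# Toy dataset, invented for illustration.
dataset = [
    {"persona": {"state": "CA"}, "edge_case_tags": ["amt_cliff"]},
    {"persona": {"state": "NY", "religion": "placeholder"}, "edge_case_tags": ["multi_state_filer"]},
]
print(protected_key_paths(dataset[1]), missing_edge_cases(dataset))
```

The first constraint, zero leakage by construction, is a property of the generation process rather than of any single record, so it is not something a record-level check like this can establish.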

Common pitfalls

Using LLM output without an archetype constraint, producing households whose joint distributions drift away from any real population segment.
Skipping cross-field validation — net worth that doesn't equal assets minus liabilities, FICO inconsistent with payment history, age inconsistent with life events.
Treating 'synthetic' as license to skip privacy review — the metadata about your generation pipeline can itself be sensitive.
Over-fitting on the archetype: every household in a segment becomes statistically identical, which kills any test that depends on diversity within a segment.
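Two of the pitfalls above, skipped cross-field validation and over-fitted segments, can be caught mechanically. The sketch below uses hypothetical field names (`total_assets`, `total_liabilities`, `net_worth`, `age_at_event`), assumes `life_events` holds only past events, and picks an arbitrary 5% threshold for the diversity check.

```python
from statistics import mean, pstdev


def cross_field_errors(household):
    """Check fields against each other instead of one at a time."""
    errors = []
    expected = household["total_assets"] - household["total_liabilities"]
    if household["net_worth"] != expected:
        errors.append(f"net_worth {household['net_worth']} != assets - liabilities {expected}")
    # Assumption: life_events records past events, so none can postdate the current age.
    for event in household.get("life_events", []):
        if event["age_at_event"] > household["primary_age"]:
            errors.append(f"{event['type']} at age {event['age_at_event']} exceeds current age")
    return errors


def coefficient_of_variation(values):
    """Population standard deviation relative to the mean (0.0 when the mean is zero)."""
    average = mean(values)
    return pstdev(values) / average if average else 0.0


def segment_too_uniform(households, field, min_cv=0.05):
    """Flag a segment whose households barely differ on a field (threshold is an assumption)."""
    return coefficient_of_variation([h[field] for h in households]) < min_cv


# Toy households, invented for illustration.
consistent = {"total_assets": 500_000, "total_liabilities": 200_000, "net_worth": 300_000,
              "primary_age": 47, "life_events": [{"type": "marriage", "age_at_event": 29}]}
inconsistent = {"total_assets": 500_000, "total_liabilities": 200_000, "net_worth": 350_000,
                "primary_age": 47, "life_events": [{"type": "inheritance", "age_at_event": 52}]}
print(cross_field_errors(consistent), len(cross_field_errors(inconsistent)))
```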

Examples

Household record (top-level)

Schema-shape example showing the typical envelope of a synthetic household. The full schema has ~30 universal sections plus per-bundle overlays.

{
  "household_id": "H-2026-04-7821",
  "archetype_id": "P-03",
  "wealth_tier": "affluent",
  "members": [
    { "role": "primary", "age": 47, "occupation": "engineering_manager" },
    { "role": "spouse", "age": 45, "occupation": "designer" },
    { "role": "dependent", "age": 12 }
  ],
  "accounts": ["<5–8 lot-level account objects, elided>"],
  "longitudinal": "/longitudinal/H-2026-04-7821.json",
  "validation_passed": true
}

Frequently asked questions

Is synthetic wealth data covered by GLBA, GDPR, or CCPA?

No. Those regimes regulate data about identified or identifiable natural persons. A properly generated synthetic household has no real person on the other side of any record, so it falls outside the regulatory scope. The legal posture mirrors the synthetic test data shipped by major cloud providers and major fintech infrastructure vendors.

How realistic does it need to be to backtest a real algorithm?

The bar is joint-distribution realism, not record-level realism. Your algorithm cares about whether the correlation between concentration risk and household income looks like the real population — not whether any individual record could exist in the wild. Audit-grade datasets target <2% reviewer-flag rate on internal QA passes.
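One way to test joint-distribution realism is to compare a correlation computed on the synthetic population against a reference value. The sketch below is only illustrative: the toy numbers, the placeholder reference correlation, and the 0.1 tolerance are all assumptions, while the 2% flag-rate bar comes from the answer above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


def within_tolerance(synthetic_r, reference_r, tolerance=0.1):
    """True when the synthetic correlation is close enough to the reference one."""
    return abs(synthetic_r - reference_r) <= tolerance


def flag_rate_ok(flagged, reviewed, max_rate=0.02):
    """True when the reviewer-flag rate is below the bar (<2% in the text)."""
    return flagged / reviewed < max_rate


# Toy synthetic population: household income vs. largest-position concentration.
incomes = [60_000, 95_000, 140_000, 210_000, 400_000]
concentration = [0.10, 0.18, 0.15, 0.30, 0.45]
synthetic_r = pearson(incomes, concentration)

REFERENCE_R = 0.9  # placeholder, NOT a published benchmark
print(round(synthetic_r, 3), within_tolerance(synthetic_r, REFERENCE_R))
```

A single correlation is a narrow lens; in practice one would repeat the comparison across many field pairs, since a dataset can match one relationship and miss the others.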

When does synthetic wealth data fall short?

Two known limits: the dataset only contains edge cases the generator was told to model, and any algorithm whose output depends on specific named institutions (Fidelity-only quirks, Schwab-specific cost-basis methods) will need either a custodian-conditional overlay or supplementary real-world testing.