Why we built WealthSchema on synthetic data instead of anonymized real data

ZIP + DOB + gender re-identifies 87% of Americans (Sweeney 2000); add an income band and it crosses 99%. Anonymized wealth datasets retain all four because the models that use them depend on them.

WealthSchema Staff | Synthetic data, R&D | May 7, 2026 | 4 min read

The first question every prospective buyer asks is the same: why not just buy anonymized data? It is cheaper. The records are real. The relationships are real. And every fintech engineer has at least once stitched together a "production-like" dataset by stripping names off a CSV their compliance team handed them.

We tried that path in 2024. Two months in, three of our test households were uniquely identifiable from public LinkedIn pages plus a single ZIP code. The next month a fourth re-identified through a 13F filing. We threw out anonymization and rebuilt on a fully synthetic foundation that month.

This article is the case for why every team that handles wealth data — for testing, for analytics, for backtesting — should do the same.

The math of re-identification

The classic result is from Latanya Sweeney (2000): 87% of Americans are uniquely identified by the combination of ZIP code, date of birth, and gender. Add a single income band and the number rises above 99%. Anonymized financial data routinely retains all four of those fields because models depend on them.

For a wealth dataset to be useful for backtesting, it has to retain more than ZIP + DOB + gender. It has to keep account balances, tax-filing status, employer industry, household composition, and occupation. Each of those is itself a re-identification vector.
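
The effect of adding one more field can be measured directly on any dataset. The following is a minimal sketch on made-up toy records, not Sweeney's methodology; the field names and the helper `unique_fraction` are assumptions for illustration. It counts how many rows are the only ones carrying their combination of quasi-identifiers.

```python
from collections import Counter

# Toy records: every value here is made up for illustration.
records = [
    {"zip": "02139", "dob": "1980-03-14", "gender": "F", "income_band": "150-200k"},
    {"zip": "02139", "dob": "1980-03-14", "gender": "F", "income_band": "50-75k"},
    {"zip": "02139", "dob": "1975-11-02", "gender": "M", "income_band": "75-100k"},
    {"zip": "94110", "dob": "1990-07-21", "gender": "M", "income_band": "100-150k"},
    {"zip": "94110", "dob": "1990-07-21", "gender": "M", "income_band": "100-150k"},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


base = unique_fraction(records, ["zip", "dob", "gender"])
with_income = unique_fraction(records, ["zip", "dob", "gender", "income_band"])
print(base, with_income)
```

On this toy set the unique share rises from one row in five to three in five once the income band is included. The same function can be pointed at a real extract to see how much each retained field costs.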

What "anonymization" actually means in practice

In practice, the term "anonymized" covers four distinct techniques, each with a different leakage profile.

| Technique | Risk | Audit story |
| --- | --- | --- |
| Hashing identifiers | Trivial: rainbow tables for SSNs | Looks anonymous on first glance, isn't |
| Tokenization | Lookup table is the entire risk surface | Audit grade depends on token storage |
| k-anonymity (k≥5) | Strong against direct lookup, weak against linkage attacks | Defensible if k is enforced AND auxiliary fields are bounded |
| Differential privacy | Strongest formal guarantee | Real but not sufficient for record-level analytics; the added noise breaks backtesting |
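
To see why hashing identifiers rates as trivial: an SSN has only nine digits, so an unsalted hash can be reversed by hashing every candidate and comparing. The sketch below is illustrative only. It assumes unsalted SHA-256, uses a made-up value with the never-issued `000` area prefix, and searches a deliberately tiny range; `recover_ssn` is a hypothetical helper.

```python
import hashlib
from typing import Iterable, Optional


def sha256_hex(value: str) -> str:
    """Unsalted SHA-256 of a string, as a 'hash the identifier' scheme might use."""
    return hashlib.sha256(value.encode("ascii")).hexdigest()


def recover_ssn(target_hash: str, candidates: Iterable[int]) -> Optional[str]:
    """Hash each zero-padded nine-digit candidate until one matches the target."""
    for n in candidates:
        ssn = f"{n:09d}"
        if sha256_hex(ssn) == target_hash:
            return ssn
    return None


# Made-up value; the 000 area prefix is never assigned to a real person.
leaked_hash = sha256_hex("000012345")

# A tiny search range keeps the demo fast; the full space is only 10**9 values.
found = recover_ssn(leaked_hash, range(0, 100_000))
print(found)
```

A precomputed table removes even the loop, which is the rainbow-table case the table refers to.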

The honest middle ground for wealth data is k-anonymity at k≥5 plus aggressive top-coding and generalization (capping any account balance above the 95th percentile, generalizing any ZIP with fewer than 1,000 households). What you get is data that is no longer realistic: it cannot be used to test edge-case logic like UHNW household carryforward losses or single-state ZIP-level sales-tax calculations. The exact things your engine needs to handle correctly are the things k-anonymity has to file off.
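
A rough sketch of both steps shows what gets lost. It assumes a nearest-rank percentile and hypothetical field names (`zip3`, `age_band`); the helpers `top_code` and `small_groups` are illustrative, not a reference implementation of any particular anonymization tool.

```python
from collections import Counter


def top_code(balances, percentile=95):
    """Cap every balance at the nearest-rank percentile; return (capped, cap)."""
    ordered = sorted(balances)
    # Integer ceiling of percentile * n / 100, so no float rounding is involved.
    rank = max(1, -(-percentile * len(ordered) // 100))
    cap = ordered[rank - 1]
    return [min(b, cap) for b in balances], cap


def small_groups(rows, quasi_identifiers, k=5):
    """Quasi-identifier groups with fewer than k rows (k-anonymity violations)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return {key: n for key, n in counts.items() if n < k}


# Nineteen ordinary balances plus one made-up UHNW household.
balances = list(range(10_000, 200_000, 10_000)) + [40_000_000]
capped, cap = top_code(balances)

rows = [{"zip3": "021", "age_band": "40-49"}] * 5 + [{"zip3": "941", "age_band": "30-39"}] * 2
violations = small_groups(rows, ["zip3", "age_band"])
print(cap, max(capped), violations)
```

In this toy run the $40M balance is capped to $190,000, so the one household that would exercise UHNW logic no longer exists in the data, and the two-row group would have to be suppressed or generalized further.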

Synthetic data, properly built, has none of these problems

A correctly generated synthetic household has no causal link to a real person. There is no joinable identifier. The financial statements are internally consistent — net worth = assets minus liabilities to the dollar — but the underlying entity does not exist in any registry, payroll system, custodial account, or property record.
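
The "to the dollar" identity is easy to check mechanically. This is a minimal sketch assuming a hypothetical profile layout (`assets`, `liabilities`, and `net_worth` as decimal strings), not the actual WealthSchema schema. It uses `Decimal` so that cents are compared exactly rather than through floating point.

```python
from decimal import Decimal


def net_worth_mismatch(household):
    """Stated net worth minus (assets - liabilities); zero means the identity holds."""
    assets = sum((Decimal(v) for v in household["assets"].values()), Decimal("0"))
    liabilities = sum((Decimal(v) for v in household["liabilities"].values()), Decimal("0"))
    return Decimal(household["net_worth"]) - (assets - liabilities)


# Made-up household; field names are assumptions for illustration.
household = {
    "assets": {"brokerage": "1250000.00", "primary_residence": "830000.00"},
    "liabilities": {"mortgage": "412500.00"},
    "net_worth": "1667500.00",
}

off_by_a_cent = dict(household, net_worth="1667500.01")
print(net_worth_mismatch(household), net_worth_mismatch(off_by_a_cent))
```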

The promise of anonymized data is that you can have realistic edge cases AND privacy guarantees. You can't. Synthetic data trades a small loss in 'realness' for a zero-leak guarantee — and we found the realness loss was much smaller than we expected.

Internal R&D retro, 2024

The realness gap shrank with model quality. Our 2024 prototype, generated on GPT-4-class models, had 18% of households flagged by reviewers as obviously synthetic ("a 27-year-old with a $4M brokerage account, no income, no inheritance"). The 2026 v4 corpus, generated through the staged pipeline with explicit archetype constraints and cross-field validation, has that number under 2%, and the remaining flags are usually compositional issues we want to surface anyway. Companion piece: synthetic data fidelity privacy utility tradeoff.

Fair-lending and disparate-impact exposure

Anonymized real data carries a second risk that has nothing to do with re-identification: it carries the biases of the historical underwriting decisions that produced it.

A model backtested against historical lending decisions inherits the demographic skew of those decisions. Even if individual records are anonymous, the joint distribution of credit decisions, ZIP codes, and income bands encodes patterns that fail Reg B / ECOA fair-lending audits. Federal regulators have made clear that "we tested against anonymized real data" is not a defense.
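
One common first-pass screen for this kind of skew is to compare approval rates across groups. The sketch below is a screening heuristic on made-up data, not a Reg B compliance test: the grouping key (`zip_cluster_a` and `zip_cluster_b`) is hypothetical, and the 0.8 threshold is the four-fifths rule of thumb borrowed from employment-selection guidance, used here only as an assumed cutoff.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Approval rate per group from (group, approved) pairs."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}


def adverse_impact_ratio(rates):
    """Lowest group approval rate divided by the highest."""
    return min(rates.values()) / max(rates.values())


# Made-up historical decisions: 8 of 10 approved in one cluster, 5 of 10 in the other.
decisions = (
    [("zip_cluster_a", True)] * 8 + [("zip_cluster_a", False)] * 2
    + [("zip_cluster_b", True)] * 5 + [("zip_cluster_b", False)] * 5
)

rates = approval_rates(decisions)
ratio = adverse_impact_ratio(rates)
flagged = ratio < 0.8  # assumed screening threshold, not a legal standard
print(rates, round(ratio, 3), flagged)
```

Note that no individual record needs to be identifiable for the ratio to come out skewed, which is the point of the paragraph above.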

This is a real reason fintech teams have moved to synthetic data even when their privacy posture didn't strictly require it. The race_ethnicity and religion fields in our schema are explicitly conditional overlays — never present in the default household — for exactly this reason.

What we ship instead

WealthSynth households are generated against 71 archetypes with hardcoded population targets. Every household passes a strict validation gate before promotion: arithmetic identities, archetype fidelity, internal consistency, and bundle-overlay reconciliation. The deliverable is a structured JSON profile per household — no rendered documents, no LLM-authored narrative prose. Profiles are generated deterministically from archetype templates and applicable bundle overlays, with an LLM-as-judge gate catching residual quality issues; total LLM spend on a full 1,451-household corpus refresh is a fraction of the cost of a single re-identification incident.
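
As a rough illustration of what deterministic, archetype-driven generation with conditional overlays can look like, here is a hedged sketch. It is not the WealthSynth pipeline: the archetype name, its bands, the seeding scheme, and the helper `build_profile` are all assumptions made up for this example.

```python
import hashlib
import random

# Hypothetical archetype with made-up bands; real targets would come from
# documented population sources.
ARCHETYPES = {
    "early_career_renter": {"age": (24, 32), "net_worth": (5_000, 80_000)},
}


def build_profile(archetype_id, index, overlays=()):
    """Build one profile; the same (archetype_id, index) always yields the same output."""
    # Derive the seed from a stable hash: built-in hash() varies between runs.
    digest = hashlib.sha256(f"{archetype_id}:{index}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    bands = ARCHETYPES[archetype_id]
    low, high = bands["net_worth"]
    profile = {
        "archetype": archetype_id,
        "age": rng.randint(*bands["age"]),
        "net_worth": rng.randrange(low, high + 1, 1_000),
    }
    # Overlays are applied only when explicitly requested, so optional or
    # sensitive fields never appear in the default household.
    for overlay in overlays:
        profile.update(overlay)
    return profile


default = build_profile("early_career_renter", 7)
multi_state = build_profile("early_career_renter", 7, overlays=[{"filing_states": ["NY", "NJ"]}])
print(default, multi_state)
```

Because the seed depends only on the archetype and index, a corpus refresh reproduces the same base households, and an overlay changes only the fields it names.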

What an audit-grade synthetic dataset must demonstrate

  • No record traces back to a real individual via any joinable field (zero-leak by construction)
  • Demographic distributions are explicitly controlled and documented
  • Sensitive fields (race, religion) are absent unless explicitly required for a use case
  • Stress edge cases (UHNW carryforwards, multi-state filers, IRA wash-sale triggers) are present
  • Generation methodology is reproducible and version-controlled
  • Validation gate is strict — any warning fails the household, no exceptions
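
The last requirement, that any warning fails the household, can be sketched as a gate that collects warnings from independent checks and passes only when the list is empty. The checks and field names below are hypothetical examples for illustration, not the actual validation suite.

```python
SENSITIVE_FIELDS = {"race_ethnicity", "religion"}


def check_sensitive_fields(household):
    """Warn on sensitive fields that no requested overlay calls for."""
    present = SENSITIVE_FIELDS.intersection(household)
    allowed = set(household.get("required_overlays", ()))
    return [f"unexpected sensitive field: {name}" for name in sorted(present - allowed)]


def check_wealth_source(household):
    """Warn on a large brokerage balance with no income and no inheritance."""
    if (
        household["brokerage_balance"] >= 1_000_000
        and household["annual_income"] == 0
        and not household["inheritance"]
    ):
        return ["large brokerage balance with no income or inheritance"]
    return []


def run_gate(household, checks):
    """Strict gate: the household passes only if no check emits any warning."""
    warnings = [w for check in checks for w in check(household)]
    return len(warnings) == 0, warnings


CHECKS = [check_sensitive_fields, check_wealth_source]

# Made-up households for illustration.
plausible = {"age": 41, "brokerage_balance": 350_000, "annual_income": 185_000, "inheritance": False}
implausible = {"age": 27, "brokerage_balance": 4_000_000, "annual_income": 0, "inheritance": False}

print(run_gate(plausible, CHECKS))
print(run_gate(implausible, CHECKS))
```

Keeping each check a plain function that returns a list makes the strictness explicit: there is no severity level to downgrade, only an empty or non-empty list.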

The bottom line

We built WealthSchema because every team we talked to had the same problem: they wanted realistic test data, the only "realistic" data was anonymized real data, and the privacy posture of anonymized real data did not survive any honest threat model. Synthetic was the only path that made the privacy story trivial — and once we got the realness gap below 2%, it stopped being a trade-off at all.

If your engine handles tax lots, equity grants, retirement income sequencing, or any of the edge cases real underwriting actually depends on, you need data with all the structure and none of the leakage. That is the job WealthSchema exists to do. For the basics, see synthetic financial data primer.

Key takeaways

  • Anonymization is not a privacy guarantee for wealth data — ZIP + DOB + gender alone re-identifies 87% of US individuals. See [GLBA GDPR CCPA synthetic data](/articles/glba-gdpr-ccpa-synthetic-data).
  • k-anonymity strong enough to pass audit also files off the edge cases your engine needs to handle.
  • Anonymized historical data carries the biases of the original underwriting — fair-lending exposure even when records are 'anonymous'.
  • Properly generated synthetic households remove both the re-identification and disparate-impact risks at the cost of a small (<2%) realness gap.

Frequently asked questions

Is your synthetic data covered by GLBA / GDPR?
No. GLBA covers Non-Public Personal Information about real consumers. GDPR covers personal data about identified or identifiable natural persons (Art. 4(1)). Synthetic households are neither — there is no real person to which they could be linked. The legal posture is identical to the synthetic test data shipped by the major cloud providers. See [GLBA Safeguards Rule implementation](/articles/glba-safeguards-rule-implementation-guide).
How do you keep distributions realistic without copying real data?
Archetype-driven generation. Each household sits inside one of 71 archetypes whose population targets and statistical bands are derived from public sources (IRS SOI, FRB SCF, BLS CES). The LLM fills the schema; the archetype constrains the joint distribution. We never see the underlying records that informed the archetype — they're public aggregates.
Can I still use real anonymized data alongside WealthSynth?
Yes — we see this often. Teams use WealthSynth for the deterministic edge cases (Reg BI suitability, lot-level TLH, RMD calculations) and a small anonymized real-data sample for distribution checks. The synthetic data carries the schema; the real data confirms the distribution.