B31 · Master Corpus · Everything

WealthSynth Master Corpus

The WealthSynth Master Corpus is the foundation every other Wealth Data Set is sliced from. 1,451 synthetic households spanning all 71 archetypes, every bundle overlay applied where eligibility holds, and 96 months of longitudinal data per household — delivered as structured JSON profiles. It is the canonical source-of-truth corpus for organizations building multi-product fintech infrastructure or running structured research programs against synthetic financial data.

Households: 1,451
Archetypes: 71
Formats: JSON, CSV, Parquet
Deviation: High

Why this Data Set exists

Buying individual Wealth Data Sets makes sense when the use case is narrow, such as Reg BI testing, tax-loss-harvesting backtesting, or equity-comp planning. For organizations whose product surface spans multiple bundles, buying them individually is operationally expensive: separate purchase events, separate licence agreements, separate ingestion pipelines, separate refresh cycles. The Master Corpus exists for buyers whose product surface is broad enough that the unit economics have inverted: it is cheaper to buy everything than to buy the relevant subset of bundles individually.

The second use case is harder to articulate but equally important: research and infrastructure that doesn't fit any single bundle scope. An academic research program studying intergenerational wealth transfer needs households with longitudinal data and the structural diversity of the full population; a platform vendor building data-pipeline infrastructure for advisor firms needs the realistic structural diversity that only the full corpus provides; a regulatory consultancy building analytics tooling needs the integrated picture across all 30 bundle overlays. None of these use cases buy a single bundle — they buy the corpus.

Use Cases

Platform vendor data infrastructure
Academic research programs
Multi-product fintech development
Internal compliance / training data

Who uses this Data Set

Platform Vendor Building Data-Pipeline Infrastructure

Uses the Master Corpus as the canonical fixture set for data-pipeline development and regression testing. Every advisor-firm customer's data ingestion gets validated against the realistic structural diversity of the 1,451-household corpus before going to production.

Large RIA Building Multi-Product Tools

Uses the corpus for internal tool development across compliance, tax, estate, and behavioural surfaces — buying once and using across multiple product workstreams.

Academic Research Program

Studies wealth-related research questions (intergenerational transfer, behavioural patterns, demographic equity) using a corpus where the longitudinal data and structural diversity support research designs that single bundles can't.

Internal Compliance and Training Data Function

Uses the corpus as the institutional training data set for compliance staff onboarding, audit-readiness exercises, and supervisory-process testing — without exposing any real client data to staff in training.

Regulatory Consultancy

Builds analytics tooling against the full structural diversity of the corpus, demonstrating regulatory analyses to client firms with realistic populations spanning every demographic and life-stage segment regulators care about.

What's inside

The Master Corpus is everything:

All 1,451 households across all 71 archetypes from src/data/archetypes-v3.ts
Every applicable bundle overlay (B01–B30), populated for each household based on archetype-bundle-eligibility logic
96 monthly longitudinal snapshots per household covering income, expenses, balances, and behavioural metrics
The conditional demographic overlay, populated for households tagged for B08 and B26
The full schema documentation as a separate deliverable
The conditional-overlay-and-privacy-contract reference guide

The deliverable is structured for operational use. Households are organized by archetype with manifest files indicating which bundle overlays apply to each. Schema documentation includes both the field-level reference and the cross-cutting topical guides (longitudinal contract, conditional privacy overlay, audit-trail structure).

The Data Set ships as JSON, CSV, and Parquet, accompanied by the WealthSynth Methodology PDF documenting corpus structure, bundle-eligibility logic, the longitudinal contract, conditional-privacy-contract details, and the integration patterns common to platform-vendor and large-RIA buyers. Annual refresh delivers the full corpus regenerated against current-year tax tables, regulatory guidance, and any new archetype variants added to the catalog.

Preview a sample household

A redacted summary of one household from this Data Set — names, employers, exact balances, and metro area are stripped. Ages are bucketed, income and net worth are reported as bands. The full record (and all 1,451 like it) ships in the ZIP.

F-01 · New Graduate Tech Worker (representative archetype household)
Household: Single
State: WV
Gross income (band): $50k–$100k
Net worth (band):
Dependents: 0
Income source types: w2 salary, w2 bonus
Members (1): primary, age 25–29, professional services

Technical Highlights

Every bundle overlay across all 1,451 households
96 monthly longitudinal snapshots per household
Structured JSON, CSV, and Parquet — single source of truth
Single deliverable, not sum of bundle prices

Sample Schema Fields

sample_record.json
{
  "all_universal_core_fields": <value>,
  "all_30_bundle_overlays": <value>,
  "longitudinal.monthly[96]": <value>,
  "bundle_tags[]": <value>,
  "schema_documentation": <value>
}
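
The placeholder listing above names field groups rather than concrete fields. As a rough illustration only, a trimmed record could look like the sketch below. The names `id`, `archetype_id`, `bundle_tags`, `overlays`, and `longitudinal.monthly[].month_index` are taken from the sample queries on this page; the id format, the overlay payloads, the assumption that `month_index` starts at 0, and the `checkShape` helper are all hypothetical.

```javascript
// Hypothetical, heavily trimmed household record. Values are invented; the
// shipped schema documentation is the authority on real field names.
const sampleHousehold = {
  id: "hh-0001",                  // assumed id format
  archetype_id: "F-01",
  bundle_tags: ["B02", "B16"],
  overlays: { B02: {}, B16: {} }, // overlay payloads omitted
  longitudinal: {
    // 96 snapshots indexed relative to month_0 (assumed to start at 0)
    monthly: Array.from({ length: 96 }, (_, i) => ({ month_index: i })),
  },
};

// Minimal structural check; returns a list of problems (empty means OK).
function checkShape(h) {
  const problems = [];
  if (typeof h.id !== "string") problems.push("id must be a string");
  if (typeof h.archetype_id !== "string") problems.push("archetype_id must be a string");
  if (!Array.isArray(h.bundle_tags)) problems.push("bundle_tags must be an array");
  if (!Array.isArray(h.longitudinal?.monthly)) {
    problems.push("longitudinal.monthly must be an array");
  } else if (h.longitudinal.monthly.length !== 96) {
    problems.push("expected 96 monthly snapshots");
  }
  return problems;
}

console.log(checkShape(sampleHousehold)); // []
```

A check like this is cheap to run on ingestion before any of the heavier queries.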

Sample queries

Verify archetype-and-bundle coverage

Returns the cross-tabulation of archetype-by-bundle counts in the corpus — useful as a structural verification that the corpus delivery is complete and matches the expected eligibility logic.

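// `households` is the array of parsed household records.
// Counts households per archetype/bundle pair, keyed as "<archetype_id>-<bundle_tag>".
households.reduce((acc, h) => {
  for (const bundleTag of h.bundle_tags) {
    const key = `${h.archetype_id}-${bundleTag}`;
    acc[key] = (acc[key] || 0) + 1;
  }
  return acc;
}, {})

Surface households spanning multiple bundle overlays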

Returns households tagged for 4+ bundle overlays — the structurally complex households where multi-bundle product features can be tested most thoroughly.

// Keeps households with four or more bundle tags and returns a compact summary of each.
households.filter(h =>
  h.bundle_tags.length >= 4
).map(h => ({
  id: h.id,
  archetype: h.archetype_id,
  bundle_count: h.bundle_tags.length,
  bundles: h.bundle_tags
}))

Validate longitudinal-data integrity

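For each household, verifies that exactly 96 monthly snapshots are present and that month_index advances by exactly one between consecutive snapshots. The query returns the households that fail. Consistency of the month-over-month transitions themselves (balance changes, income flows, behavioural-score continuity) is not covered by this query and needs additional predicates.

// Flags households with the wrong snapshot count or a gap, repeat, or
// backwards step in month_index.
households.filter(h =>
  h.longitudinal.monthly.length !== 96 ||
  !h.longitudinal.monthly.every((m, i, arr) =>
    i === 0 ||
    m.month_index - arr[i-1].month_index === 1)
)

Audit bundle-overlay completeness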

Returns households whose bundle_tags contradict their populated overlay fields — useful for delivery-completeness verification across the full 30-overlay surface.

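// Flags households where an overlay is populated without a matching tag,
// or a tag is declared without a populated overlay.
households.filter(h => {
  const declared = new Set(h.bundle_tags);
  const populated = Object.keys(h.overlays || {});
  return populated.some(b => !declared.has(b)) ||
    [...declared].some(b => !populated.includes(b));
})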
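
The flat `archetype-bundle` keys produced by the coverage query can be awkward to scan, partly because archetype ids contain hyphens themselves. One possible variation, sketched here on a hypothetical three-household fixture, groups the counts by archetype first. The fixture values and the `coverageByArchetype` name are invented for illustration.

```javascript
// Hypothetical fixture; real records carry many more fields.
const fixture = [
  { id: "hh-1", archetype_id: "F-01", bundle_tags: ["B02", "B16"] },
  { id: "hh-2", archetype_id: "F-01", bundle_tags: ["B02"] },
  { id: "hh-3", archetype_id: "A-06", bundle_tags: ["B16"] },
];

// Builds { archetype_id: { bundle_tag: count } } from household records.
function coverageByArchetype(households) {
  const table = {};
  for (const h of households) {
    const row = (table[h.archetype_id] ??= {});
    for (const tag of h.bundle_tags) {
      row[tag] = (row[tag] ?? 0) + 1;
    }
  }
  return table;
}

const coverage = coverageByArchetype(fixture);
console.log(coverage["F-01"]); // { B02: 2, B16: 1 }
console.log(coverage["A-06"]); // { B16: 1 }
```

The nested form makes it straightforward to compare one archetype's row against the expected eligibility list.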

Methodology

The Master Corpus is the canonical source: every other Wealth Data Set is a curated subset of this corpus with bundle-specific overlay emphasis. Generation runs through the WealthSynth pipeline with all 30 bundle overlays applied where archetype eligibility holds. Longitudinal data is generated as 96-month sequences per household with realistic seasonal patterns, life-event triggers, and behavioural trajectories. Every field is deterministically derived from the household's archetype template and applicable bundle overlays — no LLM-generated narrative or document artifacts; the JSON is the single source of truth. Every record passes the WealthSynth consistency validator (longitudinal continuity, bundle-overlay reconciliation, conditional-privacy-overlay compliance) and the LLM-as-judge gate. The conditional demographic overlay is populated for households tagged for B08 and B26 only. Annual refresh re-runs the full pipeline against current-year tax tables, regulatory guidance, and any catalog additions.

Included Archetypes (71)

F-01 · New Graduate Tech Worker · Formation
F-02 · Gig Economy Starter · Formation
F-03 · Young Dual-Income Couple (No Kids) · Formation
F-04 · First-Generation Wealth Builder · Formation
F-05 · Military Enlisted (Active Duty) · Formation
F-06 · International Worker (H-1B) · Formation
A-01 · Young Family — First Home · Accumulation
A-02 · Single Parent · Accumulation
A-03 · Dual-Income Professional Couple · Accumulation
A-04 · Small Business Owner (Early Stage) · Accumulation
A-05 · Healthcare Professional (Early Career) · Accumulation
A-06 · Tech Employee with Equity · Accumulation
P-01 · Peak Earner — Corporate Executive · Accumulation
P-02 · Established Business Owner · Accumulation
P-03 · Dual High-Income Professionals · Accumulation
P-04 · Real Estate Investor · Accumulation
P-05 · Pre-Retirement Catch-Up · Accumulation
P-06 · Sudden Wealth Recipient · Accumulation
H-01 · Affluent Investor ($1M–$3M) · Accumulation
H-02 · High Net Worth ($3M–$10M) · Accumulation
H-03 · Ultra High Net Worth ($10M+) · Accumulation
H-04 · Widowed HNW Spouse · Transfer
R-01 · Corporate Pre-Retiree (5 Years Out) · Preservation
R-02 · Self-Employed Pre-Retiree · Preservation
R-03 · Government/Teacher Pre-Retiree · Preservation
RE-01 · Active Early Retiree · Distribution
RE-02 · FIRE Achiever (Early Retirement) · Distribution
RE-03 · Pension-Rich Retiree · Distribution
RL-01 · RMD-Stage Retiree · Distribution
RL-02 · Elderly Widow/Widower · Distribution
S-01 · Divorce in Progress · Transfer
S-02 · Bankruptcy Recovery · Formation
S-03 · Medical Debt Crisis · Accumulation
S-04 · Caregiver for Aging Parent · Accumulation
U-01 · Unbanked / Recently Banked · Formation
U-02 · Low-Income Working Family · Accumulation
U-03 · Recent Immigrant (Working) · Formation
E-01 · Millennial Inheritor · Transfer
E-02 · Estate Planning Client (Grantor) · Preservation
E-03 · Next-Gen Wealth Recipient (Teen/Young Adult) · Formation
B-01 · Financial Anxiety / Avoider · Accumulation
B-02 · Overconfident DIY Investor · Accumulation
B-03 · Spender / Lifestyle Inflation · Accumulation
N-01 · Crypto-Heavy Portfolio · Accumulation
N-02 · ESG / Values-Based Investor · Accumulation
N-03 · Professional Athlete / Entertainer · Accumulation
N-04 · Cannabis Industry Worker · Accumulation
RI-01 · Annuity-Dependent Retiree · Distribution
RI-02 · Dividend Income Retiree · Distribution
X-01 · Remote Worker / Digital Nomad · Accumulation
X-02 · Creator Economy / Influencer · Accumulation
X-03 · LGBTQ+ Household · Accumulation
X-04 · Neurodiverse / Disability Household · Accumulation
MB-01 · First-Time Homebuyer · Accumulation
MB-02 · Distressed Mortgage / Underwater Homeowner · Accumulation
MB-03 · Real Estate Investor (DSCR / Portfolio Lender) · Accumulation
SB-01 · LLC / S-Corp Owner (Pass-Through) · Accumulation
SB-02 · Solo Practitioner (Professional Services) · Accumulation
SB-03 · Partnership / Multi-Member Business · Accumulation
HC-01 · HSA Maximizer / HDHP Household · Accumulation
HC-02 · Disability Claimant (SSDI / LTD) · Accumulation
HC-03 · COBRA / Benefits Gap Household · Accumulation
SL-01 · PSLF Candidate (Nonprofit / Government) · Formation
SL-02 · Income-Driven Repayment Enrollee · Formation
SL-03 · Parent PLUS Borrower · Accumulation
MV-02 · Military Officer (Career / Retirement) · Accumulation
MV-03 · Disabled Veteran · Accumulation
CR-01 · Crypto-Heavy / DeFi Investor · Accumulation
ES-01 · ESG / Faith-Based / Impact Investor · Accumulation
BL-01 · Blended Family / Step-Children · Accumulation
AR-01 · Artist / Creative (Royalties & Irregular Income) · Accumulation

Frequently asked questions

What's the price relationship between Master Corpus and individual bundles?

The Master Corpus is priced as a single deliverable rather than as the sum of bundle prices. Its price is roughly equivalent to buying 3–4 bundles individually, but you get all 30 bundle overlays plus the full 96-month longitudinal data on every household. For organizations whose product surface spans 4+ bundles, the Master Corpus is the more economical buy.

Is the longitudinal data calendar-anchored?

No. Each household's 96-month longitudinal series is anchored to a relative `month_0` rather than a calendar date. This avoids baking COVID-era assumptions or any specific economic regime into the corpus. Calendar-anchored data is available on request for backtesting against specific historical periods.
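
Because the series is indexed relative to `month_0`, a buyer who wants calendar labels for their own analysis can project the index onto any anchor month. The sketch below assumes `month_index` counts up from 0 and uses a hypothetical helper, `toCalendarMonth`. It only relabels months; it does not make the data reflect the economic conditions of the chosen period.

```javascript
// Maps a relative month index onto a calendar month, given an anchor chosen
// by the analyst. anchorMonth is 1-12. Returns a "YYYY-MM" string.
function toCalendarMonth(monthIndex, anchorYear, anchorMonth) {
  const total = anchorYear * 12 + (anchorMonth - 1) + monthIndex;
  const year = Math.floor(total / 12);
  const month = (total % 12) + 1;
  return `${year}-${String(month).padStart(2, "0")}`;
}

// Anchoring month_0 at January 2016 places the 96th snapshot in December 2023.
console.log(toCalendarMonth(0, 2016, 1));  // 2016-01
console.log(toCalendarMonth(95, 2016, 1)); // 2023-12
```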

Does the corpus include synthetic documents (W-2s, 1040s, bank statements, etc.)?

No. The corpus ships as structured JSON profiles only — every income, account, transaction, and tax field is present in the schema, but no rendered documents (W-2 PDFs, 1040 forms, bank statements, brokerage statements, SSA-1099s, etc.) are included. The earlier document-generation pipeline produced artifacts that were inconsistent with the canonical JSON, so we removed it. Documents may return in a future release as pure projections of the JSON; until then, buyers are free to render their own documents from the structured fields.

Is the conditional privacy overlay populated?

Yes — but only for households tagged for B08 (ESG Values Alignment Auditor) and B26 (Faith-Based & International Households). Per the v4 privacy contract, race/ethnicity and religion fields are NOT in the default household record; they appear only as a conditional overlay for the bundle scopes where the planning question makes them relevant.
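
A receiving pipeline might want to confirm this contract on ingestion. The following sketch assumes the overlay lives under a hypothetical `conditional_demographics` key (the actual field name is defined in the schema documentation, not on this page) and flags any household that carries it without a B08 or B26 tag.

```javascript
// Bundles whose scope permits the conditional demographic overlay.
const PRIVACY_BUNDLES = new Set(["B08", "B26"]);

// Returns households that carry the overlay without a B08 or B26 tag.
// `conditional_demographics` is an assumed field name, not the shipped one.
function findPrivacyViolations(households) {
  return households.filter((h) => {
    const hasOverlay = h.conditional_demographics != null;
    const eligible = h.bundle_tags.some((tag) => PRIVACY_BUNDLES.has(tag));
    return hasOverlay && !eligible;
  });
}

// Hypothetical fixture: only hh-3 breaks the contract.
const privacyFixture = [
  { id: "hh-1", bundle_tags: ["B08"], conditional_demographics: {} },
  { id: "hh-2", bundle_tags: ["B02"] },
  { id: "hh-3", bundle_tags: ["B02"], conditional_demographics: {} },
];

console.log(findPrivacyViolations(privacyFixture).map((h) => h.id)); // [ 'hh-3' ]
```

An empty result on the delivered corpus would be the expected outcome under the contract described above.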

How is the corpus updated over time?

We re-run the full pipeline against current-year tax tables, regulatory guidance updates, and any new archetype variants added during the year. Each refresh delivers a new corpus version with structural improvements and current-year calibration; existing households are not edited in place. Refresh cadence and scope for buyers are still being defined — reach out if regulatory currency is central to your purchase decision.

Can I get a subset of the Master Corpus?

The Master Corpus is sold as a complete deliverable; it isn't subdivided. The natural subsets are the 30 individual bundles (B01–B30), each of which is sold separately at its own price. If you need a custom-curated slice (e.g., 'every household tagged for both B02 and B16'), filter the Master Corpus directly using the `bundle_tags` field on each household.
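
The custom slice described in this answer can be sketched as a small filter. The `withAllBundles` helper and the fixture are hypothetical; only the `bundle_tags` field comes from the corpus as described on this page.

```javascript
// Selects households tagged for every bundle listed in `required`.
function withAllBundles(households, required) {
  return households.filter((h) => {
    const tags = new Set(h.bundle_tags);
    return required.every((bundle) => tags.has(bundle));
  });
}

// Hypothetical fixture for illustration.
const sliceFixture = [
  { id: "hh-1", bundle_tags: ["B02", "B16", "B21"] },
  { id: "hh-2", bundle_tags: ["B02"] },
  { id: "hh-3", bundle_tags: ["B16", "B02"] },
];

const b02AndB16 = withAllBundles(sliceFixture, ["B02", "B16"]);
console.log(b02AndB16.map((h) => h.id)); // [ 'hh-1', 'hh-3' ]
```

Passing the required bundles as an array keeps the same helper usable for slices of three or more overlays.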

What's the total file-size delivery?

The corpus delivers approximately 8 GB of structured data: 1,451 household JSON files plus longitudinal series, bundle-overlay data, and schema documentation. The Parquet version is compressed and substantially smaller, which suits analytical use. The accompanying Methodology PDF adds a few MB.

Can academic researchers cite the Master Corpus?

Yes. Academic publications using the Master Corpus may cite WealthSchema as the data source. Citation guidance is included in the documentation. The Data License permits use in published research; commercial republication of the underlying data is restricted. Reach out about academic-research-use licensing for any use case that goes beyond standard publication.

$12,500
one-time purchase
1,451 households (ZIP)
Methodology PDF
JSON, CSV, Parquet formats
Account required to purchase

Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.