The WealthSynth Master Corpus is the foundation every other Wealth Data Set is sliced from. 1,451 synthetic households spanning all 71 archetypes, every bundle overlay applied where eligibility holds, and 96 months of longitudinal data per household — delivered as structured JSON profiles. It is the canonical source-of-truth corpus for organizations building multi-product fintech infrastructure or running structured research programs against synthetic financial data.
Buying individual Wealth Data Sets makes sense when the use case is narrow: Reg BI testing, tax-loss-harvesting backtesting, equity-comp planning. For organizations whose product surface spans multiple bundles, buying them individually is operationally expensive: separate purchase events, separate licence agreements, separate ingestion pipelines, separate refresh cycles. The Master Corpus exists for buyers whose product surface is broad enough that the unit economics have inverted: it is cheaper to buy everything than to buy the relevant subset of bundles individually.
The second use case is harder to articulate but equally important: research and infrastructure that doesn't fit any single bundle scope. An academic research program studying intergenerational wealth transfer needs households with longitudinal data and the structural diversity of the full population; a platform vendor building data-pipeline infrastructure for advisor firms needs the realistic structural diversity that only the full corpus provides; a regulatory consultancy building analytics tooling needs the integrated picture across all 30 bundle overlays. None of these buyers would buy a single bundle; they buy the corpus.
Uses the Master Corpus as the canonical fixture set for data-pipeline development and regression testing. Every advisor-firm customer's data ingestion gets validated against the realistic structural diversity of the 1,451-household corpus before going to production.
Uses the corpus for internal tool development across compliance, tax, estate, and behavioural surfaces — buying once and using across multiple product workstreams.
Studies wealth-related research questions (intergenerational transfer, behavioural patterns, demographic equity) using a corpus where the longitudinal data and structural diversity support research designs that single bundles can't.
Uses the corpus as the institutional training data set for compliance staff onboarding, audit-readiness exercises, and supervisory-process testing — without exposing any real client data to staff in training.
Builds analytics tooling against the full structural diversity of the corpus, demonstrating regulatory analyses to client firms with realistic populations spanning every demographic and life-stage segment regulators care about.
The Master Corpus is everything: all 1,451 households across all 71 archetypes from src/data/archetypes-v3.ts; every applicable bundle overlay (B01–B30) populated for each household based on archetype-bundle-eligibility logic; 96 monthly longitudinal snapshots per household covering income, expenses, balances, and behavioural metrics; the conditional demographic overlay populated for households tagged for B08 and B26; the full schema documentation as a separate deliverable; and the conditional-overlay-and-privacy-contract reference guide.
The deliverable is structured for operational use. Households are organized by archetype with manifest files indicating which bundle overlays apply to each. Schema documentation includes both the field-level reference and the cross-cutting topical guides (longitudinal contract, conditional privacy overlay, audit-trail structure).
The Data Set ships as JSON, CSV, and Parquet, accompanied by the WealthSynth Methodology PDF documenting corpus structure, bundle-eligibility logic, the longitudinal contract, conditional-privacy-contract details, and the integration patterns common to platform-vendor and large-RIA buyers. Annual refresh delivers the full corpus regenerated against current-year tax tables, regulatory guidance, and any new archetype variants added to the catalog.
A redacted summary of one household from this Data Set is shown below. Names, employers, exact balances, and metro area are stripped; ages are bucketed; income and net worth are reported as bands. The full record (and all 1,451 like it) ships in the ZIP.
{
  "all_universal_core_fields": <value>,
  "all_30_bundle_overlays": <value>,
  "longitudinal.monthly[96]": <value>,
  "bundle_tags[]": <value>,
  "schema_documentation": <value>
}
Returns the cross-tabulation of archetype-by-bundle counts in the corpus. This is useful as a structural verification that the corpus delivery is complete and matches the expected eligibility logic.
households.reduce((acc, h) => {
  for (const bundleTag of h.bundle_tags) {
    const key = `${h.archetype_id}-${bundleTag}`;
    acc[key] = (acc[key] || 0) + 1;
  }
  return acc;
}, {})
Returns households tagged for 4+ bundle overlays: the structurally complex households where multi-bundle product features can be tested most thoroughly.
households.filter(h =>
  h.bundle_tags.length >= 4
).map(h => ({
  id: h.id,
  archetype: h.archetype_id,
  bundle_count: h.bundle_tags.length,
  bundles: h.bundle_tags
}))
For each household, checks that exactly 96 monthly snapshots are present and that consecutive snapshots carry contiguous month_index values, returning the households that fail either check. On its own, this query does not test whether month-over-month transitions (balance changes, income flows, behavioural-score continuity) are mathematically consistent with the underlying financial structure.
households.filter(h =>
  h.longitudinal.monthly.length !== 96 ||
  !h.longitudinal.monthly.every((m, i, arr) =>
    // Each snapshot after the first must follow its predecessor by exactly one month.
    i === 0 ||
    m.month_index - arr[i - 1].month_index === 1)
)
Returns households whose bundle_tags contradict their populated overlay fields. This is useful for delivery-completeness verification across the full 30-overlay surface.
households.filter(h => {
  const declared = new Set(h.bundle_tags);
  const populated = Object.keys(h.overlays || {});
  return populated.some(b => !declared.has(b)) ||
    [...declared].some(b => !populated.includes(b));
})
The Master Corpus is the canonical source: every other Wealth Data Set is a curated subset of this corpus with bundle-specific overlay emphasis. Generation runs through the WealthSynth pipeline with all 30 bundle overlays applied where archetype eligibility holds. Longitudinal data is generated as 96-month sequences per household with realistic seasonal patterns, life-event triggers, and behavioural trajectories. Every field is deterministically derived from the household's archetype template and applicable bundle overlays; there are no LLM-generated narrative or document artifacts, and the JSON is the single source of truth. Every record passes the WealthSynth consistency validator (longitudinal continuity, bundle-overlay reconciliation, conditional-privacy-overlay compliance) and the LLM-as-judge gate. The conditional demographic overlay is populated for households tagged for B08 and B26 only. Annual refresh re-runs the full pipeline against current-year tax tables, regulatory guidance, and any catalog additions.
The Master Corpus is priced as a single deliverable rather than as the sum of bundle prices. Its price is roughly equivalent to that of 3-4 bundles bought individually, yet it includes all 30 bundle overlays plus the full 96-month longitudinal data on every household. For organizations whose product surface spans 4+ bundles, the Master Corpus is the more economical buy.
No. Each household's 96-month longitudinal series is anchored to a relative `month_0` rather than a calendar date. This avoids baking COVID-era assumptions or any specific economic regime into the corpus. Calendar-anchored data is available on request for backtesting against specific historical periods.
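Because the series is relative, a buyer who wants calendar labels has to choose an anchor. The sketch below shows one way to do that. It assumes `month_index` starts at 0 for `month_0` and increases by one per snapshot; the helper name `toCalendarMonth` and the anchor values are hypothetical and not part of the corpus.

```javascript
// Hypothetical helper: map a relative month index onto a calendar month label.
// Assumption: month_index is 0 at month_0 and advances by 1 per monthly snapshot.
// anchorYear and anchorMonth (1-12) are chosen by the buyer, not read from the corpus.
function toCalendarMonth(anchorYear, anchorMonth, monthIndex) {
  const zeroBased = (anchorMonth - 1) + monthIndex;
  const year = anchorYear + Math.floor(zeroBased / 12);
  // Double modulo keeps the result in 0..11 even for negative offsets.
  const month = ((zeroBased % 12) + 12) % 12 + 1;
  return `${year}-${String(month).padStart(2, "0")}`;
}

// Label the first and last snapshot of a 96-month series anchored at January 2015.
const firstLabel = toCalendarMonth(2015, 1, 0);
const lastLabel = toCalendarMonth(2015, 1, 95);
console.log(firstLabel, lastLabel);
```

Keeping the anchor outside the data means the same household series can be replayed against different start dates without regenerating anything.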
No. The corpus ships as structured JSON profiles only — every income, account, transaction, and tax field is present in the schema, but no rendered documents (W-2 PDFs, 1040 forms, bank statements, brokerage statements, SSA-1099s, etc.) are included. The earlier document-generation pipeline produced artifacts that were inconsistent with the canonical JSON, so we removed it. Documents may return in a future release as pure projections of the JSON; until then, buyers are free to render their own documents from the structured fields.
Yes — but only for households tagged for B08 (ESG Values Alignment Auditor) and B26 (Faith-Based & International Households). Per the v4 privacy contract, race/ethnicity and religion fields are NOT in the default household record; they appear only as a conditional overlay for the bundle scopes where the planning question makes them relevant.
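A buyer could confirm this rule on delivery with a check along the following lines. This is a minimal sketch: the field name `demographic_overlay` is a placeholder (the real field name is given in the schema documentation), and it assumes `bundle_tags` holds identifiers such as `"B08"`.

```javascript
// Bundles for which the conditional demographic overlay is expected.
const CONDITIONAL_BUNDLES = ["B08", "B26"];

// Hypothetical check: return households that carry the overlay without being
// tagged for B08 or B26. "demographic_overlay" is an assumed field name.
function findOverlayLeaks(households, overlayField = "demographic_overlay") {
  return households.filter(h => {
    const eligible = h.bundle_tags.some(t => CONDITIONAL_BUNDLES.includes(t));
    const hasOverlay = h[overlayField] !== undefined && h[overlayField] !== null;
    return hasOverlay && !eligible;
  });
}

// Illustrative records only; ids and values are made up.
const overlaySample = [
  { id: "hh-a", bundle_tags: ["B08", "B12"], demographic_overlay: { present: true } },
  { id: "hh-b", bundle_tags: ["B02"] },
  { id: "hh-c", bundle_tags: ["B02"], demographic_overlay: { present: true } }
];
const leaks = findOverlayLeaks(overlaySample);
console.log(leaks.map(h => h.id));
```

An empty result is what the privacy contract described above would lead you to expect from a real delivery.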
We re-run the full pipeline against current-year tax tables, regulatory guidance updates, and any new archetype variants added during the year. Each refresh delivers a new corpus version with structural improvements and current-year calibration; existing households are not edited in place. Refresh cadence and scope for buyers are still being defined — reach out if regulatory currency is central to your purchase decision.
The Master Corpus is sold as a complete deliverable — it isn't subdivided. The natural subsets are the 30 individual bundles (B01–B30), each of which is sold separately at its own price. If you need a custom-curated slice (e.g., 'every household tagged for both B02 and B16'), filter the Master Corpus directly using the bundle-tags field on each household.
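The custom slice mentioned above can be sketched as a small filter over `bundle_tags`. The helper name `withAllBundles` and the sample records are illustrative, and the code assumes tags are stored as strings such as `"B02"`.

```javascript
// Hypothetical helper: keep households whose bundle_tags include every required tag.
function withAllBundles(households, requiredTags) {
  return households.filter(h => {
    const tags = new Set(h.bundle_tags);
    return requiredTags.every(t => tags.has(t));
  });
}

// Illustrative records only; ids and archetype ids are made up.
const sliceSample = [
  { id: "hh-1", archetype_id: "A01", bundle_tags: ["B02", "B16", "B21"] },
  { id: "hh-2", archetype_id: "A07", bundle_tags: ["B02"] },
  { id: "hh-3", archetype_id: "A07", bundle_tags: ["B16", "B02"] }
];

// Every household tagged for both B02 and B16.
const b02AndB16 = withAllBundles(sliceSample, ["B02", "B16"]);
console.log(b02AndB16.map(h => h.id));
```

Passing the required tags as an array lets the same helper build any intersection of bundles, not only pairs.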
The corpus delivers approximately 8 GB of structured data: 1,451 household JSON files plus longitudinal series, bundle-overlay data, and schema documentation. The Parquet files are substantially smaller thanks to compression, which suits analytical use. The accompanying Methodology PDF adds a few MB.
Yes. Academic publications using the Master Corpus may cite WealthSchema as the data source. Citation guidance is included in the documentation. The Data License permits use in published research; commercial republication of the underlying data is restricted. Reach out about academic-research-use licensing for any use case that goes beyond standard publication.
130 synthetic households tuned for Reg BI suitability testing — concentrated holdings, age 75+, recent inheritance, cognitive decline markers, and risk-mismatch flags. Each record carries the eligibility triggers required to exercise broker-dealer supervisory workflows end to end.
110 HNW and UHNW households with estate planning readiness scores, trust structures, gifting histories, charitable giving data, and GST exemption tracking. Complements B09 (Next-Gen Attrition) and B12 (Estate & Trust Planning).
400 prospect households covering RIA client variety from formation through retirement. KYC-complete records, goal-based planning fields, initial recommendation outputs, and CRM-compatible field naming. The broadest single bundle by archetype coverage.
Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.