
Master Corpus

Name: Master Corpus
Brand: WealthSchema
SKU: B31
Price: 12500.00 USD
Availability: InStock

The Master Corpus is the foundation every other Wealth Data Set is sliced from. It contains 1,451 synthetic households spanning all 71 archetypes, with every bundle overlay applied where eligibility holds and 96 months of longitudinal data per household, delivered as structured JSON profiles. It is the canonical source-of-truth corpus for organizations building multi-product fintech infrastructure or running structured research programs against synthetic financial data.

Households

1,451

Archetypes

71

Formats

JSON, CSV, Parquet

Deviation

High

Why this Data Set exists

Buying individual Wealth Data Sets makes sense when the use case is narrow: Reg BI testing, tax-loss-harvesting backtesting, or equity-comp planning. For organizations whose product surface spans multiple bundles, buying them individually is operationally expensive: separate purchase events, separate licence agreements, separate ingestion pipelines, separate refresh cycles. The Master Corpus exists for buyers whose product surface is broad enough that the unit economics have inverted: buying everything is cheaper than buying the relevant subset of bundles individually.

The second use case is harder to articulate but equally important: research and infrastructure that doesn't fit any single bundle scope. An academic research program studying intergenerational wealth transfer needs households with longitudinal data and the structural diversity of the full population; a platform vendor building data-pipeline infrastructure for advisor firms needs the realistic structural diversity that only the full corpus provides; a regulatory consultancy building analytics tooling needs the integrated picture across all 30 bundle overlays. None of these buyers is served by a single bundle; they need the corpus.

Use Cases

Platform vendor data infrastructure

Academic research programs

Multi-product fintech development

Internal compliance / training data

Who uses this Data Set

Platform Vendor Building Data-Pipeline Infrastructure

Uses the Master Corpus as the canonical fixture set for data-pipeline development and regression testing. Every advisor-firm customer's data ingestion gets validated against the realistic structural diversity of the 1,451-household corpus before going to production.

Large RIA Building Multi-Product Tools

Uses the corpus for internal tool development across compliance, tax, estate, and behavioural surfaces — buying once and using across multiple product workstreams.

Academic Research Program

Studies wealth-related research questions (intergenerational transfer, behavioural patterns, demographic equity) using a corpus where the longitudinal data and structural diversity support research designs that single bundles can't.

Internal Compliance and Training Data Function

Uses the corpus as the institutional training data set for compliance staff onboarding, audit-readiness exercises, and supervisory-process testing — without exposing any real client data to staff in training.

Regulatory Consultancy

Builds analytics tooling against the full structural diversity of the corpus, demonstrating regulatory analyses to client firms with realistic populations spanning every demographic and life-stage segment regulators care about.

What's inside

The Master Corpus is everything: all 1,451 households across all 71 archetypes from src/data/archetypes-v3.ts; every applicable bundle overlay (B01–B30) populated for each household based on archetype-bundle-eligibility logic; 96 monthly longitudinal snapshots per household covering income, expenses, balances, and behavioural metrics; the conditional demographic overlay populated for households tagged for B08 and B26; the full schema documentation as a separate deliverable; and the conditional-overlay-and-privacy-contract reference guide.

The deliverable is structured for operational use. Households are organized by archetype with manifest files indicating which bundle overlays apply to each. Schema documentation includes both the field-level reference and the cross-cutting topical guides (longitudinal contract, conditional privacy overlay, audit-trail structure).

The Data Set ships as JSON, CSV, and Parquet, accompanied by the Methodology PDF documenting corpus structure, bundle-eligibility logic, the longitudinal contract, conditional-privacy-contract details, and the integration patterns common to platform-vendor and large-RIA buyers. Each refresh delivers the full corpus regenerated against current-year tax tables, regulatory guidance, and any new archetype variants added to the catalog.

Preview a sample household

A redacted summary of one household from this Data Set: names, employers, exact balances, and metro area are stripped. Ages are bucketed, and income and net worth are reported as bands. The full record (and all 1,451 like it) ships in the ZIP.

F-01·New Graduate Tech Worker

representative archetype household

Household

Single

State

Gross income (band)

$50k–$100k

Net worth (band)

—

Dependents

Income source types

w2 salary, w2 bonus

Members (1)

primary

Age 25–29

professional services
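The bucketing and banding used for the preview can be sketched as follows. This is a minimal illustration, not the vendor's redaction code: the helper names (`ageBucket`, `incomeBand`, `formatAmount`), the 5-year age width, and the band edges are assumptions chosen only so that the output matches the sample card above ("Age 25–29", "$50k–$100k").

```javascript
// Hypothetical helpers. Bucket width and band edges are assumptions
// inferred from the sample card, not documented corpus parameters.
const INCOME_BAND_EDGES = [0, 50000, 100000, 250000, 500000, 1000000];

// Map an exact age to a fixed-width bucket label, e.g. 27 -> "25–29".
function ageBucket(age, width = 5) {
  const low = Math.floor(age / width) * width;
  return `${low}–${low + width - 1}`;
}

// Format a dollar amount as "$50k" or "$1M".
function formatAmount(amount) {
  return amount >= 1000000 ? `$${amount / 1000000}M` : `$${amount / 1000}k`;
}

// Map an exact amount to the band whose lower edge is the largest edge <= amount.
function incomeBand(amount, edges = INCOME_BAND_EDGES) {
  for (let i = edges.length - 1; i >= 0; i--) {
    if (amount >= edges[i]) {
      const upper = edges[i + 1];
      return upper === undefined
        ? `${formatAmount(edges[i])}+`
        : `${formatAmount(edges[i])}–${formatAmount(upper)}`;
    }
  }
  return "—"; // negative, missing, or non-numeric input
}
```

With these assumed edges, `ageBucket(27)` yields `"25–29"` and `incomeBand(72000)` yields `"$50k–$100k"`. A value that falls below every edge, or is missing, comes back as the same "—" placeholder the card shows for an unreported band.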

Technical Highlights

→Every bundle overlay across all 1,451 households

→96 monthly longitudinal snapshots per household

→Structured JSON, CSV, and Parquet — single source of truth

→Single deliverable, not sum of bundle prices

Sample Schema Fields

sample_record.json

{
  "all_universal_core_fields": <value>,
  "all_30_bundle_overlays": <value>,
  "longitudinal.monthly[96]": <value>,
  "bundle_tags[]": <value>,
  "schema_documentation": <value>
}
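To try the sample queries on this page before a delivery is in hand, a small fixture can stand in for the corpus. The sketch below is an assumption-laden placeholder: `makeFixtureHousehold` is a hypothetical helper, the field names (`id`, `archetype_id`, `bundle_tags`, `overlays`, `longitudinal.monthly[].month_index`) are the ones the sample queries reference, and every value is invented for testing rather than taken from the corpus.

```javascript
// Hypothetical fixture builder. Field names come from the sample queries on
// this page; the values are placeholders, not real corpus content.
function makeFixtureHousehold(id, archetypeId, bundleTags, months = 96) {
  return {
    id,
    archetype_id: archetypeId,
    bundle_tags: [...bundleTags],
    // One empty overlay object per declared bundle tag.
    overlays: Object.fromEntries(bundleTags.map((tag) => [tag, {}])),
    longitudinal: {
      // Starting month_index at 0 is an assumption.
      monthly: Array.from({ length: months }, (_, i) => ({ month_index: i })),
    },
  };
}

const households = [
  makeFixtureHousehold("hh-0001", "F-01", ["B01", "B08"]),
  makeFixtureHousehold("hh-0002", "F-01", ["B01", "B02", "B08", "B26"]),
];
```

The real records carry far more than this skeleton; the fixture only needs enough structure for the queries to run end to end.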

Sample queries

Verify archetype-and-bundle coverage

Returns the cross-tabulation of archetype-by-bundle counts in the corpus — useful as a structural verification that the corpus delivery is complete and matches the expected eligibility logic.

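// Count households per archetype/bundle pair.
const coverage = households.reduce((acc, h) => {
  // Treat a missing bundle_tags array as "no bundles" rather than throwing.
  for (const bundleTag of h.bundle_tags || []) {
    const key = `${h.archetype_id}-${bundleTag}`;
    acc[key] = (acc[key] || 0) + 1;
  }
  return acc;
}, {});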
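The flat `archetype-bundle` key used in the query above is awkward to split back apart if archetype identifiers themselves contain a hyphen, as the "F-01" label in the preview suggests they might. A nested variant avoids the delimiter question. This is a sketch under that assumption, and `coverageMatrix` is a hypothetical name:

```javascript
// Hypothetical helper: builds { archetype_id: { bundle_tag: count } }
// so that no string delimiter is needed between the two identifiers.
function coverageMatrix(households) {
  const matrix = {};
  for (const h of households) {
    if (!matrix[h.archetype_id]) {
      matrix[h.archetype_id] = {};
    }
    const row = matrix[h.archetype_id];
    for (const tag of h.bundle_tags || []) {
      row[tag] = (row[tag] || 0) + 1;
    }
  }
  return matrix;
}

// Usage with placeholder households:
const sampleHouseholds = [
  { archetype_id: "F-01", bundle_tags: ["B01", "B08"] },
  { archetype_id: "F-01", bundle_tags: ["B01"] },
];
const matrix = coverageMatrix(sampleHouseholds);
```

An archetype whose households carry no bundle tags still gets an empty row, which makes gaps in coverage visible instead of silently absent.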

Surface households spanning multiple bundle overlays

Returns households tagged for 4+ bundle overlays — the structurally complex households where multi-bundle product features can be tested most thoroughly.

households.filter(h =>
  h.bundle_tags.length >= 4
).map(h => ({
  id: h.id,
  archetype: h.archetype_id,
  bundle_count: h.bundle_tags.length,
  bundles: h.bundle_tags
}))

Validate longitudinal-data integrity

Returns households that fail the structural part of the longitudinal contract: a snapshot count other than 96, or month indices that do not advance one month at a time. Month-over-month consistency of the values themselves (balance changes, income flows, behavioural-score continuity) requires field-level checks beyond this query.

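// Flag households with a wrong snapshot count or non-consecutive month indices.
households.filter(h =>
  h.longitudinal.monthly.length !== 96 ||
  !h.longitudinal.monthly.every((m, i, arr) =>
    i === 0 ||
    // Require a strict +1 step; Math.abs would also accept descending indices.
    m.month_index - arr[i - 1].month_index === 1)
)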
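A value-level continuity check might look like the sketch below. Everything specific in it is an assumption: the per-snapshot field names (`balance`, `income`, `expenses`) are hypothetical, and the identity used (this month's balance equals last month's balance plus income minus expenses) is a deliberate simplification. Real snapshots with investment returns or transfers would need additional terms.

```javascript
// Hypothetical field names: balance, income, expenses (per monthly snapshot).
// Assumes the simplified identity:
//   balance[i] = balance[i - 1] + income[i] - expenses[i]
function findBalanceBreaks(monthly, tolerance = 0.01) {
  const breaks = [];
  for (let i = 1; i < monthly.length; i++) {
    const expected =
      monthly[i - 1].balance + monthly[i].income - monthly[i].expenses;
    if (Math.abs(monthly[i].balance - expected) > tolerance) {
      breaks.push({
        month_index: monthly[i].month_index,
        expected,
        actual: monthly[i].balance,
      });
    }
  }
  return breaks;
}

// Usage with placeholder snapshots; month 2 is off by 100.
const snapshots = [
  { month_index: 0, balance: 1000, income: 0, expenses: 0 },
  { month_index: 1, balance: 1500, income: 800, expenses: 300 },
  { month_index: 2, balance: 1900, income: 800, expenses: 300 },
];
const breaks = findBalanceBreaks(snapshots);
```

Returning the expected and actual values alongside the month index makes each break directly reviewable, which is usually more useful than a pass/fail flag per household.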

Audit bundle-overlay completeness

Returns households whose bundle_tags contradict their populated overlay fields — useful for delivery-completeness verification across the full 30-overlay surface.

// Flag households whose declared bundle tags and populated overlays differ.
households.filter(h => {
  const declared = new Set(h.bundle_tags || []);
  const populated = new Set(Object.keys(h.overlays || {}));
  return [...populated].some(b => !declared.has(b)) ||
    [...declared].some(b => !populated.has(b));
})
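The filter above says which households are inconsistent but not how. A sketch that reports the direction of each mismatch follows; `overlayMismatches` and the output property names are hypothetical, and it relies on the same `bundle_tags` and `overlays` fields as the query.

```javascript
// Hypothetical helper: lists declared-but-unpopulated and
// populated-but-undeclared bundles for one household.
function overlayMismatches(household) {
  const declared = new Set(household.bundle_tags || []);
  const populated = new Set(Object.keys(household.overlays || {}));
  return {
    id: household.id,
    missing_overlays: [...declared].filter((b) => !populated.has(b)),
    undeclared_overlays: [...populated].filter((b) => !declared.has(b)),
  };
}

// Usage with a placeholder household: B08 is declared but not populated,
// B12 is populated but not declared.
const report = overlayMismatches({
  id: "hh-0003",
  bundle_tags: ["B01", "B08"],
  overlays: { B01: {}, B12: {} },
});
```

Separating the two directions matters in practice: a missing overlay points at an incomplete delivery, while an undeclared overlay points at a stale or incorrect manifest.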

Methodology

The Master Corpus is the canonical source: every other Wealth Data Set is a curated subset of this corpus with bundle-specific overlay emphasis. The generation pipeline applies all 30 bundle overlays wherever archetype eligibility holds. Longitudinal data is generated as 96-month sequences per household with realistic seasonal patterns, life-event triggers, and behavioural trajectories. Every field is deterministically derived from the household's archetype template and applicable bundle overlays. The corpus contains no LLM-generated narrative or document artifacts; the JSON is the single source of truth. Every record passes the consistency validator (longitudinal continuity, bundle-overlay reconciliation, conditional-privacy-overlay compliance) and the LLM-as-judge gate. The conditional demographic overlay is populated only for households tagged for B08 and B26. Each refresh re-runs the full pipeline against current-year tax tables, regulatory guidance, and any catalog additions.
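The conditional-overlay rule stated above lends itself to a buyer-side check. The sketch below makes two labeled assumptions: the overlay lives in a hypothetical `demographic_overlay` field, and "tagged for B08 and B26" is read as "tagged for either bundle". If the contract means something else, such as requiring both tags, the eligibility test would need to change.

```javascript
// Bundles that gate the conditional demographic overlay, per the methodology.
const CONDITIONAL_BUNDLES = ["B08", "B26"];

// Hypothetical check: returns households where eligibility and population
// disagree. Assumes the overlay is stored under `demographic_overlay` and
// that a tag for either bundle makes a household eligible.
function demographicOverlayViolations(households) {
  return households.filter((h) => {
    const eligible = (h.bundle_tags || []).some((tag) =>
      CONDITIONAL_BUNDLES.includes(tag)
    );
    const populated = h.demographic_overlay != null;
    return eligible !== populated;
  });
}

// Usage with placeholder households: only "c" violates the rule.
const violations = demographicOverlayViolations([
  { id: "a", bundle_tags: ["B08"], demographic_overlay: {} },
  { id: "b", bundle_tags: ["B01"] },
  { id: "c", bundle_tags: ["B01"], demographic_overlay: {} },
]);
```

The check flags both directions: an overlay present on an ineligible household, and an eligible household with no overlay.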