v4 corpus contract

How synthetic households are built and validated

Synthetic household data is the foundation behind every Wealth Data Set on this site. This page is the public methodology index — written for compliance teams evaluating the data for examiner use, for engineering teams integrating the JSON, and for academic researchers citing the corpus.

The corpus holds itself to a strict internal-consistency contract: if a generated record disagrees with itself in any way — arithmetic, schema, narrative — it is rejected and regenerated. The result is a dataset your data team can trust without spot-checking.

1,451 households validated

P0 failures at ship

monthly snapshots per household

purpose-built bundles

Six commitments behind every record

These commitments are codified in the generation pipeline and tested on every refresh. They are the answer to the only question that matters when buyers evaluate synthetic data: can I trust this?

Schema-first generation

Every household is generated from a single canonical Zod schema. If a record fails the schema, it is discarded and regenerated, so invalid data never touches disk.
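
The reject-and-regenerate loop described above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the `CandidateHousehold` fields, `parseCandidate`, `generateValid`, and the retry limit are all hypothetical, and a hand-written check stands in for the Zod schema so the example has no dependencies.

```typescript
// Hypothetical, minimal household shape; the real canonical schema is not shown on this page.
interface CandidateHousehold {
  id: string;
  assets: number;
  liabilities: number;
  netWorth: number;
}

// True only for finite numbers (rejects strings, NaN, Infinity, undefined).
function isAmount(v: unknown): v is number {
  return typeof v === "number" && isFinite(v);
}

// Hand-written stand-in for a schema parse step (assumption: the real pipeline uses Zod here).
// Returns the typed household, or null when any field is missing or mistyped.
function parseCandidate(raw: unknown): CandidateHousehold | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { id, assets, liabilities, netWorth } = raw as Record<string, unknown>;
  if (typeof id !== "string" || id.length === 0) return null;
  if (!isAmount(assets) || assets < 0) return null;
  if (!isAmount(liabilities) || liabilities < 0) return null;
  if (!isAmount(netWorth)) return null;
  return { id, assets, liabilities, netWorth };
}

// Call the generator until a candidate parses; rejected candidates are simply dropped.
function generateValid(
  generate: (attempt: number) => unknown,
  maxAttempts: number = 10,
): CandidateHousehold {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const parsed = parseCandidate(generate(attempt));
    if (parsed !== null) return parsed;
  }
  throw new Error(`no schema-valid household after ${maxAttempts} attempts`);
}

// Usage: a toy generator whose first output is malformed and whose second is valid.
const toyGenerator = (attempt: number): unknown =>
  attempt === 0
    ? { id: "hh-001", assets: "a lot" }
    : { id: "hh-001", assets: 500000, liabilities: 120000, netWorth: 380000 };

const acceptedHousehold = generateValid(toyGenerator);
```

In this sketch the malformed first candidate is never returned to the caller, which mirrors the idea that only schema-valid records reach storage.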

Consistency contract

All downstream artifacts — overlays, longitudinal trajectories, tax calculations — are pure projections of the canonical household JSON. No secondary math, no renderer drift.
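
One way to picture "pure projections" is a set of functions that take the canonical record and return derived values without storing or recomputing anything independently. The sketch below assumes a hypothetical `CanonicalHousehold` shape and a deliberately simple linear savings model; neither is taken from the actual pipeline.

```typescript
// Hypothetical canonical record; field names are illustrative only.
interface CanonicalHousehold {
  readonly id: string;
  readonly assets: number;
  readonly liabilities: number;
  readonly monthlySavings: number;
}

// Overlay projection: derived only from canonical fields, with no stored state.
function netWorthOverlay(h: CanonicalHousehold): { id: string; netWorth: number } {
  return { id: h.id, netWorth: h.assets - h.liabilities };
}

// Longitudinal projection: one value per month. It reuses netWorthOverlay rather than
// repeating the subtraction, so there is a single definition of net worth.
function projectSnapshots(h: CanonicalHousehold, months: number): number[] {
  const start = netWorthOverlay(h).netWorth;
  const snapshots: number[] = [];
  for (let m = 1; m <= months; m++) {
    snapshots.push(start + h.monthlySavings * m);
  }
  return snapshots;
}

const canonical: CanonicalHousehold = {
  id: "hh-002",
  assets: 400000,
  liabilities: 150000,
  monthlySavings: 1000,
};

const overlay = netWorthOverlay(canonical);        // { id: "hh-002", netWorth: 250000 }
const trajectory = projectSnapshots(canonical, 3); // [251000, 252000, 253000]
```

Because both projections read from the same input and share one net-worth definition, they cannot disagree with each other, which is the property the consistency contract is after.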

Strict validation gate

Two-pass validation: deterministic checks for arithmetic and schema, then LLM-assisted review for narrative coherence. Any warning fails the household — no soft passes.
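
A rough sketch of the two-pass gate, under stated assumptions: the `ReviewedHousehold` fields and the specific checks are hypothetical, the narrative reviewer is a synchronous stub standing in for the LLM-assisted pass (a real model call would almost certainly be asynchronous), and skipping pass two after a pass-one failure is a choice made for this example rather than something the page specifies.

```typescript
// Hypothetical record shape for illustration.
interface ReviewedHousehold {
  assets: number;
  liabilities: number;
  netWorth: number;
  narrative: string;
}

// A reviewer returns a list of warnings; an empty list means no concerns.
type NarrativeReviewer = (h: ReviewedHousehold) => string[];

// Pass 1: deterministic arithmetic checks.
function deterministicWarnings(h: ReviewedHousehold): string[] {
  const warnings: string[] = [];
  if (h.netWorth !== h.assets - h.liabilities) {
    warnings.push("netWorth does not equal assets minus liabilities");
  }
  if (h.assets < 0 || h.liabilities < 0) {
    warnings.push("negative balance");
  }
  return warnings;
}

// The gate: any warning from either pass fails the household. There is no soft pass.
function passesGate(h: ReviewedHousehold, review: NarrativeReviewer): boolean {
  if (deterministicWarnings(h).length > 0) return false; // assumption: skip pass 2 on failure
  return review(h).length === 0;
}

// Stub reviewer standing in for the LLM-assisted pass (hypothetical rule).
const stubReviewer: NarrativeReviewer = (h) =>
  h.narrative.indexOf("debt-free") !== -1 && h.liabilities > 0
    ? ["narrative claims debt-free but liabilities are non-zero"]
    : [];

const coherent: ReviewedHousehold = {
  assets: 300000,
  liabilities: 100000,
  netWorth: 200000,
  narrative: "Dual-income couple paying down a mortgage.",
};
const contradictory: ReviewedHousehold = { ...coherent, narrative: "A debt-free retiree." };
const badMath: ReviewedHousehold = { ...coherent, netWorth: 250000 };

const gateResults = [
  passesGate(coherent, stubReviewer),      // true
  passesGate(contradictory, stubReviewer), // false: narrative warning
  passesGate(badMath, stubReviewer),       // false: arithmetic warning
];
```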

Field-level provenance

Every field carries documented type, range, and derivation logic. Methodology PDFs ship with every Data Set so your data team can audit any number end-to-end.

Refreshes with transparent diffs

Tax law changes, market shifts, and new archetypes flow through the same pipeline. Refreshed corpus versions ship with a changelog showing exactly what moved and why. Cadence and pricing are still being defined.
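
As an illustration of what a field-level changelog between two corpus versions might look like, here is a minimal sketch. The flat record shape, the `diffHousehold` helper, the field names, and all values are hypothetical; the actual changelog format is not described on this page.

```typescript
type FieldValue = number | string;
type FlatHousehold = { [field: string]: FieldValue };

// One changelog row: what moved, from what, to what, and the stated reason.
interface ChangelogEntry {
  field: string;
  before: FieldValue | undefined; // undefined means the field was added
  after: FieldValue | undefined;  // undefined means the field was removed
  reason: string;
}

// Compare two versions of the same household and list every field that changed.
function diffHousehold(
  before: FlatHousehold,
  after: FlatHousehold,
  reason: string,
): ChangelogEntry[] {
  const fields = Object.keys(before);
  for (const key of Object.keys(after)) {
    if (!(key in before)) fields.push(key);
  }
  return fields
    .sort()
    .filter((field) => before[field] !== after[field])
    .map((field) => ({ field, before: before[field], after: after[field], reason }));
}

// Illustrative values only; not real tax figures.
const previousVersion: FlatHousehold = {
  filingStatus: "single",
  taxableIncome: 90000,
  estimatedTax: 15000,
};
const refreshedVersion: FlatHousehold = {
  filingStatus: "single",
  taxableIncome: 90000,
  estimatedTax: 14600,
  creditsApplied: 400,
};

const changelog = diffHousehold(previousVersion, refreshedVersion, "illustrative rule update");
// Two entries, sorted by field name: creditsApplied (added), estimatedTax (15000 to 14600).
```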

Synthetic by construction

No real individuals, no GDPR exposure, no data use agreements. Sensitive overlays (race/ethnicity, religion) appear only on the bundles that explicitly require them.
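
The rule that sensitive overlays appear only where a bundle explicitly requires them could be enforced with a filter along these lines. The overlay keys, the `BundleSpec` shape, and the bundle names are assumptions made for this sketch.

```typescript
// Hypothetical keys for the sensitive overlays named above.
const SENSITIVE_OVERLAYS = ["raceEthnicity", "religion"] as const;
type SensitiveOverlay = (typeof SENSITIVE_OVERLAYS)[number];

// Hypothetical bundle definition: each bundle lists the sensitive overlays it needs.
interface BundleSpec {
  name: string;
  requiredSensitiveOverlays: SensitiveOverlay[];
}

// Copy a record, dropping any sensitive overlay the bundle did not explicitly request.
function stripUnrequiredOverlays(
  record: Record<string, unknown>,
  bundle: BundleSpec,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(record)) {
    const isSensitive = (SENSITIVE_OVERLAYS as readonly string[]).indexOf(key) !== -1;
    const isRequired = (bundle.requiredSensitiveOverlays as string[]).indexOf(key) !== -1;
    if (!isSensitive || isRequired) out[key] = record[key];
  }
  return out;
}

const fullRecord = {
  id: "hh-003",
  netWorth: 250000,
  religion: "illustrative-value",
  raceEthnicity: "illustrative-value",
};

const generalBundle: BundleSpec = { name: "general-example", requiredSensitiveOverlays: [] };
const researchBundle: BundleSpec = { name: "research-example", requiredSensitiveOverlays: ["religion"] };

const generalView = stripUnrequiredOverlays(fullRecord, generalBundle);   // id, netWorth
const researchView = stripUnrequiredOverlays(fullRecord, researchBundle); // id, netWorth, religion
```

Defaulting to exclusion means a new bundle exposes no sensitive fields unless its definition opts in.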

Methodology documents

Long-form references covering the full generation pipeline, validation logic, and refresh process. Every bundle ships with a per-bundle Methodology PDF documenting field derivations specific to that Data Set.

How Synthetic Households Are Generated, Validated, and Refreshed

The end-to-end methodology behind every Wealth Data Set: archetype-driven generation, multi-stage overlay application, deterministic consistency validation, LLM-as-judge quality gating, longitudinal projection, and corpus refreshes against current regulatory guidance.

Published May 7, 2026

Want the per-bundle methodology PDF?

Every Wealth Data Set ships with a Methodology PDF describing the field derivations, eligibility rules, and statistical calibration for that specific bundle. Browse the catalog to see which bundle fits your use case.

Browse Wealth Data Sets

See the 71 archetypes