Term

Synthetic Data Validation

Published May 7, 2026
Definition

Synthetic data validation is the process of verifying that generated synthetic data meets schema, referential, and business-rule constraints before it is promoted to a production catalog. Common validation categories are arithmetic identities, internal consistency, narrative richness, and archetype fidelity. Under WealthSchema's strict-fail policy, any warning fails the household, with no exceptions.

Synthetic data validation is the difference between audit-grade datasets and toy datasets. Without strict validation, generated data accumulates errors: net worth that doesn't match assets minus liabilities, FICO scores inconsistent with payment history, ages inconsistent with life events, and dollar amounts that don't reconcile across sections. These errors compound when buyers depend on the data for backtesting: a single bad household corrupts batch-level analytics.

The WealthSchema validation framework operates at four levels:

  • Arithmetic: net_worth = assets − liabilities, within ±$100; expense components sum to the total, within ±$10; tax reconciliation.
  • Internal consistency: wealth tier matches net-worth band; FICO consistent with payment history; risk tolerance matches allocation.
  • Richness: ≥3 life events, ≥2 stress scenarios, ≥2 goals; no placeholder text; narrative fields >20 words.
  • Archetype fidelity: income/assets/net-worth in archetype range; occupation/industry consistent; geography accurate.
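
The stated thresholds map naturally onto small check functions that each return a list of warnings. The following is a minimal sketch, not WealthSchema's actual implementation: the field names (expense_components, total_expenses, life_events, stress_scenarios, goals, narratives) and the warning message format are assumptions made for illustration. It covers the expense-sum rule and the countable richness rules; placeholder-text detection is omitted because the source does not say how it is defined.

```python
from typing import Any

EXPENSE_TOLERANCE = 10.0  # dollars, from the ±$10 expense rule
RICHNESS_MINIMUMS = {"life_events": 3, "stress_scenarios": 2, "goals": 2}
NARRATIVE_WORD_FLOOR = 20  # narrative fields must have MORE than 20 words


def check_expenses(household: dict[str, Any]) -> list[str]:
    """Arithmetic tier: expense components must sum to the total within ±$10."""
    component_sum = sum(household["expense_components"].values())
    total = household["total_expenses"]
    delta = abs(component_sum - total)
    if delta > EXPENSE_TOLERANCE:
        return [
            f"arithmetic.expenses: components sum to {component_sum:.2f}, "
            f"total is {total:.2f} (delta {delta:.2f})"
        ]
    return []


def check_richness(household: dict[str, Any]) -> list[str]:
    """Richness tier: minimum list lengths and narrative word counts."""
    warnings = []
    for field_name, minimum in RICHNESS_MINIMUMS.items():
        count = len(household.get(field_name, []))
        if count < minimum:
            warnings.append(
                f"richness.{field_name}: {count} present, at least {minimum} required"
            )
    for name, text in household.get("narratives", {}).items():
        words = len(text.split())
        if words <= NARRATIVE_WORD_FLOOR:
            warnings.append(
                f"richness.narrative.{name}: {words} words, "
                f"more than {NARRATIVE_WORD_FLOOR} required"
            )
    return warnings


# Hypothetical household with three problems: expenses are off by $50,
# only two life events are present, and the narrative is two words long.
household = {
    "expense_components": {"housing": 3000.0, "food": 900.0, "transport": 450.0},
    "total_expenses": 4400.0,
    "life_events": ["marriage", "home purchase"],
    "stress_scenarios": ["job loss", "market drop"],
    "goals": ["retire at 62", "fund college"],
    "narratives": {"summary": "Short text."},
}
all_warnings = check_expenses(household) + check_richness(household)
```

Returning warnings as a list rather than raising on the first problem lets a single pass report every issue in a household, which is useful when the household is queued for retry or manual review.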

Any warning at any level flips validation_passed to false; there is no partial credit. This was the v3 retrospective's biggest fix: the v3 prototype shipped with validation_passed: true alongside warnings, producing data that looked clean but had embedded inconsistencies. The v4 strict-fail policy tolerates no warnings: failed households are automatically retried on Opus (more expensive but higher quality), and households that fail a second time route to manual review.
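
The retry-then-escalate flow can be sketched as a small control function. This is an illustration under stated assumptions: the generators and the validator are passed in as plain callables (all hypothetical), with the fallback callable standing in for the more expensive Opus retry, and the status labels are invented for this example.

```python
from typing import Callable

Household = dict
Generator = Callable[[str], Household]        # hypothetical: household_id -> household
Validator = Callable[[Household], list[str]]  # returns warnings; empty list means clean


def generate_with_strict_fail(
    household_id: str,
    primary: Generator,
    fallback: Generator,
    validate: Validator,
) -> tuple[str, Household, list[str]]:
    """Generate one household under strict-fail: any warning is a failure.

    Returns (status, household, warnings), where status is one of
    "passed", "passed_on_retry", or "manual_review".
    """
    household = primary(household_id)
    warnings = validate(household)
    if not warnings:
        return "passed", household, []

    # First failure: retry exactly once on the higher-quality generator.
    household = fallback(household_id)
    warnings = validate(household)
    if not warnings:
        return "passed_on_retry", household, []

    # Second failure: stop retrying and hand the household to a human.
    return "manual_review", household, warnings
```

Capping the retry at one attempt keeps the cost of a persistently failing household bounded, which matches the policy of routing second failures to manual review rather than looping.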

Validation runs at three stages of the pipeline: per-household after generation, per-bundle after overlay completion, and per-corpus before catalog promotion. Validation accounts for roughly 15–20% of total pipeline cost, but it is the difference between a usable synthetic dataset and one that buyers reject during pilot evaluation.

  1. Stage 1 — Per-household
    Arithmetic + consistency + richness + fidelity
    After each household generation; failures auto-retry on Opus; second failures to manual queue.
  2. Stage 2 — Per-bundle
    Overlay-bundle integration validation
    After overlay completion; ensures bundle-specific fields align with universal-core data.
  3. Stage 3 — Per-corpus
    Catalog promotion validation
    Before promoting to production catalog; ensures bundle-level distributions are correct and no household errors slipped through.
Why this matters for synthetic data

This term is the meta-term for the validation framework itself. Synthetic test data that describes the framework's output should track, for each household: the validation_passed flag, the list of warnings (if any), the severity, and the validation tier (arithmetic / consistency / richness / fidelity).
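
One possible shape for such a per-household record is sketched below. The class names, the household_id field, and the choice to store severity as a free-form string on each warning are assumptions; the source names the fields to track but not their types or a severity scale. The one deliberate design choice is that validation_passed is computed from the warning list rather than stored, so a record cannot report a pass alongside warnings.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    ARITHMETIC = "arithmetic"
    CONSISTENCY = "consistency"
    RICHNESS = "richness"
    FIDELITY = "fidelity"


@dataclass(frozen=True)
class ValidationWarning:
    tier: Tier
    severity: str  # assumed free-form label; the source defines no severity scale
    message: str


@dataclass
class ValidationRecord:
    household_id: str  # hypothetical identifier field
    warnings: list[ValidationWarning] = field(default_factory=list)

    @property
    def validation_passed(self) -> bool:
        # Strict-fail: derived from the warnings, never set independently.
        return not self.warnings


record = ValidationRecord("hh-0001")
record.warnings.append(
    ValidationWarning(Tier.RICHNESS, "high", "richness.goals: 1 present, at least 2 required")
)
```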

Common pitfalls

  • Treating warnings as informational — under strict-fail policy, any warning is a failure.
  • Validating after the entire dataset is generated rather than per-household — wastes generation budget on already-failed households.
  • Skipping richness validation — produces generated data that looks correct but has placeholder text or thin narratives.
  • Forgetting archetype-fidelity validation — produces households outside their archetype's distribution, reducing the dataset's training value.

Examples

Failed validation at the arithmetic gate

The generated household has $2.4M in assets and $0.8M in liabilities but reports a net_worth of $1.7M; the derived value is $1.6M. The $100K difference far exceeds the ±$100 arithmetic tolerance, so validation raises the warning 'arithmetic.net_worth: 1.7M reported vs 1.6M derived (delta 100K)' and sets validation_passed = false. The household is auto-retried on Opus; if it fails a second time, it routes to the manual review queue.
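
The same example can be reproduced in a few lines. This is a sketch with two assumptions: the function name check_net_worth is hypothetical, and a delta of exactly $100 is treated as within tolerance. The message prints full dollar amounts rather than the abbreviated "1.7M" form shown in the warning text.

```python
NET_WORTH_TOLERANCE = 100  # dollars, from the ±$100 arithmetic rule


def check_net_worth(assets: int, liabilities: int, reported: int) -> list[str]:
    """Arithmetic gate: reported net worth must match assets minus liabilities."""
    derived = assets - liabilities
    delta = abs(reported - derived)
    if delta > NET_WORTH_TOLERANCE:
        return [
            f"arithmetic.net_worth: {reported:,} reported vs "
            f"{derived:,} derived (delta {delta:,})"
        ]
    return []


warnings = check_net_worth(assets=2_400_000, liabilities=800_000, reported=1_700_000)
validation_passed = not warnings  # False: one warning is enough to fail

print(warnings[0])
# arithmetic.net_worth: 1,700,000 reported vs 1,600,000 derived (delta 100,000)
```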