Synthetic Data Validation
Synthetic data validation is the process of verifying that generated synthetic data meets schema, referential, and business-rule constraints before it is promoted to a production catalog. Validation typically covers four categories: arithmetic identities, internal consistency, narrative richness, and archetype fidelity. WealthSchema applies a strict-fail policy: any warning fails the household, with no exceptions.
Synthetic data validation is the difference between audit-grade datasets and toy datasets. Without strict validation, generated data accumulates errors: net-worth that doesn't match assets-minus-liabilities, FICO scores inconsistent with payment history, age inconsistent with life events, dollar amounts that don't reconcile across sections. Each error compounds when buyers depend on the data for backtesting; a single bad household corrupts batch-level analytics.
The WealthSchema validation framework operates at four levels:
- Arithmetic: net_worth = assets − liabilities, within ±$100; expense components sum to total, within ±$10; tax reconciliation.
- Internal consistency: wealth tier matches net-worth band; FICO consistent with payment history; risk tolerance matches allocation.
- Richness: ≥3 life events, ≥2 stress scenarios, ≥2 goals; no placeholder text; narrative fields >20 words.
- Archetype fidelity: income/assets/net-worth in archetype range; occupation/industry consistent; geography accurate.
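Two of the checks above, the expense-sum identity and the richness thresholds, can be sketched as follows. This is a minimal illustration, not the WealthSchema implementation: the thresholds come from the text, but the field names (`expense_components`, `total_expenses`, `narratives`, and so on), the placeholder marker list, and the warning message format are all assumptions.

```python
from typing import Any

# Thresholds taken from the text; every field name below is an assumption.
EXPENSE_TOLERANCE = 10  # dollars: components must sum to total within +/- $10
MIN_COUNTS = {"life_events": 3, "stress_scenarios": 2, "goals": 2}
MIN_NARRATIVE_WORDS = 20  # narrative fields need MORE than this many words
PLACEHOLDER_MARKERS = ("lorem ipsum", "placeholder", "tbd")  # assumed marker list


def check_expenses(household: dict[str, Any]) -> list[str]:
    """Arithmetic level: expense components must sum to the reported total."""
    component_sum = sum(household["expense_components"].values())
    delta = abs(component_sum - household["total_expenses"])
    if delta > EXPENSE_TOLERANCE:
        return [f"arithmetic.expenses: components differ from total by {delta}"]
    return []


def check_richness(household: dict[str, Any]) -> list[str]:
    """Richness level: minimum list sizes, no placeholders, no thin narratives."""
    warnings = []
    for field_name, minimum in MIN_COUNTS.items():
        count = len(household.get(field_name, []))
        if count < minimum:
            warnings.append(f"richness.{field_name}: {count} found, need at least {minimum}")
    for name, text in household.get("narratives", {}).items():
        lowered = text.lower()
        if any(marker in lowered for marker in PLACEHOLDER_MARKERS):
            warnings.append(f"richness.{name}: placeholder text")
        word_count = len(text.split())
        if word_count <= MIN_NARRATIVE_WORDS:
            warnings.append(f"richness.{name}: {word_count} words, need more than {MIN_NARRATIVE_WORDS}")
    return warnings
```

Each check returns a list of warning strings rather than a boolean, so a caller can concatenate the results from all four levels and treat any non-empty list as a failure.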
Any warning at any level flips validation_passed to false — no partial credit. This was the v3 retrospective's biggest fix: the v3 prototype shipped with validation_passed: true alongside warnings, producing data that looked clean but had embedded inconsistencies. The v4 strict-fail policy enforces no-warnings-tolerated; failed households are auto-retried on Opus (more expensive but higher quality), and second failures route to manual review.
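The retry-then-escalate flow described above might look like the sketch below. The generators and validator are passed in as plain callables because the source does not describe their interfaces; `generate_with_strict_fail`, the status strings, and the result dictionary shape are hypothetical names chosen for illustration.

```python
from typing import Callable

Household = dict
Generator = Callable[[str], Household]        # household_id -> generated household
Validator = Callable[[Household], list]       # household -> list of warning strings


def generate_with_strict_fail(
    household_id: str,
    primary: Generator,
    retry: Generator,
    validate: Validator,
) -> dict:
    """Generate once, retry once on the stronger model, then escalate to manual review."""
    warnings: list = []
    for label, generator in (("primary", primary), ("retry", retry)):
        household = generator(household_id)
        warnings = validate(household)
        if not warnings:
            # Strict-fail: a household passes only with zero warnings.
            return {
                "status": "accepted",
                "attempt": label,
                "validation_passed": True,
                "household": household,
            }
    # Both attempts produced warnings: route to the manual review queue.
    return {
        "status": "manual_review",
        "attempt": "retry",
        "validation_passed": False,
        "warnings": warnings,
    }
```

Because `validation_passed` is set only on the branch where the warning list is empty, this shape cannot emit the v3 combination of a passing flag alongside warnings.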
Validation runs at multiple stages of the pipeline: per-household after generation, per-bundle after overlay completion, per-corpus before catalog promotion. The cumulative validation effort is meaningful — roughly 15–20% of total pipeline cost — but is the difference between a usable synthetic dataset and one that buyers reject during pilot evaluation.
- Stage 1 (per-household): arithmetic + consistency + richness + fidelity. Runs after each household generation; failures auto-retry on Opus; second failures go to the manual queue.
- Stage 2 (per-bundle): overlay-bundle integration validation. Runs after overlay completion; ensures bundle-specific fields align with universal-core data.
- Stage 3 (per-corpus): catalog promotion validation. Runs before promoting to the production catalog; ensures bundle-level distributions are correct and no household errors slipped through.
This term is itself the meta-term for the validation framework. Synthetic test data describing the framework's output should track the following for each household: the validation_passed flag, the list of warnings (if any), severity, and validation tier (arithmetic / consistency / richness / fidelity).
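One possible shape for such a per-household record is sketched below. The class names, the `household_id` field, and the default severity label are assumptions; the source lists severity as a tracked field but does not enumerate its values, and attaching severity and tier to each warning is a design choice made here for illustration.

```python
import json
from dataclasses import asdict, dataclass, field

TIERS = ("arithmetic", "consistency", "richness", "fidelity")


@dataclass
class ValidationWarning:
    tier: str                   # one of TIERS
    message: str
    severity: str = "warning"   # assumed label; the source does not list severity values

    def __post_init__(self) -> None:
        if self.tier not in TIERS:
            raise ValueError(f"unknown validation tier: {self.tier}")


@dataclass
class ValidationRecord:
    household_id: str           # hypothetical identifier field
    warnings: list = field(default_factory=list)

    @property
    def validation_passed(self) -> bool:
        # Derived, never stored: any warning means the household failed.
        return not self.warnings

    def to_json(self) -> str:
        payload = asdict(self)
        payload["validation_passed"] = self.validation_passed
        return json.dumps(payload, sort_keys=True)
```

Deriving `validation_passed` from the warning list, instead of storing it as an independent field, keeps the flag and the warnings from drifting apart.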
Common pitfalls
- Treating warnings as informational — under strict-fail policy, any warning is a failure.
- Validating after the entire dataset is generated rather than per-household — wastes generation budget on already-failed households.
- Skipping richness validation — produces generated data that looks correct but has placeholder text or thin narratives.
- Forgetting archetype-fidelity validation — produces households outside their archetype's distribution, reducing the dataset's training value.
Examples
Generated household: $2.4M assets, $0.8M liabilities, reported net_worth $1.7M (should be $1.6M). The $100K difference exceeds the ±$100 arithmetic tolerance. Validation flags the warning 'arithmetic.net_worth: 1.7M reported vs 1.6M derived (delta 100K)' and sets validation_passed = false. The household is auto-retried on Opus; on a second failure it routes to the manual review queue.
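The example above can be reproduced with a short sketch. The function names and the compact dollar formatting are assumptions made to match the warning string quoted in the example; only the ±$100 tolerance and the figures come from the text.

```python
NET_WORTH_TOLERANCE = 100  # dollars, per the +/- $100 arithmetic rule


def format_dollars(amount: float) -> str:
    """Render amounts compactly, e.g. 1_700_000 as '1.7M' and 100_000 as '100K'."""
    if abs(amount) >= 1_000_000:
        return f"{amount / 1_000_000:.1f}M"
    if abs(amount) >= 1_000:
        return f"{amount / 1_000:.0f}K"
    return f"{amount:.0f}"


def check_net_worth(assets: int, liabilities: int, reported: int) -> list:
    """Return a warning if reported net worth differs from assets minus liabilities."""
    derived = assets - liabilities
    delta = abs(reported - derived)
    if delta <= NET_WORTH_TOLERANCE:
        return []
    return [
        f"arithmetic.net_worth: {format_dollars(reported)} reported vs "
        f"{format_dollars(derived)} derived (delta {format_dollars(delta)})"
    ]


warnings = check_net_worth(assets=2_400_000, liabilities=800_000, reported=1_700_000)
validation_passed = not warnings
print(warnings[0])        # arithmetic.net_worth: 1.7M reported vs 1.6M derived (delta 100K)
print(validation_passed)  # False
```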