Synthetic wealth data for ML engineers — training, validating, and auditing financial models without real customer records

WealthSchema Staff | Financial ML methodology | May 25, 2026 | 5 min read

Financial ML (credit, underwriting, fraud, recommendations, planning, AML, segment classification) sits in a corner of the ML universe with three properties most other ML doesn't share: real training data is harder to get, regulators care about what you trained on, and a model that learned the wrong thing can cost eight or nine figures to unwind. As a result, synthetic data shows up earlier here and matters more than it does in lower-scrutiny domains.

Below: the regulatory landscape that's reshaped what real customer data financial ML can use, the use cases where synthetic data is structurally superior (not just a privacy workaround), the validation approaches that preserve production generalization, and the four failure modes that distinguish naive practice from defensible practice.

The regulatory environment ML engineers should know about

Three regulatory developments, plus one illustrative CFPB circular, reshape the ML-data landscape in financial services more than any technical change of the last five years.

SR 11-7

Federal Reserve model risk guidance

Three pillars: conceptual soundness, ongoing monitoring, outcomes analysis. Increasingly applied to ML/AI models in regulated FIs. Training-data documentation, dev-team-independent validation, reproducible benchmark monitoring.

Reg. 2024/1689

EU AI Act

High-risk system requirements are scheduled to apply from August 2026. Credit scoring and life/health insurance pricing are classified high-risk. Triggers training-data quality, provenance, characteristics, and statistical-property documentation.

Circular 2023-03

CFPB adverse-action notification

Creditors using complex algorithms (incl. ML) cannot satisfy ECOA's adverse-action notice by citing the algorithm. Specific reasons must be provided — constraining model architecture and training-data prep upstream.

Circular 2024-05

CFPB overdraft opt-in (illustrative)

Doesn't directly address ML, but illustrates the broader pattern of CFPB attentiveness to algorithmic decisioning at the seam where automated decisions affect consumer outcomes.

These four pieces of regulatory framing don't say "you must use synthetic data." They do say, collectively, that real-customer-data usage in ML training carries documentation, audit, and explainability obligations that synthetic data, used with appropriate methodology, can simplify. Most teams arrive at synthetic data not because regulators require it but because the documentation cost of not using it becomes prohibitive. See SR 11-7 model risk synthetic data and the broader case for AI/ML training data in finance moving off production data.

Where synthetic data is structurally superior to production data for ML

There's a simplistic view of synthetic-data-for-ML that holds: production data is the gold standard, and synthetic data is a privacy-preserving substitute when production data isn't available. That framing under-sells the cases where synthetic data is genuinely the right tool.

Coverage of rare classes. Most financial ML problems have severe class imbalance. Fraud is roughly 0.1–1% of transactions. Default is 1–5% of borrowers depending on segment. Charge-back fraud, AML-flagged activity, account takeover — all rare. Models trained on natural-rate production data require sophisticated rebalancing, oversampling, or focal-loss techniques to learn the rare class. Synthetic data lets you specify the class distribution. You can train on a 50/50 fraud/non-fraud corpus, validate on production-rate-distributed data, and reason explicitly about the rebalancing transformation.
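The rebalancing transformation mentioned above can be written down explicitly. The sketch below is one common form, the prior-shift correction, and it assumes that only the class rate differs between the balanced training corpus and production (the feature distribution within each class is taken to be the same). The function name and the 1% deployment rate are illustrative assumptions, not part of any particular library or of our corpus tooling.

```python
def correct_for_prior_shift(score, train_rate, deploy_rate):
    """Map a score from a model trained at `train_rate` positives to the
    probability implied when positives occur at `deploy_rate`.

    Assumes pure prior (label) shift: class-conditional feature
    distributions are identical in training and deployment.
    """
    for rate in (train_rate, deploy_rate):
        if not 0.0 < rate < 1.0:
            raise ValueError("class rates must be strictly between 0 and 1")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability")
    # Reweight each class by the ratio of deployment prior to training prior.
    pos = score * (deploy_rate / train_rate)
    neg = (1.0 - score) * ((1.0 - deploy_rate) / (1.0 - train_rate))
    return pos / (pos + neg)


# A score of 0.90 from a model trained on a 50/50 corpus, deployed where
# fraud is assumed (hypothetically) to be 1% of transactions.
adjusted = correct_for_prior_shift(0.90, train_rate=0.50, deploy_rate=0.01)
print(f"adjusted probability: {adjusted:.4f}")
```

Under these assumptions a raw score of 0.90 corresponds to a deployment probability of roughly 0.083, which is why thresholds tuned on the balanced corpus should not be carried into production unchanged.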

Coverage of edge demographic segments. Production data under-represents demographic populations the firm doesn't serve well. A national lender that's historically under-served minority applicants in certain geographies has production data that reflects that under-service. Models trained on it learn the historical pattern. Synthetic data lets you train on a demographically-balanced corpus, with the demographic distribution explicitly engineered rather than inherited from historical bias.

Counterfactual and adversarial training. Models robust to adversarial inputs are routinely trained on synthetic adversarial examples. The same approach extends to counterfactual training: generating synthetic examples that differ from real cases in specific dimensions to teach the model to discriminate on the right features. CFPB Circular 2023-03's interpretability requirements push in this direction. Related: Reg B fair lending testing.

Time-series scenarios that haven't happened. Recession-stress scenarios, interest-rate-regime changes, demographic shifts. Production data only contains the time series that actually occurred. Synthetic time-series generation lets you train and validate against scenarios you specify.

Cross-firm or cross-product portability. A model trained on Firm A's customers may generalize poorly to Firm B's customers because the customer distributions differ. Synthetic data calibrated to industry-standard distributions produces models that generalize more cleanly across firms.

Regulatory-test scenarios. Models intended to satisfy a specific regulatory test (Reg BI suitability, fair-lending Reg B, AML typology coverage) need training and validation data that exercises the test patterns at sufficient density. Production data rarely has those patterns at sufficient density; synthetic data can be constructed to.

These six cases are not the only legitimate uses of synthetic data in financial ML, but they are the cases where synthetic data is structurally the right answer rather than a privacy-preserving compromise.

The four common failure modes

Synthetic-data-for-ML done badly produces failure modes that real-data practitioners don't always recognize until production. Four to be alert to:

Failure mode	What it looks like	Mitigation
1. Train ≠ deploy distribution	Model trained on balanced synthetic, rebalanced in prediction layer, deployed to production. Calibration is wrong because rebalancing math was wrong or production shifted.	Validation set must reflect deployment distribution; calibration layer tested against deployment-distribution data.
2. Synthetic-data telltales	Round-number balances, unusual digit distributions, missing-value patterns the model learns. Production lacks these telltales and the model behaves erratically.	Statistical-fingerprint testing against production samples as part of corpus QA.
3. Covariate shift	Corpus calibrated against 2022 wealth distribution, model deployed in 2026 after demographic and asset-distribution change. Silent miscalibration.	Refresh corpora on a documented cadence, at least as frequent as the model's retraining cadence.
4. Validation-set leakage	Validation set 'held out' but generated from the same archetype specifications. Model validates well, generalizes poorly.	Hold out specifications, not just records — different generation specs for training vs. validation.
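As an illustration of the statistical-fingerprint testing named for failure mode 2, the sketch below compares two simple telltales between a synthetic sample and a production sample: the share of round-number balances and the leading-digit distribution. The function names, the choice of 1,000 as the rounding unit, and the sample values are assumptions made for the example; a real corpus QA suite would cover many more fingerprints.

```python
from collections import Counter


def round_number_share(balances, multiple=1000):
    """Fraction of non-zero balances that are exact multiples of `multiple`."""
    nonzero = [b for b in balances if b != 0]
    if not nonzero:
        return 0.0
    return sum(1 for b in nonzero if b % multiple == 0) / len(nonzero)


def leading_digit_distribution(values):
    """Relative frequency of leading digits 1-9.

    Values are truncated to integers first, so anything with absolute
    value below 1 is skipped.
    """
    digits = [int(str(abs(int(v)))[0]) for v in values if int(v) != 0]
    counts = Counter(digits)
    total = len(digits)
    if total == 0:
        return {d: 0.0 for d in range(1, 10)}
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def total_variation(p, q):
    """Total variation distance between two leading-digit distributions."""
    return 0.5 * sum(abs(p[d] - q[d]) for d in range(1, 10))


# Hypothetical samples: the synthetic balances are suspiciously round.
synthetic = [5000, 12000, 3000, 7500, 20000]
production = [5123, 11870, 2994, 7431, 20417]

share_gap = round_number_share(synthetic) - round_number_share(production)
digit_gap = total_variation(
    leading_digit_distribution(synthetic),
    leading_digit_distribution(production),
)
print(f"round-number share gap: {share_gap:.2f}")
print(f"leading-digit distance: {digit_gap:.2f}")
```

A large gap on either measure suggests the model could learn to recognize the corpus rather than the customer; acceptable thresholds are a judgment call and are not specified here.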

A defensible synthetic-data ML training methodology

1. Specify the target population explicitly. Demographic, asset, behavior, and life-stage distributions the model needs to perform on. Your training-distribution target.
2. Specify the deployment distribution. What does production look like at inference time? Your validation-distribution target. May or may not match the training target.
3. Generate training corpus from documented specifications. Explicit archetype, life-stage, behavioral specs. Version-control them. Separate specifications (source-of-truth) from records (realized output).
4. Generate validation corpus from a different specification set. Genuinely held out: specifications, not just records. Tests generalization to unseen patterns, not just unseen records.
5. Generate a deployment-distribution validation corpus. Matches production's actual distribution. Used to validate calibration after rebalancing transformations.
6. Document the data provenance. Per EU AI Act training-data documentation requirements: source, generation method, specifications, statistical properties, validation method.
7. Validate on production samples where feasible. Even with strong synthetic data, a final validation against held-out production samples (under appropriate governance) catches calibration drift.
8. Re-train on a documented cadence. Synthetic corpora used for training should refresh at least as frequently as the model retrains. Tax law, demographic shifts, market regime changes all require corpus refresh.
9. Maintain a benchmark set across versions. Persistent benchmark corpus that doesn't change between model versions, allowing direct version-over-version performance comparison.

This methodology preserves generalization to production while satisfying SR 11-7 documentation, EU AI Act provenance requirements, and CFPB interpretability obligations.
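Holding out specifications rather than records (steps 3 and 4) can be enforced mechanically. The following is a minimal sketch, assuming each generated record carries the identifier of the specification that produced it; the `spec_id` field name and the hash-based assignment are illustrative choices, not a description of how any particular corpus is built.

```python
import hashlib


def assign_split(spec_id, validation_fraction=0.2):
    """Deterministically assign a generation spec to 'train' or 'validation'.

    Hashing the spec identifier keeps the assignment stable across runs
    and across corpus refreshes, so a spec never migrates between splits.
    """
    digest = hashlib.sha256(spec_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "validation" if bucket < validation_fraction else "train"


def split_records_by_spec(records, validation_fraction=0.2):
    """Split records so that no spec_id appears on both sides."""
    splits = {"train": [], "validation": []}
    for record in records:
        split = assign_split(record["spec_id"], validation_fraction)
        splits[split].append(record)
    return splits


# Hypothetical corpus: 100 households generated from 10 archetype specs.
records = [
    {"spec_id": f"archetype-{i % 10}", "household": i} for i in range(100)
]
splits = split_records_by_spec(records)
train_specs = {r["spec_id"] for r in splits["train"]}
validation_specs = {r["spec_id"] for r in splits["validation"]}
print("shared specs:", train_specs & validation_specs)
```

With only a handful of specifications a hash split can leave one side empty, so in practice the assignment may need to be curated by hand and then version-controlled alongside the specifications themselves.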

What we'd test against

A defensible ML training pipeline using synthetic data should pass these structural tests:

Seven structural tests for a synthetic-data-trained financial model

Production-distribution validation: hold out a production sample; performance on it should match performance on the synthetic deployment-distribution validation set within tolerance.
Counterfactual sensitivity: generate counterfactual examples varying protected attributes (race, gender, geography) while holding other features constant. Predictions should not vary in fair-lending-violating directions.
Rare-class detection: generate examples of the rare class at varied feature configurations. Detection rates benchmarked against the regulatory scenario the model is meant to satisfy.
Temporal robustness: generate examples drawn from time periods or scenarios outside the training data's range. Quantify performance degradation; document domain-of-validity.
Adversarial robustness: generate examples designed to elicit incorrect predictions through subtle perturbations. The model should be robust within natural variation and should not be confidently wrong on any adversarial example.
Explainability under load: per prediction, produce an adverse-action-suitable explanation consistent with actual feature contributions.
Fairness across protected segments: performance metrics broken down by protected demographic segment. Disparities quantified, documented, with mitigation strategies where they exist.
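The counterfactual sensitivity test in the list above lends itself to a short sketch. It assumes the model is exposed as a function from a feature dictionary to a score, and it only detects direct dependence on the varied attribute; proxy effects through correlated features need separate analysis. The toy models, field names, and values are hypothetical.

```python
def counterfactual_gaps(predict, records, attribute, values):
    """For each record, score copies that differ only in `attribute`
    and return the spread (max minus min) of the resulting predictions."""
    gaps = []
    for record in records:
        # Copy the record, overriding only the attribute under test.
        scores = [predict({**record, attribute: value}) for value in values]
        gaps.append(max(scores) - min(scores))
    return gaps


# Two toy scoring functions standing in for a trained model.
def fair_model(record):
    return min(1.0, record["income"] / 100_000)


def biased_model(record):
    penalty = 0.1 if record["region"] == "B" else 0.0
    return fair_model(record) - penalty


applicants = [
    {"income": 40_000, "region": "A"},
    {"income": 80_000, "region": "B"},
]

fair_gaps = counterfactual_gaps(fair_model, applicants, "region", ["A", "B"])
biased_gaps = counterfactual_gaps(biased_model, applicants, "region", ["A", "B"])
print("max gap, fair model:", max(fair_gaps))
print("max gap, biased model:", max(biased_gaps))
```

A non-zero gap is not automatically a fair-lending violation; it is a flag that the prediction moves with the protected attribute and needs to be explained or mitigated.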

Our Master Corpus and the specialized packs (especially the Behavioral Finance Coaching Pack, the CRA / Underserved Lending Pack, and the Insurance Claims & Cybersecurity Risk Pack) are designed to support these tests. The Master Corpus's longitudinal structure (96 monthly snapshots per household) supports the temporal-robustness testing in particular. See longitudinal snapshots financial data.

The "did we test it on real data" question

A common stakeholder challenge for synthetic-data-trained models: "Did you ever test it on real data?"

The right answer is usually "yes, on a held-out production sample, with the results documented." A model whose only validation is against synthetic data has a generalization-to-production claim that depends entirely on the quality of the synthetic data — a credible but harder-to-defend claim. A model that has been validated on a held-out production sample, even a small one, has a much stronger generalization claim.

Train on synthetic; validate on synthetic; final-stage validate on a held-out production sample. The pattern preserves the regulatory benefits of synthetic-data training while answering the 'did we test it on real data' question.
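One way to make "within tolerance" concrete is to compute the same metric on the synthetic deployment-distribution validation set and on the held-out production sample, then compare. The sketch below uses the Brier score purely as an example metric; the metric, the tolerance values, and the toy predictions are assumptions, and the appropriate choices depend on the model and its governance requirements.

```python
def brier_score(probabilities, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


def within_tolerance(synthetic_metric, production_metric, tolerance):
    """True when the two validation results agree to within `tolerance`."""
    return abs(synthetic_metric - production_metric) <= tolerance


# Hypothetical predictions on the two validation sets.
synthetic_brier = brier_score([0.1, 0.8, 0.2, 0.9], [0, 1, 0, 1])
production_brier = brier_score([0.2, 0.7, 0.4, 0.6], [0, 1, 0, 1])

agrees = within_tolerance(synthetic_brier, production_brier, tolerance=0.05)
print(f"synthetic={synthetic_brier:.4f} production={production_brier:.4f}")
print("within tolerance:", agrees)
```

In this toy case the production score is noticeably worse than the synthetic one, which is exactly the situation the final-stage production check is meant to surface before deployment.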

When synthetic data is the wrong answer

To be honest about the limits, synthetic data is the wrong answer when:

The model needs to learn patterns specific to the firm's actual customer base that aren't captured by industry-distribution-calibrated synthetic data
The deployment context requires highly firm-specific calibration
The use case is one where production data is genuinely available, well-labeled, and sufficient — and the regulatory documentation cost is acceptable
The synthetic-data corpus available doesn't cover the specific patterns the model needs

For those cases, schema-preserving synthesis (Tonic, MOSTLY AI, Gretel, SDV) of production data is the right answer, not archetype-driven synthetic data. We've written about the maturity curve that distinguishes these.

Closing

Financial ML in 2026 is an explanation-and-documentation game running on top of a prediction game. The model's accuracy matters; the model's explainability under CFPB Circular 2023-03, its fairness under Reg B, its provenance under SR 11-7, and its audit defensibility under the EU AI Act each individually matter at least as much. Synthetic data with the methodology above simplifies several of those obligations directly — coverage of rare classes, demographic balance, counterfactual training, scenario time-series — and the simplification compounds because each obligation feeds the same training-data documentation package.

For a financial-ML pipeline being stood up in 2026, the right framing is not "use synthetic where production data is unavailable." It's: use synthetic as the structural lever for the coverage, fairness, and explainability properties production data was never going to produce, and reserve real-customer data for the held-out final-stage validation where it's still the right answer.

The Master Corpus is the full 1,451-household, 96-month-snapshot artifact that powers our specialized packs. The free sample on GitHub lets you inspect the schema and the household structure before any commitment.

Key takeaways

SR 11-7, the EU AI Act, and CFPB Circular 2023-03 collectively raised the documentation cost of training financial ML on real customer data — synthetic data is increasingly the path of least documentation friction.
Six use cases where synthetic data is structurally superior (rare classes, demographic balance, counterfactuals, scenario time-series, cross-firm portability, regulatory-test density), not a privacy-preserving compromise.
Hold out specifications, not just records, when constructing the validation set — same-spec held-out records produce the synthetic-data analogue of train/test leakage.
Train on synthetic, validate on synthetic, final-stage validate on a held-out production sample. This pattern preserves regulatory benefits while answering the 'did we test it on real data' question.
Refresh synthetic corpora on the model's retrain cadence: covariate shift between calibration year and deployment year is one of the most common sources of silent miscalibration.

This document is general guidance for ML engineers building financial models. It is not legal advice. Models intended to satisfy specific regulatory requirements must be validated against those requirements with appropriate counsel.