Synthetic data wins on privacy and bias control. Fraud detection is the one fintech ML domain where it cannot win alone. The reason is structural: adversarial signal lives in the long tail of transactions, the adversaries adapt, and a synthetic generator trained on legitimate-behavior distributions produces no realistic adversarial signal. A pure-synthetic fraud model under-performs in production by a margin that doesn't close with more data.
The production pattern that does work is 95/5: synthetic data for the legitimate-behavior majority, plus a small curated real-fraud holdout for the adversarial layer. Below: the data shapes, the access controls around the 5% real-fraud carve-out, and the validation gates that make the hybrid defensible to security and compliance review.
What fraud-detection ML actually has to do
A production fraud-detection model has multiple layers, each with different signal requirements:
| Layer | What it models | Best-suited data type |
|---|---|---|
| Transaction velocity | Multiple transactions in short windows | Synthetic: patterns are well-documented |
| Geo-anomaly | Transactions from unusual locations | Synthetic: geography is preservable |
| Device / behavioral | Typing patterns, device fingerprint, session sequences | Mostly synthetic + real device data |
| Amount / merchant patterns | Unusual amounts, unusual merchant categories for the cardholder | Synthetic with realistic merchant distribution |
| Account-takeover signal | Login from new device + change password + change ship address | Mostly real: adversarial sequence patterns are hard to synthesize |
| Synthetic-identity signal | Identity document inconsistencies, SSN-PII mismatches | Mostly real: adversarial identity construction adapts to defenses |
| First-payment-default signal | Patterns in the days/weeks before first defaulted payment | Mostly real: outcome data is required |
The split: legitimate-behavior modeling (the first four rows) is amenable to synthetic data. Adversarial-pattern modeling (the last three) needs real signal. A production model has both layers, and the data architecture has to provide both.
The 95/5 split
The architecture pattern that production fraud teams have converged on as of 2026:
- Layer 1: Legitimate-behavior synthetic corpus. 95%+ of the training data. Synthetic transactions, customers, devices, sessions. Captures normal-behavior baseline. Refreshable, scalable, no privacy or PCI scope.
- Layer 2: Adversarial real-data holdout. 5% or less of training data. Curated real fraud cases with the cardholder and identifying information stripped, retained in a high-control environment. Captures adversarial signal that synthetic can't reproduce.
- Layer 3: Production telemetry. Live signal from production used for ongoing model retraining. Subject to the same privacy and PCI controls as the adversarial holdout.
- Layer 4: Synthetic adversarial augmentation. Limited synthetic adversarial signal: known patterns synthesized to balance the training set. Useful for class balancing; not a replacement for real adversarial signal.
The legitimate-behavior layer is the largest by volume and the lowest by risk. It can scale freely. The adversarial layer is small, sensitive, and access-controlled.
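The sizing arithmetic behind the split can be sketched as follows. This is a minimal illustration that considers only Layers 1 and 2; the function names and the 3,000-case figure are assumptions chosen for the example, not values from any particular deployment.

```python
def synthetic_rows_needed(real_rows: int, real_share_pct: int = 5) -> int:
    """Smallest synthetic row count that keeps real rows at or below real_share_pct.

    Solves real / (real + synthetic) <= real_share_pct / 100 for synthetic.
    """
    if real_rows < 0 or not 0 < real_share_pct < 100:
        raise ValueError("real_rows must be >= 0 and real_share_pct in 1..99")
    synthetic_pct = 100 - real_share_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-real_rows * synthetic_pct // real_share_pct)


def real_share(real_rows: int, synthetic_rows: int) -> float:
    """Fraction of the combined training set that is real adversarial data."""
    return real_rows / (real_rows + synthetic_rows)


# Hypothetical holdout of 3,000 curated fraud cases.
needed = synthetic_rows_needed(3_000)
print(needed, real_share(3_000, needed))
```

Because the holdout is the scarce input, it makes sense to size the synthetic corpus from the holdout rather than the other way round.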
The synthetic legitimate-behavior corpus
For fraud training specifically, the legitimate-behavior corpus needs:
Synthetic legitimate-behavior corpus requirements
- Realistic transaction distributions per cardholder profile (mass-affluent vs working-class vs HNW have different patterns)
- Merchant category code distributions matching real-world frequencies (groceries dominant, gas/auto/restaurant secondary, etc.)
- Geo distributions matching real-world card-present and card-not-present patterns
- Time-of-day and day-of-week patterns (lunch hours, weekend vs weekday)
- Seasonality (Q4 holiday spikes, summer travel, tax-season anomalies)
- Multi-card households (typically 3-5 cards, with card-specific usage patterns)
- Recurring-transaction patterns (subscriptions, utilities, rent)
- Card-not-present vs card-present distributions per card type
A synthetic corpus that ships with these properties supports the bulk of fraud-model training without privacy or PCI exposure. Ours runs roughly $0.50 per synthetic cardholder-month — affordable for any fraud team needing diverse coverage.
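A few of the requirements above (category frequencies, time-of-day patterns, card-present vs card-not-present) can be sketched with a toy generator. Everything here is an assumption for illustration: the category weights, the hour weights, the log-normal amount parameters, and the `SyntheticTxn` shape are invented, and a real corpus would also need the per-profile, seasonal, and household structure listed above.

```python
import random
from dataclasses import dataclass

# Hypothetical category weights, chosen only to show "groceries dominant".
MCC_WEIGHTS = {
    "grocery": 0.40, "gas": 0.15, "restaurant": 0.15,
    "auto": 0.10, "subscription": 0.10, "other": 0.10,
}
# Hypothetical hour-of-day weights: quiet overnight, lunch-hour peak at 11-13.
HOUR_WEIGHTS = [1] * 6 + [3] * 5 + [6] * 3 + [4] * 4 + [5] * 3 + [2] * 3


@dataclass(frozen=True)
class SyntheticTxn:
    cardholder_id: str
    mcc_group: str
    hour: int
    amount: float
    card_present: bool


def generate_month(cardholder_id: str, n_txns: int, rng: random.Random) -> list:
    """Draw n_txns synthetic legitimate transactions for one cardholder."""
    groups = list(MCC_WEIGHTS)
    weights = list(MCC_WEIGHTS.values())
    hours = list(range(24))
    txns = []
    for _ in range(n_txns):
        group = rng.choices(groups, weights=weights)[0]
        hour = rng.choices(hours, weights=HOUR_WEIGHTS)[0]
        # Log-normal amounts give many small purchases and a thin right tail.
        amount = round(rng.lognormvariate(3.0, 0.8), 2)
        # Subscriptions are treated as always card-not-present in this sketch.
        card_present = group != "subscription" and rng.random() < 0.7
        txns.append(SyntheticTxn(cardholder_id, group, hour, amount, card_present))
    return txns


sample = generate_month("synthetic-0001", 40, random.Random(7))
```

Passing in a seeded `random.Random` keeps a corpus reproducible, which helps when a validation run has to be repeated against exactly the same data.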
The adversarial holdout architecture
The 5% adversarial layer is where the engineering and compliance work is harder.
The architecture pattern:
- Stage 1: Real fraud event capture. Production fraud detection identifies a confirmed fraud event. Full transaction data captured into a controlled queue.
- Stage 2: Tokenization at boundary. PAN tokenized, full PII removed. Behavioral signal preserved (amounts, merchants, sequences, device patterns).
- Stage 3: Curation and labeling. Human-reviewed labeling: adversarial pattern type, severity, related-event linkage. Labels are themselves valuable for training.
- Stage 4: Training-corpus addition. Tokenized + labeled cases added to the adversarial corpus, available to model training in the high-control environment.
- Stage 5: Refresh cadence. Quarterly or monthly refresh of the corpus as new fraud patterns emerge. Model retraining triggered on material changes.
The adversarial corpus is small (low thousands of cases at most for a major issuer, growing with new fraud waves) but disproportionately valuable — most of the model's discriminative signal comes from this layer.
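The tokenization-at-boundary stage can be sketched as below. This is an illustration under stated assumptions, not a PCI-validated design: a keyed HMAC stands in for whatever tokenization service is actually used, and the field names (`pan`, `amount`, `merchant_category`, `device_pattern`, `event_sequence`) are hypothetical. The sketch uses an allow-list, so any field not explicitly named as behavioral signal is dropped.

```python
import hashlib
import hmac

# Hypothetical allow-list of behavioral fields that survive the boundary.
BEHAVIORAL_FIELDS = ("amount", "merchant_category", "device_pattern", "event_sequence")


def tokenize_pan(pan: str, key: bytes) -> str:
    """Replace the PAN with a keyed, non-reversible token (HMAC-SHA256)."""
    return hmac.new(key, pan.encode("ascii"), hashlib.sha256).hexdigest()


def to_holdout_record(event: dict, key: bytes) -> dict:
    """Keep only allow-listed behavioral fields and a PAN token."""
    record = {name: event[name] for name in BEHAVIORAL_FIELDS if name in event}
    # The same PAN always maps to the same token, so related events stay linkable.
    record["pan_token"] = tokenize_pan(event["pan"], key)
    return record


# Illustrative event; the PAN is a well-known test number, not a real card.
event = {
    "pan": "4111111111111111",
    "cardholder_name": "Example Person",
    "email": "person@example.com",
    "amount": 249.99,
    "merchant_category": "electronics",
    "event_sequence": ["new_device_login", "password_change", "address_change"],
}
record = to_holdout_record(event, key=b"demo-key-kept-in-a-secrets-manager")
```

An allow-list fails closed: if production starts emitting a new identifying field, it never reaches the training environment unless someone deliberately adds it.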
What gets validated where
The validation pattern for hybrid fraud models:
| Validation type | Run against | What it catches |
|---|---|---|
| Detection-rate validation | Held-out real fraud + held-out real legitimate | Whether the model detects real fraud at acceptable false-positive rate |
| Synthetic-population sweep | Engineered synthetic cases at known difficulty levels | Whether the model handles edge cases: unusual amounts, unusual merchants, geo-velocity |
| Adversarial-pattern coverage | Curated real adversarial corpus | Whether the model has seen each documented attack pattern |
| Fair-impact validation | Synthetic cardholders across demographic spread | Whether the model has demographic-driven false-positive disparity |
Each validation has its own data requirement. Engines that run only the first validation (against real data) miss the engineered edge cases that production fraud will eventually exercise.
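As one illustration of the first row, detection rate at a capped false-positive rate can be computed from model scores alone. The function name, the score lists, and the 10% cap below are assumptions made for the example; real fraud programs usually work at far lower false-positive rates.

```python
def detection_rate_at_fpr(fraud_scores, legit_scores, max_fpr):
    """Return (detection_rate, threshold) with the legit FPR held at or below max_fpr.

    A transaction is flagged when its score is strictly above the threshold.
    """
    legit_sorted = sorted(legit_scores, reverse=True)
    allowed_false_positives = int(max_fpr * len(legit_sorted))
    if allowed_false_positives >= len(legit_sorted):
        threshold = float("-inf")  # every legitimate case may be flagged
    else:
        # At most `allowed_false_positives` legit scores lie strictly above this.
        threshold = legit_sorted[allowed_false_positives]
    detected = sum(score > threshold for score in fraud_scores)
    return detected / len(fraud_scores), threshold


# Toy held-out scores, invented for illustration.
legit = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
fraud = [0.99, 0.92, 0.5, 0.85]
rate, threshold = detection_rate_at_fpr(fraud, legit, max_fpr=0.10)
print(rate, threshold)  # 0.5 0.9
```

Fixing the false-positive budget first and reading off the detection rate mirrors how the table frames the question: detection at an acceptable false-positive rate, rather than a single blended accuracy number.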
Common implementation traps
Three patterns we see in fraud-detection deployments that produce post-deployment surprises:
Synthetic data with no adversarial layer. The team builds a clean synthetic corpus, trains a model, deploys it, and watches the model under-detect known fraud patterns. The synthetic distribution didn't include the adversarial tail; the model never learned to flag the patterns. Fix: add real adversarial holdout.
Real-data-only with no edge-case coverage. The team trains on production data only. The model performs well on patterns the production data contained and fails on patterns it didn't, particularly novel adversarial waves that the production data is too old to include. Fix: engineered synthetic edge cases for forward-looking robustness.
No fairness validation. The team uses real data without demographic awareness. The model learns to use ZIP-coded census proxies as fraud signal. The fraud detection works statistically and produces disparate-impact false positives on protected classes. Fix: synthetic demographic-controlled fairness validation.
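A fairness check of this kind can be sketched as a per-group false-positive comparison over synthetic legitimate cardholders. The group labels, counts, and function names below are invented for illustration, and the ratio is only one of several possible disparity measures.

```python
from collections import defaultdict


def false_positive_rates(records):
    """records: (group, flagged) pairs for cardholders known to be legitimate."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, is_flagged in records:
        total[group] += 1
        flagged[group] += int(is_flagged)
    return {group: flagged[group] / total[group] for group in total}


def max_fpr_ratio(rates):
    """Highest group FPR divided by the lowest; 1.0 means no disparity."""
    lowest, highest = min(rates.values()), max(rates.values())
    if highest == 0:
        return 1.0
    if lowest == 0:
        return float("inf")
    return highest / lowest


# Hypothetical sweep: 8 synthetic legitimate cardholders per group.
records = (
    [("group_a", True)] + [("group_a", False)] * 7
    + [("group_b", True)] * 4 + [("group_b", False)] * 4
)
rates = false_positive_rates(records)
print(rates, max_fpr_ratio(rates))
```

Because every record here is synthetic and legitimate by construction, any flag is a false positive, which is what makes a demographic-controlled set convenient for this measurement.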
The architecture that survives audit and production is the same architecture in both cases: a synthetic legitimate corpus large enough to scale the legitimate-behavior layer, an access-controlled adversarial holdout small enough to govern the privacy and PCI surface, demographic-controlled fairness sets generated alongside, and the validation gates separated by layer (velocity, device, amount, account-takeover, synthetic-identity) rather than collapsed into a single F1 score.
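The point about keeping gates separated by layer can be made concrete with a small sketch. The recall figures and the 0.80 floor are invented for illustration, and the layer names follow the list in the paragraph above; recall is used here as a stand-in for whichever per-layer metric a team actually gates on.

```python
def failing_layers(recall_by_layer, min_recall_by_layer):
    """Return the layers whose recall falls below that layer's own floor."""
    return sorted(
        layer
        for layer, floor in min_recall_by_layer.items()
        if recall_by_layer.get(layer, 0.0) < floor
    )


# Hypothetical per-layer recall from a validation run.
recall = {
    "velocity": 0.95,
    "device": 0.90,
    "amount": 0.92,
    "account_takeover": 0.40,
    "synthetic_identity": 0.88,
}
floors = {layer: 0.80 for layer in recall}

blended = sum(recall.values()) / len(recall)
print(round(blended, 2))               # 0.81, which clears a single 0.80 gate
print(failing_layers(recall, floors))  # ['account_takeover']
```

In this toy run the blended number passes while the account-takeover layer is clearly broken, which is the failure mode a collapsed score hides.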
Key takeaways
- Pure-synthetic fraud training underperforms because adversarial signal lives in the tail. The production answer is hybrid — 95% synthetic legitimate behavior, 5% real adversarial holdout.
- Synthetic legitimate-behavior corpora can scale freely, no privacy or PCI scope. Real adversarial corpora are small, access-controlled, and tokenized at the boundary.
- The model has multiple layers: velocity, geo, device, amount/merchant, account-takeover, synthetic-identity, first-payment-default. Different layers have different signal requirements.
- Validation requires four batteries: detection-rate, synthetic-population sweep, adversarial-pattern coverage, fair-impact. Engines running only one battery miss the others' bug classes.
- Three common traps: synthetic-only (under-detects adversarial), real-only (under-handles edge cases), and no-fairness (disparate false-positive impact). The hybrid architecture addresses all three.