Fidelity, privacy, and utility — where the synthetic-data trade-offs actually live

WealthSchema Staff | Synthetic data, R&D | May 8, 2026 | 6 min read

The standard framing of synthetic data is a trilemma. You can have fidelity (the data looks like the real thing), privacy (the data leaks nothing about real individuals), and utility (the data is useful for downstream tasks). Pick two.

The framing is approximately right and substantively wrong. All three are in tension; that part is right. The tension does not live where the textbook trilemma puts it (in one algorithmic choice), but upstream in three separable knobs across the generation pipeline. Move any one knob and at least one of fidelity, privacy, or utility shifts. The dataset a team ends up with is a function of which knobs they set deliberately and which they left at defaults.

What each term means in practice

The textbook definitions are tidy. The production definitions are slightly different.

Fidelity is not "the synthetic record looks like a real record." It is "the joint distribution of fields in the synthetic dataset matches the joint distribution in the population the dataset claims to represent." A synthetic record can look perfectly plausible in isolation and still degrade dataset-level fidelity if the joint distribution is wrong.

Privacy is not "no PII appears in the dataset." It is "no real individual can be linked to a synthetic record by any joinable combination of fields, even using auxiliary data." A dataset with zero literal PII can still leak if the joint distribution is so close to the source data that membership inference attacks succeed.

Utility is not "models trained on synthetic data perform similarly to models trained on real data." It is "downstream consumers of the dataset achieve their use-case-specific goals at the precision they require." A dataset can be high-utility for one use case (training a fraud-detection feature extractor) and low-utility for another (validating a tax-aware withdrawal sequencing engine) at the same fidelity level.

Why the trilemma is real

There is genuine tension in the math, and the tension does not disappear because we are unhappy with it.

The tightest formal result is from differential privacy: any data release that preserves more than k bits of information about an individual record reduces the achievable privacy budget by an amount that depends on k. Differential privacy ε is measured in nats; producing a dataset where the marginal distribution of account_balance is preserved to two decimal places costs roughly 2-3 ε per record per field. A serious DP-synthetic pipeline for wealth data, with hundreds of fields per record and tight fidelity targets, has to operate at ε > 50 just to be useful — well outside the range any privacy researcher would call meaningfully private.

This is not a fixable engineering problem; it is a property of the math. If you want population-level joint-distribution fidelity sufficient for tax engines and retirement planners, formal differential privacy is not your tool.
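A minimal sketch of why the budget grows so quickly, assuming basic sequential composition (a standard differential privacy result under which the ε values of separate releases add up). The per-field cost of 2 and the count of 100 fields are illustrative values taken from the low end of the figures above, not measurements, and the function names are hypothetical.

```python
import math

def composed_epsilon(per_field_epsilons):
    """Total budget under basic sequential composition: the epsilons simply add."""
    return sum(per_field_epsilons)

def max_likelihood_ratio(epsilon):
    """Worst-case ratio exp(epsilon) by which one individual's presence
    can change the probability of any output."""
    return math.exp(epsilon)

# Illustrative assumption: 100 fields, each released at epsilon = 2.
total = composed_epsilon([2.0] * 100)
print(f"total epsilon = {total}")
print(f"likelihood-ratio bound at epsilon = 1: {max_likelihood_ratio(1.0):.2f}")
print(f"likelihood-ratio bound at total: {max_likelihood_ratio(total):.3e}")
```

Basic composition is the loosest accounting; tighter composition theorems reduce the total, but they do not change the fact that it grows with the number of fields released.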

The trade-off is real. It just doesn't live where the textbook puts it.

Where the trade-offs actually live

In production, three separable knobs control the joint optimization across fidelity, privacy, and utility. Each knob has different consequences, and the knobs do not move together.

Knob	What it controls	Fidelity effect	Privacy effect	Utility effect
Privacy budget (ε)	Linkability of individual records to source data	Strong: high ε → high fidelity	Use-case dependent	Use-case dependent
Distributional smoothing	Re-identification via tail inference	Erodes long-tail accuracy	Ships smoothed distributions	Hurts edge-case use cases
Edge-case retention	Linkability of rare records	High retention → high tail fidelity	High retention → potential leakage of rare individuals	Critical for edge-case-dependent engines

The first knob, privacy budget, is what most of the literature focuses on. It is the cleanest mathematically and the least useful in production wealth-tech, because the budget required for useful fidelity is too high to count as private.

The second knob, distributional smoothing, is what most commercial vendors actually use under the hood whether they call it that or not. They produce records by sampling from learned marginals or low-dimensional joint distributions and lose long-tail accuracy in the process. The dataset looks fine — until you query a UHNW carry-forward or an immigrant household with a wire-transfer pattern that lives in the tail.

The third knob, edge-case retention, is where production fintech synthetic data actually has to make hard choices. The rare records that your engine needs to handle (UHNW, multi-state filers, IRMAA bracket transitions, foreign-tax-credit cases) are also the records most at risk of inadvertent linkage to real individuals — because there are only a few thousand of them in any plausible source population. Retain the edges and you risk leakage; smooth them and you ship a dataset that fails on the cases that matter.
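The smoothing failure described for the second knob can be illustrated with a toy example. This is a sketch under invented assumptions: the two-field records, the segment names, and the 90/10 split are made up for illustration, and "smoothing" is modeled as sampling each field from its own marginal.

```python
from collections import Counter
from itertools import product

# Toy source data (invented): (segment, has_loss_carry_forward).
records = [("mass_affluent", False)] * 90 + [("uhnw", True)] * 10

def joint(records):
    """Empirical joint distribution over the observed field combinations."""
    n = len(records)
    return {combo: count / n for combo, count in Counter(records).items()}

def product_of_marginals(records):
    """Distribution obtained by sampling each field independently from its marginal."""
    n = len(records)
    segments = Counter(r[0] for r in records)
    flags = Counter(r[1] for r in records)
    return {(s, f): (segments[s] / n) * (flags[f] / n)
            for s, f in product(segments, flags)}

true_joint = joint(records)
smoothed = product_of_marginals(records)
for combo in sorted(smoothed):
    print(combo, round(true_joint.get(combo, 0.0), 2), round(smoothed[combo], 2))
```

In this toy case both marginals are preserved exactly, yet the tail cell (UHNW with a carry-forward) shrinks from 10% to 1% of records, and 18% of the probability mass lands on combinations that never occur in the source.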

How we resolve it

Our resolution is archetype-driven generation, and the choice has consequences across all three dimensions.

Instead of training a generative model on real records and sampling synthetic records from the learned distribution, we generate from a set of named archetypes whose population statistics are derived from public aggregates (FRB SCF, IRS SOI, BLS CES). The model fills in record-level detail constrained by the archetype; the archetype constrains the joint distribution; the source data is itself public and aggregated and so cannot be linked back to any individual.

Formula: archetype-conditional fidelity

P_synth(x) = Σ_a P(x | archetype = a) × P(archetype = a)

P_synth(x): synthetic dataset distribution
P(x | archetype = a): per-archetype generator output (constrained by archetype invariants)
P(archetype = a): archetype population weights, derived from public aggregates

The decomposition matters because it separates the privacy concern (which lives only in the per-archetype weights, all derivable from public aggregates) from the fidelity concern (which lives in the per-archetype generator). Privacy is bounded by construction; fidelity scales with how good the per-archetype generator is.
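The mixture decomposition can be read as a two-step sampling procedure: draw an archetype from the weights, then draw a record from that archetype's generator. The sketch below assumes two hypothetical archetypes with invented weights, field names, and value ranges; it is not the production generator.

```python
import random

# Hypothetical archetype weights P(archetype = a). In the text these are
# derived from public aggregates; the values here are invented.
ARCHETYPE_WEIGHTS = {"retiree_db_pension": 0.6, "pre_ipo_founder": 0.4}

def gen_retiree(rng):
    # Archetype invariant (assumed): pension income present, no vested ISOs.
    return {"archetype": "retiree_db_pension", "age": rng.randint(62, 90),
            "pension_income": rng.randint(40_000, 120_000), "vested_iso_value": 0}

def gen_founder(rng):
    # Archetype invariant (assumed): vested ISOs present, no pension income.
    return {"archetype": "pre_ipo_founder", "age": rng.randint(28, 50),
            "pension_income": 0, "vested_iso_value": rng.randint(500_000, 6_000_000)}

GENERATORS = {"retiree_db_pension": gen_retiree, "pre_ipo_founder": gen_founder}

def sample_record(rng):
    """Draw a ~ P(archetype), then x ~ P(x | archetype = a)."""
    names = list(ARCHETYPE_WEIGHTS)
    weights = [ARCHETYPE_WEIGHTS[name] for name in names]
    archetype = rng.choices(names, weights=weights, k=1)[0]
    return GENERATORS[archetype](rng)

rng = random.Random(0)
records = [sample_record(rng) for _ in range(5000)]
retiree_share = sum(r["archetype"] == "retiree_db_pension" for r in records) / len(records)
print(f"retiree share: {retiree_share:.3f}")
```

Because every record comes from exactly one generator, a record mixing a pension with vested ISOs cannot be produced unless an archetype explicitly allows it.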

The trade-off this introduces is one of generator quality. The per-archetype generator has to be good enough to produce records that respect joint constraints across hundreds of fields and 96 longitudinal months — a hard problem, and the engineering investment is the bulk of where the budget for a serious synthetic-data product goes. But the trade-off has shifted: it is no longer "more fidelity = more privacy risk" but "more fidelity = more engineering investment in the generator." That is a much better trade-off to be making.

Why this works for fintech specifically

The archetype-driven approach works for fintech because finance has natural archetype structure that other domains lack. A retiree with a $2M defined-benefit pension and a $500K Roth account is a different archetype from a 38-year-old startup founder pre-IPO with $4M in vested ISOs. They share almost no field-level distributions despite having similar net worth. A general-purpose generator trained on the union of both types ends up smearing across the boundary and producing impossible chimera records.

Archetypes break the chimera problem by making the high-level segmentation explicit and disallowing the pathological joint distributions structurally.

Where the new trade-off bites

Archetype-driven generation has its own trade-off: archetype coverage. Any household that doesn't fit one of the named archetypes is unrepresentable. A 28-year-old single mother who inherited $12M from a great-aunt's estate and is now both a single parent and a UHNW investor — does she fit one of our archetypes? Probably not exactly. We can express her as a 70% match to "young-professional single parent" with a 30% overlay from "inheritance-receiving UHNW," but the exact joint distribution of her case may be weakly represented.

The fix is more archetypes, more overlays, and explicit treatment of overlay-driven joint distributions. v3 of our pipeline shipped 71 archetypes; v4 added five conditional overlays for life events (inheritance, divorce, bankruptcy, immigration, terminal illness). Each archetype × overlay combination requires re-tuning of the per-archetype generator. The investment is real; the privacy posture stays clean.

A working framework for procurement teams

When a procurement team asks us "how should we evaluate the privacy-fidelity-utility trade-off across vendors," we suggest they ask three questions instead.

The three questions that pin down the trade-off

1. What is the source data for the generator? Public aggregates (clean privacy story) vs. licensed real records (privacy risk that depends on every step downstream) vs. internal customer data (auditor-visible privacy risk).
2. How is the joint distribution constrained? Archetype-driven (explicit, inspectable) vs. learned end-to-end (implicit, opaque) vs. rule-based with marginal sampling (rigid, poor on edge cases).
3. What's the formal privacy guarantee, and at what ε? A vendor that claims differential privacy at ε > 10 is not making a meaningful privacy claim; a vendor that claims 'privacy by construction' should be able to explain the construction in 200 words.
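The three questions can be kept as a small review checklist. The sketch below is one hypothetical encoding: the class, field names, and category strings are invented for illustration, and the only numeric rule (ε > 10) is the one stated in question 3.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VendorAnswers:
    source_data: str          # "public_aggregates", "licensed_real_records", "internal_customer_data"
    joint_constraint: str     # "archetype_driven", "learned_end_to_end", "rule_based_marginals"
    dp_epsilon: Optional[float] = None   # None if no differential privacy claim is made
    construction_explained: bool = False  # can the vendor explain "privacy by construction"?

def review_flags(answers: VendorAnswers) -> List[str]:
    """Return the concerns raised by a vendor's answers to the three questions."""
    flags = []
    if answers.source_data != "public_aggregates":
        flags.append("source data carries privacy risk that depends on downstream handling")
    if answers.joint_constraint == "learned_end_to_end":
        flags.append("joint distribution is implicit and opaque")
    elif answers.joint_constraint == "rule_based_marginals":
        flags.append("rigid constraints, likely poor on edge cases")
    if answers.dp_epsilon is not None and answers.dp_epsilon > 10:
        flags.append("epsilon above 10 is not a meaningful privacy claim")
    if answers.dp_epsilon is None and not answers.construction_explained:
        flags.append("no formal guarantee and no explained construction")
    return flags

vendor = VendorAnswers("licensed_real_records", "learned_end_to_end", dp_epsilon=25.0)
for flag in review_flags(vendor):
    print("-", flag)
```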

The combination of answers tells you what trade-off the vendor has actually made — and whether their trade-off matches the use case you have in mind.

What good looks like

A production-grade synthetic financial dataset in 2026 has, in our view, the following profile across the three dimensions.

Privacy: by construction, not by ε. Source data is public aggregates; per-archetype weights are themselves derivable from public sources. No linkability claim depends on a privacy budget calculation.

Fidelity: joint-distribution faithful at the archetype level. Per-archetype generators preserve cross-field invariants up to 4-way joint distributions. KL divergence under 0.15 against public references.

Utility: use-case-validated. Explicit validation gates for the use cases the dataset is sold to support: TLH, RMD, fair-lending, suitability scoring, illustration validation. A vendor that ships without these is selling generic data and labeling it specific.
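A KL-divergence gate like the 0.15 threshold in the fidelity profile could be checked along the following lines. This is a sketch under assumptions the text does not specify: the divergence is taken as KL(reference || synthetic) in nats over pre-binned histograms, empty synthetic bins are floored at a small constant, and the bin values are invented.

```python
import math

KL_GATE = 0.15  # threshold quoted in the fidelity profile

def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) in nats for two histograms over the same bins.
    Inputs are normalized first; q is floored to avoid division by zero."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    p_total, q_total = sum(p), sum(q)
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / p_total, q_i / q_total
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, floor))
    return total

# Invented example: share of households per net-worth bin.
reference = [0.50, 0.30, 0.15, 0.05]   # public reference
synthetic = [0.48, 0.32, 0.16, 0.04]   # synthetic dataset

kl = kl_divergence(reference, synthetic)
print(f"KL = {kl:.4f}, passes gate: {kl < KL_GATE}")
```

KL divergence is asymmetric, so the direction and the binning should be fixed and documented alongside the threshold.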

The headline is that the trade-off is not "pick two." The trade-off is "pick the right architecture." Archetype-driven generation against public aggregates collapses the privacy axis by construction; once that's done, what's left is fidelity-vs-utility, which any decent generator can be tuned against per use case. The trilemma framing was a property of the generative-model-on-real-data architectures that dominated 2018–2022 thinking. It is increasingly an artifact of those architectures, not a property of synthetic data.

Key takeaways

The textbook trilemma is real, but the trade-offs do not live where the textbook puts them — they live in three separable knobs (privacy budget, distributional smoothing, edge-case retention).
Differential privacy at ε levels useful for wealth-tech is not meaningfully private. The math forces this; engineering doesn't fix it.
Archetype-driven generation against public aggregates collapses the privacy axis by construction, leaving fidelity-vs-utility as the remaining engineering problem.
The new trade-off is archetype coverage. Any household that doesn't fit a named archetype is weakly represented; the fix is more archetypes and explicit overlays for life events.
Procurement teams should ask three questions: source data, joint-distribution constraint, and formal privacy guarantee. The combination tells you the actual trade-off the vendor has made.

Frequently asked questions

Is differential privacy obsolete for synthetic financial data?

Not obsolete — but its useful regime is narrower than the literature suggests. DP-synth shines for population-level analytics, marginal queries, and small-feature ML. It struggles for record-level fidelity at the field counts wealth-tech engines require. We use DP-synth as a complementary tool for population-level reports, not as the primary architecture for record-level synthetic data.

What's the privacy story when archetype weights are derived from real data?

Population-aggregate sources (FRB SCF public-use, IRS SOI tabulations) are themselves released with statistical disclosure protections applied at source. The archetype weights inherit that protection without any further work on our part. If we used licensed proprietary data to derive archetype weights, the privacy story would change, and we'd have to disclose the source data and its privacy treatment in the manifest.

How do we test for privacy leakage in a synthetic dataset?

Membership inference attacks and shadow-model attacks are the standard. Both require a known-real dataset to compare against; both are easier to run if the synthetic-data vendor provides an attack baseline. A vendor that won't or can't provide one is shipping a privacy claim they can't defend. Run the attacks on every refresh, log the AUC, and treat any AUC > 0.55 as a serious leak signal.
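A minimal sketch of the AUC logging step, using a distance-to-closest-record attack as a simple stand-in for a full shadow-model attack. The records are invented two-field points, the function names are hypothetical, and only the 0.55 threshold comes from the text.

```python
import math

LEAK_THRESHOLD = 0.55  # AUC above this is treated as a serious leak signal

def closeness_score(record, synthetic):
    """Attack score: negative distance to the closest synthetic record.
    Higher means the attacker guesses 'was in the source data'."""
    return -min(math.dist(record, s) for s in synthetic)

def attack_auc(member_scores, non_member_scores):
    """AUC as the probability that a random member outscores a random
    non-member, counting ties as half."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))

# Invented data: known-real records used as source (members), known-real
# records held out (non-members), and the synthetic release.
synthetic = [(1.0, 1.0), (5.0, 5.0)]
members = [(1.0, 1.1), (5.0, 4.9)]
non_members = [(3.0, 3.0), (8.0, 8.0)]

auc = attack_auc([closeness_score(r, synthetic) for r in members],
                 [closeness_score(r, synthetic) for r in non_members])
print(f"attack AUC = {auc:.2f}, leak signal: {auc > LEAK_THRESHOLD}")
```

An AUC near 0.5 means the attack cannot tell members from non-members; in this toy release the synthetic points sit almost on top of the source records, so the attack separates them perfectly.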

Does this analysis change for non-wealth fintech (payments, lending, fraud)?

The architecture lessons transfer. The specific trade-offs do not — payments has lower joint-distribution complexity than wealth and tighter privacy regimes (PCI DSS); lending has stronger fair-lending overlays; fraud has the unique problem that adversarial signal lives in the tail and smoothing destroys utility. Adjust the analysis per domain; keep the three-knob frame.