How synthetic financial data is actually generated — rules, GANs, LLMs, and hybrid pipelines

WealthSchema Staff | Pipeline architecture | May 8, 2026 | 6 min read

There are four families of techniques for generating synthetic financial data in production. Each has its own competence band, its own failure modes, and its own characteristic bug class. A buyer who doesn't know which family their vendor is in is going to be surprised by which bugs show up — and the surprises tend to be expensive.

This article is the architectural review we wish every prospective buyer would read before the procurement call. It takes no position on which family is best in absolute terms, because there is no best in absolute terms. It takes a strong position on which family fits which use case.

The four families

Family	Best at	Worst at	Typical bug class
Rule-based	Hard constraints, regulatory invariants	Diversity, edge cases, narrative richness	Records that are valid but feel mechanical
GAN / VAE / Diffusion	Distributional fidelity at scale	Constraint satisfaction, rare events	Records that look real and break invariants
LLM-based	Narrative fields, plausibility, edge cases	Numerical determinism, joint constraints across many fields	Hallucinated cross-field inconsistencies
Hybrid (rules + LLM, rules + GAN)	Production fidelity at production cost	Engineering simplicity	Failure modes that span the seam between components

Family 1: rule-based generation

Rule-based generators produce synthetic records by applying a sequence of deterministic transformations to a sampled or seeded input. The classic example is Faker for tabular data; a more sophisticated example is a custom Python pipeline that draws from a state-by-state distribution of incomes, applies federal and state tax brackets, and emits a deterministic 1040 line by line.

The strength of rule-based generation is constraint satisfaction. If your engine demands that gross income minus deductions equals AGI, rule-based generation gives you that identity for free, by construction. If your engine demands that wash-sale-disallowed losses are tracked at lot level with the IRS rule applied taxpayer-wide across linked accounts, rule-based generation can encode the rule directly.
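A minimal sketch of what "by construction" means here, using the simplified identity from the paragraph above. The `TaxRecord` type, the `generate_record` helper, and the income and deduction ranges are illustrative assumptions, not a real tax schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxRecord:
    gross_income: int
    deductions: int
    agi: int


def generate_record(rng: random.Random) -> TaxRecord:
    # Hypothetical ranges in whole dollars; a real pipeline would draw
    # from state-by-state income distributions instead.
    gross_income = rng.randint(20_000, 400_000)
    deductions = rng.randint(0, gross_income // 4)
    # AGI is derived from the other two fields and never sampled on its own,
    # so the identity cannot be violated.
    agi = gross_income - deductions
    return TaxRecord(gross_income, deductions, agi)


rng = random.Random(42)
records = [generate_record(rng) for _ in range(1_000)]
violations = sum(1 for r in records if r.gross_income - r.deductions != r.agi)
print(f"identity violations: {violations}")
```

The design choice is that derived fields are computed, not generated. There is no code path on which the identity can fail, so no check is needed to enforce it.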

The weakness is everything else. Rule-based generators produce records that satisfy the rules and nothing more. The diversity of the output is bounded by the diversity of the rules, which is in turn bounded by the willingness of the engineering team to encode every edge case explicitly. A real population has a long tail of unusual cases (a household with a Section 1042 ESOP rollover and a qualifying disposition of incentive stock options in the same year), and rule-based generators only have those cases if a developer thought to add them.

Family 2: deep generative models (GAN, VAE, diffusion)

Deep generative models learn the joint distribution of a real-data corpus and sample from the learned distribution to produce synthetic records. GANs and VAEs were the dominant approaches through ~2022; diffusion models are increasingly the state of the art.

The strength of deep generative models is distributional fidelity. A well-trained GAN on a large enough real-data corpus produces synthetic records whose marginal and pairwise joint distributions are nearly indistinguishable from the source. Population-level analyses on the synthetic data give answers within a few percent of population-level analyses on the real data.

The weakness is constraint satisfaction. A GAN that has learned the joint distribution of gross_income, deductions, and agi from a million 1040 records will produce synthetic records where the identity agi = gross_income − deductions holds approximately. "Approximately" is not good enough for downstream tax engines. The GAN does not know that the identity is a hard constraint; it has only learned that the values are correlated.
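To make "approximately" concrete, the sketch below uses a stand-in sampler, not a trained GAN. It draws correlated fields and adds a small relative error to `agi`, which is one simple way to imitate a model that has learned a correlation but not an identity. The 0.5% noise level and the value ranges are assumptions chosen for illustration.

```python
import random


def sample_approximate_record(rng: random.Random) -> dict:
    """Stand-in for a learned sampler: fields are correlated, not tied by an identity."""
    gross_income = rng.uniform(20_000, 400_000)
    deductions = rng.uniform(0, gross_income / 4)
    # Hypothetical 0.5% relative noise models "the identity holds approximately".
    agi = (gross_income - deductions) * (1 + rng.gauss(0, 0.005))
    return {
        "gross_income": round(gross_income),
        "deductions": round(deductions),
        "agi": round(agi),
    }


def identity_error(record: dict) -> int:
    """Dollar gap between the reported AGI and gross_income - deductions."""
    return record["agi"] - (record["gross_income"] - record["deductions"])


rng = random.Random(7)
samples = [sample_approximate_record(rng) for _ in range(1_000)]
exact = sum(1 for s in samples if identity_error(s) == 0)
within_2pct = sum(1 for s in samples if abs(identity_error(s)) <= 0.02 * s["agi"])
print(f"exact identity: {exact} / {len(samples)}")
print(f"within 2%:      {within_2pct} / {len(samples)}")
```

Under these assumptions nearly every record passes a tolerance check and very few pass an exact check. An aggregate fidelity report would look healthy while a tax engine that asserts the identity would reject most of the corpus.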

The other weakness is rare events. Deep generative models trained on real data have the long tail of rare cases drowned out by the mode of common cases. UHNW carryforwards, multi-state filers, IRMAA bracket transitions — the cases your engine has to handle correctly — are the cases the model is least likely to produce. The dataset looks fine in aggregate and fails on the cases that actually matter.
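A rough way to see the size of the problem is to ask how likely it is that a sample contains no instance of a rare case at all. The sketch below assumes independent draws and assumes the model reproduces the case at its true population rate, which is optimistic given the mode-seeking behavior described above. The rates are hypothetical, and the sample size of 1,500 is borrowed from the corpus size mentioned later in this article.

```python
import math


def prob_zero_occurrences(rate: float, n_samples: int) -> float:
    """Chance that n independent draws contain no instance of a case with this rate."""
    # (1 - rate) ** n, computed in log space for numerical stability.
    return math.exp(n_samples * math.log1p(-rate))


# Hypothetical population rates for progressively rarer household types.
for rate in (1e-2, 1e-3, 1e-4):
    p_missing = prob_zero_occurrences(rate, 1_500)
    print(f"rate {rate:g}: P(case absent from 1,500 records) = {p_missing:.3f}")
```

Even with a perfectly faithful model, a case that occurs once in ten thousand households is more likely than not to be missing from a corpus of this size. Explicit archetype overlays are a way of guaranteeing the case is present instead of hoping it is sampled.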

Family 3: LLM-based generation

LLM-based generation is the newest approach in production. The model is given a structured prompt describing the desired record (archetype, constraints, target fields) and asked to emit a JSON record that fills the schema. The recent generation of LLMs (Claude Sonnet 4+, GPT-5+, Gemini 2+) is good enough at structured output and constraint reasoning to produce plausible records with rich narrative detail.

The strength of LLM generation is plausibility and edge-case coverage. An LLM can produce a household with a Section 1042 ESOP rollover and a qualifying disposition of incentive stock options because the rules and the rationale are in its training data. A rule-based generator would have needed a developer to encode that case; a GAN trained on a corpus of mostly typical households would never have produced it.

The weakness is numerical determinism. LLMs hallucinate. A 96-month longitudinal record generated in a single LLM call has a measurable probability of drifting in the back half of the sequence, producing identical-looking but slightly off values, or producing values that violate cross-field invariants the model wasn't paying close attention to. The hallucination rate scales with the number of fields, the length of the sequence, and the complexity of the cross-field dependencies.

Formula

LLM hallucination scaling (rough)

P(hallucination) ≈ 1 − (1 − ε_field)^(fields × dependencies × time_steps)

ε_field = Per-field hallucination probability (~0.001 for current frontier models on structured tasks)
fields = Field count per record (~500 for a wealth profile)
dependencies = Cross-field dependency count (~20 for tax-aware records)
time_steps = Longitudinal snapshots (96 for monthly over 8 years)

At realistic numbers (500 × 20 × 96, an exponent of 960,000), the joint probability of at least one hallucination per record is essentially 1.0. The takeaway is not that LLMs can't be used; it's that they require chunked generation, validation gates, and retry loops. A single-call LLM pipeline is a bug factory. See [common synthetic data mistakes in fintech](/articles/8-synthetic-data-mistakes).
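The rough formula is easy to evaluate directly. The sketch below plugs in the article's own numbers; the function names are illustrative, and the log-space form is used only to avoid floating-point underflow with an exponent this large.

```python
import math


def p_any_hallucination(eps_field: float, fields: int, dependencies: int, time_steps: int) -> float:
    """1 - (1 - eps_field) ** (fields * dependencies * time_steps), computed stably."""
    exponent = fields * dependencies * time_steps
    return -math.expm1(exponent * math.log1p(-eps_field))


def units_for_even_odds(eps_field: float) -> float:
    """Exponent at which the probability of at least one hallucination reaches 0.5."""
    return math.log(0.5) / math.log1p(-eps_field)


single_call = p_any_hallucination(0.001, 500, 20, 96)  # all 96 months in one call
one_chunk = p_any_hallucination(0.001, 500, 20, 32)    # one chunk of 32 months
one_month = p_any_hallucination(0.001, 500, 20, 1)     # a single monthly snapshot

print(f"single call: {single_call:.6f}")
print(f"one chunk:   {one_chunk:.6f}")
print(f"one month:   {one_month:.6f}")
print(f"exponent for 50% odds: {units_for_even_odds(0.001):.0f}")
```

Taken at face value, the formula puts even a single monthly snapshot very close to 1, so chunking alone does not make records clean. What chunking buys is a smaller unit to validate and retry, which is why the validation gates and retry loops carry as much weight as the chunk size.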

Family 4: hybrid pipelines

The honest answer for production fintech synthetic data is that none of the three pure approaches is sufficient on its own. The hybrid approach combines them, with each component handling what it's best at.

A typical hybrid pipeline looks like this:

Stage 1: Archetype selection and parameterization
Rule-based. Pick from a named distribution of archetypes, sample population statistics, set explicit invariants the record must satisfy.
Stage 2: Core record generation
LLM-based. Fill the schema with archetype-conditioned record content, narrative fields, plausible cross-field detail.
Stage 3: Longitudinal projection
LLM-based, chunked. Generate 96 monthly snapshots in 3 chunks of 32 to avoid mid-sequence drift; validate seam continuity.
Stage 4: Constraint repair
Rule-based. Enforce hard identities (net worth, AGI, cash flow) by adjusting derived fields; flag any identity violation that requires a >5% adjustment, since a gap that large means the LLM hallucinated badly enough that the record should be discarded.
Stage 5: Validation gate
Rule-based. Run the full battery of identity checks, archetype invariants, cross-field plausibility tests. Strict-fail on any warning.
Stage 6: Quality judge
LLM-based. A separate model evaluates the record for plausibility issues the rule-based gates would miss (geographic merchant drift, archetype-narrative mismatch, period-correctness). Its verdict drives ship/regenerate.
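Stages 3 and 4 above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's actual code: `chunk_bounds` and `repair_agi` are hypothetical helpers, the record is a plain dict, and the 5% threshold is measured relative to the recomputed value, which the stage description does not specify.

```python
from typing import Optional


def chunk_bounds(total_steps: int = 96, n_chunks: int = 3) -> list[tuple[int, int]]:
    """Split a monthly sequence into equal [start, end) index ranges, one per LLM call."""
    if total_steps % n_chunks != 0:
        raise ValueError("total_steps must divide evenly into n_chunks")
    size = total_steps // n_chunks
    return [(i * size, (i + 1) * size) for i in range(n_chunks)]


def repair_agi(record: dict, max_adjustment: float = 0.05) -> Optional[dict]:
    """Recompute the derived AGI field; return None when the repair would be too large."""
    expected = record["gross_income"] - record["deductions"]
    # Assumption: adjustment size is relative to the recomputed (expected) value.
    adjustment = abs(record["agi"] - expected) / max(abs(expected), 1)
    if adjustment > max_adjustment:
        return None  # discard and regenerate instead of silently repairing
    return {**record, "agi": expected}


print(chunk_bounds())

slightly_off = {"gross_income": 100_000, "deductions": 10_000, "agi": 90_500}
badly_off = {"gross_income": 100_000, "deductions": 10_000, "agi": 70_000}
print(repair_agi(slightly_off))
print(repair_agi(badly_off))
```

Returning `None` instead of a repaired record keeps the discard decision explicit, so the caller has to choose between regenerating and dropping the household.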

The hybrid approach is harder to engineer because the failure modes span the seam between components. An LLM stage that produces a slightly-off value can be repaired by the rule stage, but the repair changes the joint distribution, which affects downstream stages. A bug in the validation stage can pass records that the LLM produces but the engine downstream can't handle. Tracking these seam bugs requires the kind of integration testing that purely rule-based or purely deep-model pipelines don't need.

The hybrid approach is also more expensive. Generation cost for a hybrid pipeline at production fidelity runs $0.50–$2.00 per household, versus pennies for rule-based and a few cents for GAN inference. The cost is what it is — production-grade fidelity is not free.

What we ship and why

Our pipeline is hybrid, weighted toward LLM-based generation in the core and longitudinal stages with rule-based archetype seeding and validation. The choice was deliberate: archetype-driven LLM generation gives us the privacy story (no real-data training corpus, sources are public aggregates) and the edge-case coverage (the LLM has seen every regulatory citation in its training data) at the cost of generation expense and the need for strict validation.

The cost we accept is that our generation pipeline is the most expensive of the four families per record. The benefit is that we can produce a 1,500-household corpus covering 71 archetypes with explicit edge-case overlays for under $5,000 of LLM spend — and the dataset passes the strict-fail validation gate that a generic LLM pipeline cannot.
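As a quick consistency check, the per-household cost range quoted earlier can be multiplied out against the corpus size. The variable names are illustrative; every figure comes from this article.

```python
households = 1_500
archetypes = 71
cost_per_household_low = 0.50   # dollars, low end of the quoted range
cost_per_household_high = 2.00  # dollars, high end of the quoted range
budget_ceiling = 5_000          # dollars of LLM spend

spend_low = households * cost_per_household_low    # 750.0
spend_high = households * cost_per_household_high  # 3000.0
records_per_archetype = households / archetypes    # about 21

print(f"estimated spend: ${spend_low:,.0f} to ${spend_high:,.0f}")
print(f"fits under ceiling: {spend_high < budget_ceiling}")
print(f"records per archetype: {records_per_archetype:.1f}")
```

The quoted range puts the corpus at $750 to $3,000, which sits under the $5,000 figure.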

If you're evaluating a vendor, the diagnostic question is "which family are you in?" Answers fall into three buckets. "All of the above" without naming the seam: they're one family with a marketing wrapper, almost always rule-based. "Proprietary": same answer. A specific description of where rule-based seeding hands off to LLM core to GAN time-series, with the validation gate at each interface: they're running a real hybrid and the rest of the conversation can be substantive.

Key takeaways

Four families of synthetic-data generation: rule-based, deep generative, LLM-based, hybrid. Each has a distinct competence band and a distinct characteristic bug class.
Rule-based wins for fully-specified, regulation-driven use cases. Deep generative wins for distributional fidelity at scale. LLM-based wins for narrative richness and edge-case coverage. Hybrid wins for production wealth-tech.
LLM hallucination scales multiplicatively with fields × dependencies × time steps. Single-call long-sequence generation is a bug factory; chunked generation with validation gates is the workable pattern.
Hybrid pipelines are harder to build but produce records that pass strict validation. Generation cost runs $0.50–$2.00 per household for production fidelity.
When evaluating vendors, ask explicitly which family they are in and what the seam between components looks like. Vague answers are a tell.

Frequently asked questions

Does diffusion replace GANs for synthetic financial data?

For tabular data, increasingly yes — diffusion models score better on most fidelity benchmarks than GANs as of 2025. For mixed tabular + sequential data (households with longitudinal records), the picture is more mixed. Diffusion models for sequence data are still actively researched, and most production wealth-tech vendors have stayed with GAN/VAE-based time-series approaches or moved directly to LLM-based generation. The right answer in two years is probably diffusion + LLM hybrid, but that's not the production state of the art today.

Can I use GPT-5 / Claude / open-source LLMs to generate synthetic data myself?

Yes, with caveats. You can produce plausible records with a few hundred lines of glue code. You will discover the hallucination scaling problem the first time you try a 96-month longitudinal record, and the validation problem the first time a record passes your eyeball check and breaks your downstream engine. The path from 'works on a few demo records' to 'production-grade strict-validation pipeline' is roughly six engineering months. If that fits your roadmap, build. If it doesn't, buy.

How does generation cost scale with archetype count?

Roughly linearly. The dominant cost is per-record LLM tokens, and per-archetype overhead (prompt engineering, validation tuning) is a one-time cost. A 71-archetype corpus at 20 records per archetype costs essentially the same per record as a 5-archetype corpus at 200 records per archetype, but the 71-archetype version covers more edge cases and is more useful for general-purpose engines. Buyers usually under-buy on archetype coverage and over-buy on records per archetype.

Does the validation step itself need to be LLM-based?

The hard-identity gates (net worth, AGI, cash flow) should be rule-based — deterministic, fast, cheap. The plausibility judge (does the record narratively make sense, do the merchant locations match the household geography, is the timeline coherent) is hard to encode as rules and benefits from an LLM. Production pipelines run rule-based validation as a fast first gate and an LLM judge as a slower second gate, with both required to pass.
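The two-gate arrangement can be sketched as below. The field names and the `rule_gate` and `validate` helpers are hypothetical, and the LLM judge is represented by a plain callable so the example runs without any model API.

```python
from typing import Callable


def rule_gate(record: dict) -> list[str]:
    """Fast first gate: deterministic identity checks, returns a list of violations."""
    errors = []
    if record["gross_income"] - record["deductions"] != record["agi"]:
        errors.append("agi identity")
    if record["assets"] - record["liabilities"] != record["net_worth"]:
        errors.append("net worth identity")
    return errors


def validate(record: dict, judge: Callable[[dict], bool]) -> tuple[bool, list[str]]:
    """Run the cheap rule gate first; only call the slower judge if it passes."""
    errors = rule_gate(record)
    if errors:
        return False, errors
    if not judge(record):
        return False, ["judge rejected"]
    return True, []


def stub_judge(record: dict) -> bool:
    # Placeholder for an LLM plausibility call; always accepts in this sketch.
    return True


good = {
    "gross_income": 120_000, "deductions": 20_000, "agi": 100_000,
    "assets": 900_000, "liabilities": 250_000, "net_worth": 650_000,
}
bad = {**good, "net_worth": 700_000}

print(validate(good, stub_judge))
print(validate(bad, stub_judge))
```

Ordering the gates this way means the expensive judge call is only spent on records that already satisfy the hard identities.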