There are four families of techniques for generating synthetic financial data in production. Each has its own competence band, its own failure modes, and its own characteristic bug class. A buyer who doesn't know which family their vendor is in is going to be surprised by which bugs show up — and the surprises tend to be expensive.
This article is the architectural review we wish every prospective buyer would read before the procurement call. It takes no position on which family is best in absolute terms, because there is no best in absolute terms. It takes a strong position on which family fits which use case.
The four families
| Family | Best at | Worst at | Typical bug class |
|---|---|---|---|
| Rule-based | Hard constraints, regulatory invariants | Diversity, edge cases, narrative richness | Records that are valid but feel mechanical |
| GAN / VAE / Diffusion | Distributional fidelity at scale | Constraint satisfaction, rare events | Records that look real and break invariants |
| LLM-based | Narrative fields, plausibility, edge cases | Numerical determinism, joint constraints across many fields | Hallucinated cross-field inconsistencies |
| Hybrid (rules + LLM, rules + GAN) | Production fidelity at production cost | Engineering simplicity | Failure modes that span the seam between components |
Family 1: rule-based generation
Rule-based generators produce synthetic records by applying a sequence of deterministic transformations to a sampled or seeded input. The classic example is Faker for tabular data; a more sophisticated example is a custom Python pipeline that draws from a state-by-state distribution of incomes, applies federal and state tax brackets, and emits a deterministic 1040 line by line.
The strength of rule-based generation is constraint satisfaction. If your engine demands that gross income minus deductions equals AGI, rule-based generation gives you that identity for free, by construction. If your engine demands that wash-sale-disallowed losses are tracked at lot level with the IRS rule applied taxpayer-wide across linked accounts, rule-based generation can encode the rule directly.
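A minimal sketch of what "by construction" means here, assuming the simplified identity used in this article (AGI = gross income minus deductions). The `Return1040` type, the `generate_return` function, and the sampling ranges are hypothetical illustrations, not a real tax model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Return1040:
    gross_income: int  # whole dollars
    deductions: int    # whole dollars
    agi: int           # derived from the two fields above, never sampled


def generate_return(rng: random.Random) -> Return1040:
    """Sample the free fields, then derive AGI so the identity cannot fail."""
    gross_income = rng.randint(20_000, 400_000)  # hypothetical range
    # Cap deductions at gross income so this toy AGI never goes negative.
    deductions = rng.randint(0, min(gross_income, 60_000))
    agi = gross_income - deductions
    return Return1040(gross_income, deductions, agi)


rng = random.Random(42)  # seeded, so the output is reproducible
records = [generate_return(rng) for _ in range(1_000)]
violations = sum(r.agi != r.gross_income - r.deductions for r in records)
print(f"{len(records)} records, {violations} identity violations")
```

Because `agi` is computed rather than sampled, no validation pass is needed for this identity. The trade-off is the one described next: the generator only produces what its rules describe.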
The weakness is everything else. Rule-based generators produce records that satisfy the rules and nothing more. The diversity of the output is bounded by the diversity of the rules, which is in turn bounded by the willingness of the engineering team to encode every edge case explicitly. A real population has a long tail of unusual cases (a household with a Section 1042 ESOP rollover and a qualifying disposition of incentive stock options in the same year), and rule-based generators only have those cases if a developer thought to add them.
Family 2: deep generative models (GAN, VAE, diffusion)
Deep generative models learn the joint distribution of a real-data corpus and sample from the learned distribution to produce synthetic records. GANs and VAEs were the dominant approaches through ~2022; diffusion models are increasingly the state of the art.
The strength of deep generative models is distributional fidelity. A well-trained GAN on a large enough real-data corpus produces synthetic records whose marginal and pairwise joint distributions are nearly indistinguishable from the source. Population-level analyses on the synthetic data give answers within a few percent of population-level analyses on the real data.
The weakness is constraint satisfaction. A GAN that has learned the joint distribution of gross_income, deductions, and agi from a million 1040 records will produce synthetic records where the identity agi = gross_income − deductions holds approximately. "Approximately" is not good enough for downstream tax engines. The GAN does not know that the identity is a hard constraint; it has only learned that the values are correlated.
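The gap between "correlated" and "identical" can be illustrated without training a model. The sketch below is not a GAN: it is a stand-in sampler that assumes the learned relationship carries a small residual error (a hypothetical standard deviation of $150), and then counts how many records fail an exact identity check versus a loose one. All names and numbers are illustrative assumptions.

```python
import random


def sample_learned_record(rng: random.Random) -> dict:
    """Stand-in for a generative-model sample: AGI is correlated, not derived."""
    gross_income = rng.uniform(20_000, 400_000)
    deductions = rng.uniform(0, 60_000)
    # Assumption: the model reproduces the relationship up to residual noise.
    agi = gross_income - deductions + rng.gauss(0, 150.0)
    return {
        "gross_income": round(gross_income),
        "deductions": round(deductions),
        "agi": round(agi),
    }


def violates_identity(record: dict, tolerance: int = 0) -> bool:
    """True when AGI differs from gross income minus deductions by more than tolerance."""
    expected = record["gross_income"] - record["deductions"]
    return abs(record["agi"] - expected) > tolerance


rng = random.Random(7)
samples = [sample_learned_record(rng) for _ in range(10_000)]
exact_failures = sum(violates_identity(r) for r in samples)
loose_failures = sum(violates_identity(r, tolerance=1_000) for r in samples)
print(f"exact check failures: {exact_failures}, within $1,000: {loose_failures}")
```

Under these assumptions almost every record passes a statistical eyeball test and almost every record fails the exact check, which is the check a downstream tax engine effectively applies.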
The other weakness is rare events. Deep generative models trained on real data have the long tail of rare cases drowned out by the mode of common cases. UHNW carryforwards, multi-state filers, IRMAA bracket transitions — the cases your engine has to handle correctly — are the cases the model is least likely to produce. The dataset looks fine in aggregate and fails on the cases that actually matter.
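A rough way to see the rare-event problem is to ask how often a rare case would appear even if the model reproduced real-world prevalence perfectly, which is the best case. The prevalence value below is a hypothetical placeholder, not a measured figure, and the function names are illustrative.

```python
def expected_rare_count(prevalence: float, n_records: int) -> float:
    """Expected number of rare-case records in a sample of n_records."""
    return prevalence * n_records


def p_at_least_one(prevalence: float, n_records: int) -> float:
    """Probability that a sample contains at least one rare-case record."""
    return 1.0 - (1.0 - prevalence) ** n_records


# Hypothetical: a case that occurs in 1 of every 10,000 real households.
prevalence = 1e-4
n_records = 1_500

print(f"expected count: {expected_rare_count(prevalence, n_records):.2f}")
print(f"P(at least one): {p_at_least_one(prevalence, n_records):.3f}")
```

With these placeholder numbers the expected count is well below one record, so a corpus of that size would usually contain no instance of the case at all. Mode-seeking behaviour in a trained model would only push the count lower.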
Family 3: LLM-based generation
LLM-based generation is the newest approach in production. The model is given a structured prompt describing the desired record (archetype, constraints, target fields) and asked to emit a JSON record that fills the schema. The recent generation of LLMs (Claude Sonnet 4+, GPT-5+, Gemini 2+) is good enough at structured output and constraint reasoning to produce plausible records with rich narrative detail.
The strength of LLM generation is plausibility and edge-case coverage. An LLM can produce a household with a Section 1042 ESOP rollover and a qualifying disposition of incentive stock options because the rules and the rationale are in its training data. A rule-based generator would have needed a developer to encode that case; a GAN trained on a corpus dominated by ordinary cases would never have produced it.
The weakness is numerical determinism. LLMs hallucinate. A 96-month longitudinal record generated in a single LLM call has a measurable probability of drifting in the back half of the sequence, producing identical-looking but slightly off values, or producing values that violate cross-field invariants the model wasn't paying close attention to. The hallucination rate scales with the number of fields, the length of the sequence, and the complexity of the cross-field dependencies.
P(hallucination) ≈ 1 − (1 − ε_field)^(fields × dependencies × time_steps)

where:
- ε_field = per-field hallucination probability (~0.001 for current frontier models on structured tasks)
- fields = field count per record (~500 for a wealth profile)
- dependencies = cross-field dependency count (~20 for tax-aware records)
- time_steps = longitudinal snapshots (96 for monthly over 8 years)
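The estimate above is easy to evaluate directly. The sketch below plugs in the approximate values quoted in the definitions; the function name `p_any_hallucination` is illustrative, and the formula is taken at face value as a rough model rather than a measured rate.

```python
def p_any_hallucination(
    eps_field: float, fields: int, dependencies: int, time_steps: int
) -> float:
    """Probability of at least one hallucinated value under the estimate above."""
    exponent = fields * dependencies * time_steps
    return 1.0 - (1.0 - eps_field) ** exponent


# Values quoted in the definitions above.
single_call = p_any_hallucination(0.001, fields=500, dependencies=20, time_steps=96)
one_snapshot = p_any_hallucination(0.001, fields=500, dependencies=20, time_steps=1)

print(f"96 snapshots in one call: {single_call:.6f}")
print(f"a single snapshot:        {one_snapshot:.6f}")
```

With the stated values the exponent for a single 96-month call is 960,000 and the estimate is indistinguishable from 1. Even a single snapshot (exponent 10,000) stays above 0.9999 under this model, which suggests that shortening the sequence helps with drift but does not remove the need for repair and validation after generation.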
Family 4: hybrid pipelines
The honest answer for production fintech synthetic data is that none of the three pure approaches is sufficient on its own. The hybrid approach combines them, with each component handling what it's best at.
A typical hybrid pipeline looks like this:
- Stage 1: Archetype selection and parameterization (rule-based). Pick from a named distribution of archetypes, sample population statistics, set explicit invariants the record must satisfy.
- Stage 2: Core record generation (LLM-based). Fill the schema with archetype-conditioned record content, narrative fields, plausible cross-field detail.
- Stage 3: Longitudinal projection (LLM-based, chunked). Generate 96 monthly snapshots in 3 chunks of 32 to avoid mid-sequence drift; validate seam continuity.
- Stage 4: Constraint repair (rule-based). Enforce hard identities (net worth, AGI, cash flow) by adjusting derived fields; flag if any identity violation requires a >5% adjustment (this means the LLM hallucinated badly enough to discard the record).
- Stage 5: Validation gate (rule-based). Run the full battery of identity checks, archetype invariants, cross-field plausibility tests. Strict-fail on any warning.
- Stage 6: Quality judge (LLM-based). A separate model evaluates the record for plausibility issues the rule-based gates would miss (geographic merchant drift, archetype-narrative mismatch, period-correctness). Verdict drives ship/regenerate.
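The chunking in Stage 3 and the repair rule in Stage 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the production implementation: the names `chunk_bounds`, `repair_agi`, and `RepairResult` are hypothetical, the AGI identity stands in for all hard identities, and the 5% threshold is assumed to be measured relative to the correct (derived) value.

```python
from dataclasses import dataclass

MAX_REPAIR_FRACTION = 0.05  # threshold from Stage 4: larger repairs mean discard


@dataclass
class RepairResult:
    record: dict       # copy of the input with the derived field corrected
    adjustment: float  # relative size of the correction that was applied
    discard: bool      # True when the correction exceeded the threshold


def chunk_bounds(total_steps: int = 96, chunk_size: int = 32) -> list[tuple[int, int]]:
    """Half-open (start, end) month ranges, one per LLM call in Stage 3."""
    return [
        (start, min(start + chunk_size, total_steps))
        for start in range(0, total_steps, chunk_size)
    ]


def repair_agi(record: dict) -> RepairResult:
    """Overwrite AGI with the derived value and report how large the fix was."""
    expected = record["gross_income"] - record["deductions"]
    # Assumption: measure the fix relative to the correct value; guard against zero.
    adjustment = abs(record["agi"] - expected) / max(abs(expected), 1)
    repaired = {**record, "agi": expected}
    return RepairResult(repaired, adjustment, adjustment > MAX_REPAIR_FRACTION)


print(chunk_bounds())
small = repair_agi({"gross_income": 100_000, "deductions": 20_000, "agi": 80_500})
large = repair_agi({"gross_income": 100_000, "deductions": 20_000, "agi": 90_000})
print(small.discard, large.discard)
```

The first record needs a correction of under 1% and is kept after repair; the second needs a 12.5% correction and is flagged for discard. Seam continuity between chunks would be checked separately, for example by comparing the last snapshot of one chunk with the first snapshot of the next.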
The hybrid approach is harder to engineer because the failure modes span the seam between components. An LLM stage that produces a slightly-off value can be repaired by the rule stage, but the repair changes the joint distribution, which affects downstream stages. A bug in the validation stage can pass records that the LLM produces but the engine downstream can't handle. Tracking these seam bugs requires the kind of integration testing that purely rule-based or purely deep-model pipelines don't need.
The hybrid approach is also more expensive. Generation cost for a hybrid pipeline at production fidelity runs $0.50–$2.00 per household, versus pennies for rule-based and a few cents for GAN inference. The cost is what it is — production-grade fidelity is not free.
What we ship and why
Our pipeline is hybrid, weighted toward LLM-based generation in the core and longitudinal stages with rule-based archetype seeding and validation. The choice was deliberate: archetype-driven LLM generation gives us the privacy story (no real-data training corpus, sources are public aggregates) and the edge-case coverage (the LLM has seen every regulatory citation in its training data) at the cost of generation expense and the need for strict validation.
The cost we accept is that our generation pipeline is the most expensive of the four families per record. The benefit is that we can produce a 1,500-household corpus covering 71 archetypes with explicit edge-case overlays for under $5,000 of LLM spend — and the dataset passes the strict-fail validation gate that a generic LLM pipeline cannot.
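As a quick consistency check, the per-household range quoted earlier ($0.50 to $2.00) can be multiplied out against the 1,500-household corpus. The sketch works in integer cents to avoid floating-point rounding; how much of the remaining budget goes to regenerated or discarded records is not stated in this article, so the headroom figure is only an upper bound on that overhead.

```python
HOUSEHOLDS = 1_500
COST_LOW_CENTS = 50      # $0.50 per household
COST_HIGH_CENTS = 200    # $2.00 per household
BUDGET_CENTS = 500_000   # the "under $5,000" figure


def corpus_cost_cents(households: int, per_household_cents: int) -> int:
    """Total generation cost in cents for a corpus of the given size."""
    return households * per_household_cents


low = corpus_cost_cents(HOUSEHOLDS, COST_LOW_CENTS)
high = corpus_cost_cents(HOUSEHOLDS, COST_HIGH_CENTS)
headroom = BUDGET_CENTS - high

print(f"${low / 100:,.0f} to ${high / 100:,.0f}, headroom ${headroom / 100:,.0f}")
```

The range works out to $750 to $3,000, which sits inside the $5,000 ceiling and leaves room for records that fail the validation gate and have to be regenerated.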
If you're evaluating a vendor, the diagnostic question is "which family are you in." Answers fall into three buckets. "All of the above" without naming the seam — they're one family with a marketing wrapper, almost always rule-based. "Proprietary" — same answer. A specific description of where rule-based seeding hands off to LLM core to GAN time-series, with the validation gate at each interface — they're running a real hybrid and the rest of the conversation can be substantive.
Key takeaways
- Four families of synthetic-data generation: rule-based, deep generative, LLM-based, hybrid. Each has a distinct competence band and a distinct characteristic bug class.
- Rule-based wins for fully-specified, regulation-driven use cases. Deep generative wins for distributional fidelity at scale. LLM-based wins for narrative richness and edge-case coverage. Hybrid wins for production wealth-tech.
- The probability of at least one LLM hallucination compounds with the product fields × dependencies × time steps. Single-call long-sequence generation is a bug factory; chunked generation with validation gates is the workable pattern.
- Hybrid pipelines are harder to build but produce records that pass strict validation. Generation cost runs $0.50–$2.00 per household for production fidelity.
- When evaluating vendors, ask explicitly which family they are in and what the seam between components looks like. Vague answers are a tell.