Synthetic-Data Corpus Sizing Worksheet
The most expensive way to size a synthetic corpus is by guessing. Build too small and you ship bugs the corpus didn't reach; build too large and you waste compliance review hours on cases the feature doesn't exercise. This worksheet computes a defensible floor from four inputs you already know: the feature scope, the edge-case classes that matter, the regression-test cadence, and the regulator's scope. It outputs a recommended Wealth Data Set bundle and a one-paragraph rationale.
What you walk away with
~12 min · 4 sections · 9 fields
- A computed minimum archetype count grounded in your feature's edge-case classes.
- A computed longitudinal depth (months) that exercises every time-dependent code path your feature touches.
- A computed event-density floor — events per household per year — sufficient for your monitoring and detection rules.
- A recommended Wealth Data Set bundle plus a rationale you can paste into a steering-committee deck.
Feature scope
Capture what the feature does and which code paths the corpus must exercise.
The shipping name a PM or examiner would recognize.
What category of wealth-tech feature is this?
How many distinct branches does the feature take across its core flow? Count each materially different path.
Anything specific to this feature that affects sizing — e.g. multi-state, ISO/AMT interaction, beneficiary changes mid-year.
Edge-case classes
Each class adds archetype demand. Tick the ones the feature must defensibly handle.
Each selected class adds archetypes to the floor.
Count of edge-case classes selected above. Auto-fills.
Regression-test cadence and regulator scope
How frequently does the test suite run end-to-end?
Which regulator reads the result of this feature?
What's the minimum trajectory length the feature exercises?
Computed corpus floor
Live calculations from the inputs above. The recommended bundle is a starting point — refine after running the Edge-Case Coverage scorecard against the populated corpus.
Higher when more regulators read the feature output — a multi-regulator feature needs a richer corpus floor.
Combines code-path count and edge-case classes into a single complexity index.
Floor for the corpus. Each edge-case class needs ~12 archetypes to exercise its variants; per-merge regression adds 25% to surface flakiness in the corpus, not the test.
Trajectory length per household. Carryforwards, RMD phase-ins, and IRMAA brackets all require multi-year coherence.
How many material life / financial events each household must exercise per year. Higher for compliance-touching features (AML, GLBA) where signal-density matters.
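For readers who want to check the archetype floor by hand, here is a minimal sketch of the arithmetic described above. It uses only the two figures the worksheet states (about 12 archetypes per edge-case class, plus 25% when regression runs per merge). The function name `archetype_floor`, the `regulator_multiplier` parameter, and the way the terms are multiplied together are assumptions for illustration, not the worksheet's actual formula.

```python
import math

# Figures taken from the worksheet text above.
ARCHETYPES_PER_CLASS = 12   # "~12 archetypes" per edge-case class
PER_MERGE_UPLIFT = 1.25     # "+25%" when regression runs on every merge


def archetype_floor(edge_case_classes: int,
                    per_merge_regression: bool = False,
                    regulator_multiplier: float = 1.0) -> int:
    """Hypothetical archetype floor.

    Assumption: the per-class base, the per-merge uplift, and the
    regulator multiplier compose by simple multiplication. The
    worksheet does not publish its exact formula.
    """
    if edge_case_classes < 0:
        raise ValueError("edge_case_classes must be non-negative")
    if regulator_multiplier < 1.0:
        raise ValueError("regulator_multiplier must be at least 1.0")

    base = edge_case_classes * ARCHETYPES_PER_CLASS
    uplift = PER_MERGE_UPLIFT if per_merge_regression else 1.0
    # Round up: a floor should never be understated by a fraction.
    return math.ceil(base * uplift * regulator_multiplier)


if __name__ == "__main__":
    # Four edge-case classes, per-merge regression, assumed 2x regulator scope.
    print(archetype_floor(4, per_merge_regression=True, regulator_multiplier=2.0))
```

Under these assumptions, four edge-case classes with per-merge regression and a 2× regulator multiplier give a floor of 120 archetypes. Treat the result as a rough cross-check on the computed fields, not a replacement for them.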
Next steps
Take the recommended bundle ID into the Wealth Data Sets catalog, or pair this worksheet with the Edge-Case Coverage Score assessment to pressure-test the sizing against your existing corpus.
Key takeaways
- Sizing isn't about household count — it's about how many distinct code paths each household exercises.
- Edge-case classes drive archetype count more than feature scope does. Two compliance-touching edge cases outweigh five well-trodden ones.
- Longitudinal depth is a separate axis. A wide corpus with 12-month trajectories misses every retirement / RMD / IRMAA scenario.
- Regulator scope multiplies the floor. The same feature shipped to a broker-dealer needs 2-3× the archetypes a no-regulator pilot does.
FAQ
Where does the multiplier come from?
Internal calibration against shipped wealth-tech corpora 2022-2025: the median feature with 8 code paths and 4 edge-case classes shipped with 80-150 archetypes when defensible, < 30 when not. The output is a floor, not a ceiling.
What if my feature isn't on the list?
Pick the closest category. The category affects the longitudinal-depth recommendation more than the archetype count, so getting it within one is fine.
How does this differ from the Maturity Assessment?
The Maturity Assessment scores an existing corpus. This worksheet sizes a new corpus before you have one. Use this first; re-run the Maturity Assessment after you've populated the corpus.