Synthetic-Data Corpus Sizing Worksheet

Published May 10, 2026

The most expensive way to size a synthetic corpus is by guessing. Build too small and you ship bugs the corpus never reached; build too large and you waste compliance review hours on cases the feature doesn't exercise. This worksheet computes a defensible floor from four inputs you already know: the feature scope, the edge-case classes that matter, the regression-test cadence, and the regulator's scope. It outputs a recommended Wealth Data Set bundle and a one-paragraph rationale.

What you walk away with

~12 min · 4 sections · 9 fields
  • A computed minimum archetype count grounded in your feature's edge-case classes.
  • A computed longitudinal depth (months) that exercises every time-dependent code path your feature touches.
  • A computed event-density floor — events per household per year — sufficient for your monitoring and detection rules.
  • A recommended Wealth Data Set bundle plus a rationale you can paste into a steering-committee deck.

Feature scope

Capture what the feature does and which code paths the corpus must exercise.

The shipping name a PM or examiner would recognize.

What category of wealth-tech feature is this?

How many distinct branches does the feature take across its core flow? Count each materially different path.

Anything specific to this feature that affects sizing — e.g. multi-state, ISO/AMT interaction, beneficiary changes mid-year.

Edge-case classes

Each class adds archetype demand. Tick the ones the feature must defensibly handle.

Each selected class adds archetypes to the floor.

Count of edge-case classes selected above. Auto-fills.

Regression-test cadence and regulator scope

How frequently does the test suite run end-to-end?

Which regulator reads the result of this feature?

What's the minimum trajectory length the feature exercises?

Computed corpus floor

Live calculations from the inputs above. The recommended bundle is a starting point — refine after running the Edge-Case Coverage scorecard against the populated corpus.

Regulator-scope multiplier
2×

Higher when more regulators read the feature output — a multi-regulator feature needs a richer corpus floor.

Complexity score
28

Combines code-path count and edge-case classes into a single complexity index.

Minimum archetype count
200 archetypes

Floor for the corpus. Each edge-case class needs roughly 12 archetypes to exercise its variants; a per-merge regression cadence adds 25% to surface flakiness that originates in the corpus rather than in the tests.
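The archetype arithmetic described above can be sketched in Python. This is a minimal illustration, not the worksheet's actual calculation: the function name `archetype_floor`, its parameters, the rounding up, and the choice to apply the regulator-scope multiplier directly to the archetype count are all assumptions. It is not expected to reproduce the live value displayed above, which may also depend on other inputs such as the complexity score.

```python
import math

ARCHETYPES_PER_CLASS = 12   # "~12 archetypes" per edge-case class, from the text
PER_MERGE_UPLIFT = 0.25     # "+25%" for per-merge regression, from the text


def archetype_floor(edge_case_classes: int,
                    per_merge: bool,
                    regulator_multiplier: float = 1.0) -> int:
    """Hypothetical floor: classes x 12, +25% if per-merge, x regulator scope.

    How these factors are combined is an assumption made for illustration.
    """
    if edge_case_classes < 0 or regulator_multiplier <= 0:
        raise ValueError("classes must be >= 0 and multiplier must be > 0")
    base = edge_case_classes * ARCHETYPES_PER_CLASS
    if per_merge:
        base *= 1 + PER_MERGE_UPLIFT
    # Round up so the result stays a floor, never an undercount.
    return math.ceil(base * regulator_multiplier)


print(archetype_floor(4, per_merge=True, regulator_multiplier=2.0))
```

Under these assumptions, four edge-case classes on a per-merge cadence with a 2× multiplier give 4 × 12 × 1.25 × 2 = 120 archetypes.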

Minimum longitudinal depth
24 months

Trajectory length per household. Carryforwards, RMD phase-ins, and IRMAA brackets all require multi-year coherence.

Event-density floor
16 events/household/year

How many material life / financial events each household must exercise per year. Higher for compliance-touching features (AML, GLBA) where signal-density matters.

Next steps

Take the recommended bundle ID into the Wealth Data Sets catalog, or pair this worksheet with the Edge-Case Coverage Score assessment to pressure-test the sizing against your existing corpus.

Key takeaways

  • Sizing isn't about household count — it's about how many distinct code paths each household exercises.
  • Edge-case classes drive archetype count more than feature scope does. Two compliance-touching edge cases demand more archetypes than five well-trodden ones.
  • Longitudinal depth is a separate axis. A wide corpus with 12-month trajectories misses every retirement / RMD / IRMAA scenario.
  • Regulator scope multiplies the floor. The same feature shipped to a broker-dealer needs 2-3× the archetypes a no-regulator pilot does.

FAQ

Where does the multiplier come from?

Internal calibration against wealth-tech corpora shipped from 2022 to 2025: the median feature, with 8 code paths and 4 edge-case classes, shipped with 80-150 archetypes when the corpus was defensible and fewer than 30 when it was not. The output is a floor, not a ceiling.
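The reference figures in this answer can be turned into a quick sanity check. The sketch below is hypothetical: the function name `calibration_band` and the band labels are invented for illustration, and the comparison only makes sense for a feature close to the median one described here (8 code paths, 4 edge-case classes). The answer gives no figure for counts between 30 and 79, so that range is reported as unclassified rather than guessed.

```python
def calibration_band(archetypes: int) -> str:
    """Place an archetype count relative to the FAQ's reference figures.

    Band labels are illustrative assumptions, not worksheet output.
    """
    if archetypes < 0:
        raise ValueError("archetype count cannot be negative")
    if archetypes < 30:
        return "below-defensible"    # "< 30 when not"
    if archetypes < 80:
        return "unclassified"        # no figure given for 30-79
    if archetypes <= 150:
        return "defensible-range"    # "80-150 archetypes when defensible"
    return "above-range"             # the output is a floor, so this is allowed


print(calibration_band(120))
```

A count that lands above the range is not an error, since the worksheet output is a floor rather than a ceiling.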

What if my feature isn't on the list?

Pick the closest category. The category affects the longitudinal-depth recommendation more than the archetype count, so a near match is fine.

How does this differ from the Maturity Assessment?

The Maturity Assessment scores an existing corpus. This worksheet sizes a new corpus before you have one. Use this first; re-run the Maturity Assessment after you've populated the corpus.