Why we generate 96 monthly longitudinal snapshots per household, not annual

60 months of history, 36 months of projection — generated in three chunks of 32 because anything longer hallucinated.

WealthSchema Staff · Pipeline architecture · May 7, 2026 · 4 min read

The v3 prototype shipped with one annual snapshot per household. The v4 corpus ships with 96 monthly snapshots: 60 months of history and 36 months of projection. That decision tripled our generation cost, exposed three classes of validation bug we hadn't seen before, and was the single largest change between the two versions. This article explains why we did it and what we learned.

Net worth: $1.18M · Net income / mo: $9,547 · Cash balance: $258K
Three of twenty monthly metrics tracked across the 96-snapshot trajectory of household A-01-seed-1 (Young Family — First Home), from Jan 2020 to Dec 2027. Solid line = historical (60 months); dashed = projected (36 months). The same shape exists for every household in the corpus, computed deterministically from the seeded sampler.

What "longitudinal" actually has to mean

The retirement-planning theme is the cleanest illustration of why annual snapshots fail. A retiree's RMD hits at year-end; their property tax bill hits in January; their estimated quarterly tax payments hit in April, June, September, and January. A household can be solvent on a yearly average and need a margin loan in February.

You cannot model that with a single annual figure. You also cannot derive it from one: the within-year cash-flow shape is not recoverable from the year-end balance. We learned this the hard way when a buyer used our v3 prototype's annual data to backtest a margin-call avoidance algorithm. The algorithm's output was useless because the input had no cash-flow seasonality.
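A minimal sketch of that point, using invented numbers (the income, expense, and lump-sum amounts below are illustrative assumptions, not corpus values). The household ends the year with more cash than it started with, yet its running balance goes negative in three separate months.

```python
# Illustrative only: every dollar amount here is made up for the example.
OPENING_CASH = 3_000
MONTHLY_INCOME = 6_500
MONTHLY_EXPENSES = 5_000

# Lumpy outflows by month number (1 = January): an assumed 4,500 property tax
# bill in January plus assumed 2,500 estimated tax payments in January,
# April, June, and September.
LUMPY_OUTFLOWS = {1: 4_500 + 2_500, 4: 2_500, 6: 2_500, 9: 2_500}


def monthly_balances(opening_cash, income, expenses, lumpy_outflows):
    """Return the end-of-month cash balance for each of 12 months."""
    balances = []
    cash = opening_cash
    for month in range(1, 13):
        cash += income - expenses - lumpy_outflows.get(month, 0)
        balances.append(cash)
    return balances


balances = monthly_balances(
    OPENING_CASH, MONTHLY_INCOME, MONTHLY_EXPENSES, LUMPY_OUTFLOWS
)
annual_net = balances[-1] - OPENING_CASH
shortfall_months = [m for m, b in enumerate(balances, start=1) if b < 0]

print("annual net cash flow:", annual_net)
print("months with negative cash:", shortfall_months)
```

An annual snapshot of this household would record only the positive year-end change. The months in which it would have needed short-term credit are visible only in the monthly series.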

So: monthly snapshots, every household, every bundle.

Why 96 specifically

Sixty months of history is the minimum window that lets you compute trailing 5-year statistics — Sharpe ratios, max drawdown, return correlations. Anything shorter and your buyers are stuck running 3-year statistics on a 5-year backtest, which is not a thing anyone wants to publish.
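As a rough illustration of what those trailing statistics look like when computed from 60 monthly values, here is a sketch using only the Python standard library. The function names, the synthetic series, and the zero risk-free rate are assumptions for the example. The Sharpe ratio uses the common convention of annualizing monthly figures by the square root of 12, which is not necessarily what any particular buyer uses.

```python
import math
import statistics


def monthly_returns(values):
    """Simple month-over-month returns from a series of monthly values."""
    return [values[i] / values[i - 1] - 1 for i in range(1, len(values))]


def max_drawdown(values):
    """Largest peak-to-trough decline, as a negative fraction (0.0 if none)."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1)
    return worst


def annualized_sharpe(returns, monthly_risk_free=0.0):
    """Mean excess monthly return over its standard deviation, times sqrt(12)."""
    excess = [r - monthly_risk_free for r in returns]
    spread = statistics.stdev(excess)
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return statistics.mean(excess) / spread * math.sqrt(12)


# Synthetic 60-month series: 1% monthly growth with two 10% losses mid-way.
values = [100_000.0]
for month in range(1, 60):
    monthly_return = -0.10 if month in (24, 25) else 0.01
    values.append(values[-1] * (1 + monthly_return))

returns = monthly_returns(values)
drawdown = max_drawdown(values)
sharpe = annualized_sharpe(returns)
print(f"{len(values)} values, {len(returns)} returns")
print(f"max drawdown: {drawdown:.2%}, annualized Sharpe: {sharpe:.2f}")
```

Note that 60 monthly values yield 59 monthly returns, so a consumer that needs exactly 60 returns would need one extra opening value.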

Thirty-six months of projection is the smallest window that captures the most common planning use case: "where will this household be in three years if nothing changes?" Buyers asked for five but every five-year projection we generated had errors that compounded past the validation gate. Three is honest; five is theater.

Formula: Longitudinal coverage
(60 history + 36 projection) months × ~1,400 households = 96 × 1,400 = 134,400 snapshots
  • 60 = months of history (anchors trailing 5-year statistics)
  • 36 = months of projection (matches the 3-year planning horizon buyers actually use)
  • 1,400 = households at the v4 baseline corpus size
At one snapshot per row, 134,400 rows is small enough to sit comfortably in a pandas DataFrame and large enough that its statistical properties are stable.
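The row count can be checked by enumerating one key per household per month. This sketch assumes every household shares the Jan 2020 to Dec 2027 window shown for the example household, and the household IDs and the "historical" / "projected" labels are hypothetical names chosen for illustration.

```python
from datetime import date

HISTORY_MONTHS = 60
PROJECTION_MONTHS = 36
HOUSEHOLDS = 1_400
START = date(2020, 1, 1)  # assumed common start month for every household


def month_start(start, offset):
    """First day of the month that is `offset` months after `start`."""
    total = start.year * 12 + (start.month - 1) + offset
    return date(total // 12, total % 12 + 1, 1)


def snapshot_keys(household_ids):
    """Yield (household_id, month, phase) for every snapshot in the corpus."""
    for household_id in household_ids:
        for i in range(HISTORY_MONTHS + PROJECTION_MONTHS):
            phase = "historical" if i < HISTORY_MONTHS else "projected"
            yield household_id, month_start(START, i), phase


# Hypothetical IDs; the real corpus uses its own naming scheme.
household_ids = [f"H-{n:04d}" for n in range(HOUSEHOLDS)]
rows = list(snapshot_keys(household_ids))

print(len(rows))          # total snapshot rows
print(rows[59], rows[60])  # last historical and first projected month
```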

The chunking decision

The first cut at v4 was a single LLM call generating all 96 months at once. That broke. Specifically:

  • The model produced 96 months of output reliably about 60% of the time.
  • The remaining 40% truncated somewhere between months 70 and 90.
  • A non-trivial fraction (we estimate 8%) of the "complete" outputs had silent fabrications past month 60: identical values for consecutive months, or values that drifted into impossible ranges (a 401(k) balance dropping by 90% with no event flag).
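The failure modes in the list above can be screened for mechanically. The following is a sketch under stated assumptions, not the pipeline's actual checker: the function names, the 90% threshold parameter, and the idea of one boolean event flag per month are all hypothetical. An identical consecutive value is only suspicious for fields expected to move every month, such as an invested balance.

```python
EXPECTED_MONTHS = 96


def is_truncated(balances, expected=EXPECTED_MONTHS):
    """True when the generated sequence stops before the expected length."""
    return len(balances) < expected


def find_suspect_months(balances, event_flags, max_drop=0.90):
    """Return (month_number, reason) pairs; month numbers are 1-indexed.

    balances:    one balance per month for a field expected to change monthly
    event_flags: one boolean per month, True if a life event explains a jump
    """
    suspects = []
    for i in range(1, len(balances)):
        previous, current = balances[i - 1], balances[i]
        month = i + 1
        if current == previous:
            suspects.append((month, "identical to previous month"))
        elif (
            previous > 0
            and (previous - current) / previous >= max_drop
            and not event_flags[i]
        ):
            suspects.append((month, "large drop with no event flag"))
    return suspects


# Toy input: month 3 repeats month 2, month 4 falls about 91% with no flag.
balances = [100_000, 101_000, 101_000, 9_000, 9_100]
no_events = [False] * len(balances)
suspects = find_suspect_months(balances, no_events)
print(is_truncated(balances), suspects)
```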

We tried prompt engineering. We tried chain-of-thought. We tried structured output schemas. The 60% reliable / 40% broken split persisted. The problem isn't the model — it is the asymmetry between the cost of producing a wrong long sequence and the cost of detecting one.

"We had two consecutive months with identical $42,180 mortgage payments, but the schema let through different rate adjustments. The model wrote out the payment correctly the first time and then literally copied the line. We caught it because the property tax in the same month also looked copy-pasted."
Pipeline reviewer · WealthSchema, internal QA · 2026-Q1 v4 review

The fix was three calls of 32 months each. The math:

Why 32 × 3 instead of 48 × 2 or 24 × 4

  • Each chunk fits in a single 4K-token output window with margin for retries
  • Three chunks give us natural validation seams at month 32 and month 64
  • 32-month chunks completed reliably in practice (>99% in our sample)
  • The seam discontinuities are detectable: validate that each chunk's ending balance equals the next chunk's starting balance (for example, month 32 ending balance equals month 33 starting balance)
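The chunk boundaries and the seams they create follow directly from the chunk size. Here is a small sketch with hypothetical helper names. It leaves out the generation call itself, since the article does not describe that interface.

```python
def chunk_ranges(total_months=96, chunk_size=32):
    """Split months 1..total_months into inclusive (first, last) ranges."""
    if total_months % chunk_size != 0:
        raise ValueError("chunk_size must divide total_months evenly")
    return [
        (first, first + chunk_size - 1)
        for first in range(1, total_months + 1, chunk_size)
    ]


def seam_pairs(ranges):
    """Return (last month of a chunk, first month of the next chunk) pairs."""
    return [
        (last, next_first)
        for (_, last), (next_first, _) in zip(ranges, ranges[1:])
    ]


for size in (48, 32, 24):
    ranges = chunk_ranges(96, size)
    print(size, ranges, seam_pairs(ranges))
```

With 32-month chunks the ranges are (1, 32), (33, 64), (65, 96), so there are two seams to validate. The 48 × 2 layout has one seam and the 24 × 4 layout has three.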

The cost went from one call per household to three. Generation spend went from ~$0.40 to ~$1.20 per household — material but not the deciding factor at corpus scale (~$1,500 total against a $4,500–$15,000 per-bundle price).

The validation gates we had to add

Monthly data exposed validation classes annual data never did. We added three:

  1. Gate 1
    Seam continuity
    End-of-month-32 balance equals start-of-month-33 balance ±$1, and likewise at the month 64/65 seam. Catches chunk-boundary fabrications introduced by the multi-call pipeline.
  2. Gate 2
    Within-month identity
    Net worth = assets − liabilities ±$100 every single month, not just at year-end. Catches silent drift the annual gate masked.
  3. Gate 3
    Cash-flow plausibility
    Monthly inflows minus outflows equal the change in cash position, within ±5%. Catches the 'magic money' bug where a retiree's checking account replenishes without a corresponding inflow line.

Each gate fires at validation time. A single failure on any month flips the entire household's validation_passed to false — no partial credit. The strict-fail policy was the v3 retrospective's biggest fix and we are not relaxing it for longitudinal data.
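A sketch of how the three gates and the strict-fail policy could fit together. This is an illustration under assumptions, not the production validator: the snapshot field names (`opening_cash`, `closing_cash`, `inflow`, `outflow`, `assets`, `liabilities`, `net_worth`) are hypothetical, the seam check is applied to cash balances, and the ±5% tolerance is interpreted as relative to the larger of the two compared amounts.

```python
import math

SEAMS = [(32, 33), (64, 65)]  # 1-indexed month pairs at the chunk boundaries


def gate_seam_continuity(months, seams=SEAMS, tol=1.0):
    """Gate 1: closing balance before a seam matches the opening one after it."""
    return all(
        abs(months[a - 1]["closing_cash"] - months[b - 1]["opening_cash"]) <= tol
        for a, b in seams
    )


def gate_net_worth_identity(months, tol=100.0):
    """Gate 2: net worth equals assets minus liabilities in every month."""
    return all(
        abs(m["net_worth"] - (m["assets"] - m["liabilities"])) <= tol
        for m in months
    )


def gate_cash_flow(months, rel_tol=0.05):
    """Gate 3: inflow minus outflow matches the change in cash position."""
    return all(
        math.isclose(
            m["inflow"] - m["outflow"],
            m["closing_cash"] - m["opening_cash"],
            rel_tol=rel_tol,
        )
        for m in months
    )


def validate_household(months):
    """Strict-fail: one failing gate on any month fails the whole household."""
    gates = {
        "seam_continuity": gate_seam_continuity(months),
        "within_month_identity": gate_net_worth_identity(months),
        "cash_flow_plausibility": gate_cash_flow(months),
    }
    return {"gates": gates, "validation_passed": all(gates.values())}


def make_months(n=96):
    """Build a self-consistent toy trajectory (all values invented)."""
    months, cash = [], 20_000.0
    for _ in range(n):
        inflow, outflow = 6_000.0, 5_000.0
        closing = cash + inflow - outflow
        assets, liabilities = closing + 500_000.0, 200_000.0
        months.append({
            "opening_cash": cash, "closing_cash": closing,
            "inflow": inflow, "outflow": outflow,
            "assets": assets, "liabilities": liabilities,
            "net_worth": assets - liabilities,
        })
        cash = closing
    return months


clean_result = validate_household(make_months())

# Inject "magic money": month 41 gains 2,000 with no matching inflow line.
broken = make_months()
broken[40]["closing_cash"] += 2_000
broken_result = validate_household(broken)

print(clean_result["validation_passed"], broken_result)
```

In the broken trajectory only one month out of 96 is wrong, and only one gate fails, but `validation_passed` is still false for the whole household.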

What we got out of it

Three things are now possible that weren't on the v3 prototype:

  1. Sequence-of-returns backtesting. Buyers can replay any 3-year window from the history against their drawdown algorithm and see real path-dependent outcomes — not the smoothed-average outputs annual data produces.
  2. Cash-management testing. Anyone building a margin-call, overdraft, or HELOC-trigger product can drive their engine off realistic monthly cash-flow seasonality.
  3. RMD interaction modeling. The interplay between RMDs (year-end), Social Security (monthly), property tax (Q1), and Roth conversion timing only shows up at monthly granularity. The retirement-income theme covers this in detail.
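The path dependence behind the first item can be shown with a toy replay. The helper names, starting balance, withdrawal amount, and return series below are invented for illustration, and the window count assumes 60 monthly returns are available. The same 36 returns are applied in two different orders while a fixed amount is withdrawn each month.

```python
def replay(start_balance, returns, monthly_withdrawal):
    """Apply each monthly return, then take the withdrawal, in sequence."""
    balance = start_balance
    for monthly_return in returns:
        balance = balance * (1 + monthly_return) - monthly_withdrawal
    return balance


def rolling_windows(series, window=36):
    """Every contiguous run of `window` months within the series."""
    return [series[i:i + window] for i in range(len(series) - window + 1)]


# Same 36 returns, two orderings: losses first versus losses last.
losses_first = [-0.05] * 6 + [0.02] * 30
losses_last = list(reversed(losses_first))

end_losses_first = replay(500_000, losses_first, 3_000)
end_losses_last = replay(500_000, losses_last, 3_000)

# With no withdrawals the ordering stops mattering (up to float rounding).
no_withdrawal_gap = abs(
    replay(500_000, losses_first, 0) - replay(500_000, losses_last, 0)
)

window_count = len(rolling_windows(list(range(60))))

print(round(end_losses_first), round(end_losses_last))
print(no_withdrawal_gap < 1e-3, window_count)
```

The average return is identical in both orderings, so a smoothed annual figure cannot tell them apart. The ending balances differ only because withdrawals interact with the order of returns, which is what a drawdown algorithm needs to see.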

What we deferred

Daily granularity was on the table briefly. The math didn't work — daily snapshots for 1,400 households over 96 months is 4 million rows per bundle, and the LLM generation cost would dominate the bundle price. More importantly, none of the buyer use cases we surveyed actually needed daily — every wealth-tech engine we talked to ran on monthly aggregation internally regardless of input granularity.
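The row count can be reproduced with calendar arithmetic. This sketch assumes the 96 months run from Jan 2020 through Dec 2027, as in the example household's trajectory, and that every household shares that window.

```python
from datetime import date

HOUSEHOLDS = 1_400
MONTHS = 96
FIRST_DAY = date(2020, 1, 1)   # assumed first day of the window
LAST_DAY = date(2027, 12, 31)  # assumed last day of the window

days = (LAST_DAY - FIRST_DAY).days + 1  # inclusive of both endpoints
daily_rows = days * HOUSEHOLDS
monthly_rows = MONTHS * HOUSEHOLDS

print(days, daily_rows, monthly_rows)
print(round(daily_rows / monthly_rows, 1))  # daily-to-monthly row ratio
```

Daily granularity would mean roughly 30 times as many rows to generate and validate for the same households.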

If a buyer ever shows up with a real daily-granularity use case (intraday trading backtests, perhaps), we will revisit. For everything else, monthly is the unit.

Key takeaways

  • Annual snapshots can't model within-year cash flow — and within-year cash flow is what most fintech engines actually depend on.
  • Single-call 96-month generation is unreliable; chunked 3×32 generation completes reliably. Seam discontinuities are detectable, and the reliability is worth the extra cost.
  • Three new validation gates (seam continuity, within-month identity, cash-flow plausibility) caught failure modes the annual pipeline silently passed.
  • 134,400 rows of properly validated monthly data is the smallest unit that supports real 5-year backtesting; anything less is theater.