
Monte Carlo for retirement — where the standard libraries break

Lognormal returns, fixed correlations, constant withdrawals, no regime switching. Each assumption is benign in a 30-day option model; in a 30-year retirement model it is both load-bearing and wrong. The standard libraries default to all four.

WealthSchema Staff | Pipeline architecture | May 8, 2026 | 6 min read

Monte Carlo has been the standard answer to "project a 30-year retirement plan" for thirty years. The libraries — numpy.random.normal, the GBM simulators in every quant package — make it cheap. Ten thousand paths, a 90% success probability, a clean fan chart of percentile bands. The output looks rigorous.

Retirement Monte Carlo as routinely implemented produces overconfident plans. Four assumption failures drive most of the overconfidence; the standard libraries do not flag any of them. Below: which assumptions fail, by how much, and what a production simulator has to do instead. Related: robo-advisor synthetic data requirements.

What retirement Monte Carlo is supposed to model

The use case is straightforward in concept. Given a starting balance, a withdrawal plan, an asset allocation, and a return distribution, project the balance forward 30+ years and report the probability that the household runs out of money. The output is consumed by financial planners, robo-advisors, and increasingly by AI agents that deliver planning advice directly to consumers. Related: high-net-worth family office.
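The textbook version of that projection can be sketched in a few lines. This is a minimal illustration rather than a recommended implementation: the function name, the parameter values, and the choice to work in real (inflation-adjusted) terms are all assumptions made for the example, and it deliberately uses the IID-normal returns and constant withdrawals that the rest of this article criticizes.

```python
import random

def simulate_success_probability(
    start_balance: float,
    annual_withdrawal: float,
    mean_return: float,
    volatility: float,
    years: int = 30,
    n_paths: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of simulated paths in which every withdrawal can be funded."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_paths):
        balance = start_balance
        survived = True
        for _ in range(years):
            balance -= annual_withdrawal  # constant real withdrawal at start of year
            if balance < 0:
                survived = False
                break
            # IID normal real return: the default this article argues against
            balance *= 1.0 + rng.gauss(mean_return, volatility)
        successes += survived
    return successes / n_paths

# Illustrative inputs only: 1,000,000 balance, 40,000 per year, 5% real return, 12% vol.
p = simulate_success_probability(1_000_000, 40_000, mean_return=0.05, volatility=0.12)
print(f"probability of success: {p:.1%}")
```

Everything downstream in a planning product hangs off that single number, which is why the assumptions baked into the inner loop matter so much.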

The model has high stakes. A 95% success probability that should have been 75% sells households a retirement they can't afford. A 75% probability that should have been 90% pushes households into unnecessary frugality during years they could have enjoyed. The error bars on the model output have direct welfare consequences for real people, and the model is the load-bearing piece of every retirement-planning product on the market. Companion pieces: CRT CLAT estate planning models, AG 49-A IUL illustration modeling, 1031 exchange tax planning, and insurance illustration software testing.

Failure 1: independent normal returns

The default assumption in numpy.random.normal and its cousins is independent and identically distributed (IID) returns drawn from a normal distribution. Real equity returns are neither independent nor normal.

Equity returns exhibit volatility clustering: high-volatility periods are followed by high-volatility periods, low by low. The unconditional distribution looks roughly normal; the conditional distribution given recent realized volatility is much wider. A simulator that ignores volatility clustering systematically underestimates the probability of multiple bad years in a row — and multiple bad years in a row is exactly the failure mode that breaks retirement plans.
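One common way to see volatility clustering is to compare the lag-1 autocorrelation of squared returns for an IID series against a series with persistent variance. The sketch below uses a GARCH(1,1) recursion as a stand-in for clustering; the article does not prescribe GARCH, and the `alpha`, `beta`, and volatility values are illustrative assumptions, not calibrated estimates.

```python
import random

def iid_returns(n, sigma, rng):
    """Independent normal returns with constant volatility."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def garch_returns(n, sigma, rng, alpha=0.10, beta=0.85):
    """GARCH(1,1): var_t = omega + alpha * r_{t-1}^2 + beta * var_{t-1}."""
    long_run_var = sigma ** 2
    omega = long_run_var * (1.0 - alpha - beta)  # keeps unconditional variance at sigma^2
    var = long_run_var
    out = []
    for _ in range(n):
        r = rng.gauss(0.0, 1.0) * var ** 0.5
        out.append(r)
        var = omega + alpha * r * r + beta * var
    return out

def lag1_autocorr_of_squares(returns):
    """Lag-1 autocorrelation of squared returns, a standard clustering diagnostic."""
    sq = [r * r for r in returns]
    mean = sum(sq) / len(sq)
    num = sum((a - mean) * (b - mean) for a, b in zip(sq, sq[1:]))
    den = sum((a - mean) ** 2 for a in sq)
    return num / den

rng = random.Random(42)
n_months = 20_000
iid_acf = lag1_autocorr_of_squares(iid_returns(n_months, 0.04, rng))
garch_acf = lag1_autocorr_of_squares(garch_returns(n_months, 0.04, rng))
print(f"IID lag-1 autocorrelation of squared returns:   {iid_acf:+.3f}")
print(f"GARCH lag-1 autocorrelation of squared returns: {garch_acf:+.3f}")
```

Both series have the same unconditional volatility by construction; only the clustered one makes a turbulent month more likely to follow a turbulent month, which is the property that strings bad periods together.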

Real equity returns also have fat tails. The empirical distribution of monthly S&P 500 returns over the last 80 years has kurtosis around 8–10 versus 3 for a normal distribution. Five-sigma events occur roughly every 18 months. The 1987 crash, the 2008 financial crisis, the 2020 COVID crash, and the 2022 tech selloff are all real events to which the IID-normal model would assign a probability of essentially zero.
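To put a number on "essentially zero", the normal tail probability can be computed directly. The short calculation below is a worked example, assuming one observation per month; the function name is illustrative.

```python
import math

def two_sided_tail_prob(k_sigma: float) -> float:
    """P(|Z| > k_sigma) for a standard normal Z."""
    return math.erfc(k_sigma / math.sqrt(2.0))

p5 = two_sided_tail_prob(5.0)
months_between = 1.0 / p5            # expected months between five-sigma months
years_between = months_between / 12.0

print(f"P(|Z| > 5) = {p5:.2e}")
print(f"expected wait under IID-normal: about {years_between:,.0f} years")
```

Under a normal model with monthly observations, a five-sigma month is expected once in well over 100,000 years, so any model that is supposed to reproduce the historical record needs a heavier tail than the Gaussian provides.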

Failure 2: stationary distributions

The standard simulator assumes return distributions are stationary: the parameters of the return process do not change over the 30-year projection horizon. This is wildly inconsistent with both economic theory and historical experience. Inflation regimes change. Interest-rate regimes change. Equity-risk-premium regimes change. The 1970s were not the 1990s and were not the 2010s. See estate planning 2026 sunset modeling for a related regime-shift problem.

A stationary simulator picking parameters from one regime will project that regime forward for 30 years. If you calibrated to 2009–2021 (a low-inflation, low-rate, high-equity-return regime), you produced retirement plans that look spectacular and break in any other regime. If you calibrated to 1970–1985 (a high-inflation, high-rate, mediocre-equity regime), you produced retirement plans that are absurdly conservative for any subsequent regime.

The fix is regime-switching simulation: sample regime states (low/high inflation, low/high rates, low/high equity premium) from a Markov process whose transition matrix is calibrated to historical regime persistence. Within each regime, sample returns from the regime-conditional distribution. The simulator output is no longer a single fan chart — it is a fan chart over regime paths, which is dramatically wider than the stationary version.
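A stripped-down version of that sampling loop might look like the following. It is a sketch under stated assumptions: only two regimes, hypothetical names (`calm`, `stress`), made-up means, volatilities, and transition probabilities, and Gaussian within-regime draws for brevity where a production model would use a fat-tailed distribution.

```python
import random

# Hypothetical two-regime setup; every number here is illustrative, not calibrated.
REGIMES = {
    "calm":   {"mean": 0.08, "vol": 0.10},
    "stress": {"mean": -0.02, "vol": 0.25},
}
# TRANSITION[current][next] = annual probability of moving from current to next.
TRANSITION = {
    "calm":   {"calm": 0.90, "stress": 0.10},
    "stress": {"calm": 0.30, "stress": 0.70},
}

def simulate_regime_path(years, rng, start="calm"):
    """Return a list of (regime, annual_return) pairs for one simulated path."""
    regime = start
    path = []
    for _ in range(years):
        params = REGIMES[regime]
        # Regime-conditional return draw (Gaussian here only to keep the sketch short).
        path.append((regime, rng.gauss(params["mean"], params["vol"])))
        # Draw next year's regime from the current regime's transition row.
        row = TRANSITION[regime]
        regime = rng.choices(list(row.keys()), weights=list(row.values()))[0]
    return path

rng = random.Random(7)
path = simulate_regime_path(30, rng)
stress_years = sum(1 for regime, _ in path if regime == "stress")
print(f"{stress_years} of {len(path)} years spent in the stress regime")
```

Because regimes persist, individual paths can spend a decade in the stress state or almost none of it, and that path-to-path variation is what widens the fan chart.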

| | Stationary IID-normal | Regime-switching with fat tails |
| --- | --- | --- |
| Width of 30-year terminal distribution | Narrow; confidence is fictional | Wide; confidence reflects real ambiguity |
| Sequence-of-returns sensitivity | Underweights early bad years | Captures the real impact of early-retirement crashes |
| Inflation modeling | Ignored or constant | Co-varies with rate and equity regimes |
| Computational cost | Trivial | 10–50× more expensive but fits in a single API call |
| Validation difficulty | Low; math is closed-form | Higher; requires regime-classification validation |

Failure 3: clean withdrawals

The standard Monte Carlo treats withdrawals as a constant inflation-adjusted amount drawn at the start of each year. Real retirement withdrawals are nothing like this.

Real retirees have lumpy expense profiles: property tax in Q1; estimated tax payments in April, June, September, and January; RMDs forced at year-end; healthcare bills concentrated in late life; long-term-care expenses concentrated in the last 1–3 years. Real retirees also adjust withdrawals to market conditions through guard-rails strategies, dynamic spending rules, and the well-documented behavioral pattern of households reflexively cutting spending after a market drop. A simulator that ignores within-year timing produces plans that look fine on paper and fail under within-year cash crunches; a simulator that ignores adaptive spending overstates portfolio depletion in bad scenarios. Companion piece: annuity modeling synthetic data.

Formula: realistic withdrawal model

W_t = base_t × inflation_adj_t × guard_rail(portfolio_t / target_t) + lumpy_t + tax_t

  • base_t = Initial withdrawal rate × starting balance (the 4%-rule baseline)
  • inflation_adj_t = Cumulative inflation since retirement, regime-conditional
  • guard_rail = Spending adjustment based on portfolio-vs-target ratio (0.85–1.15 typical band)
  • lumpy_t = Property tax, healthcare deductible, LTC expense, concentrated by month
  • tax_t = Federal + state taxes on RMDs, conversions, withdrawals; varies by year

Each component can be calibrated independently, and each matters in production. Skipping the guard-rail term overstates plan failure rates by 10–20%; skipping the lumpy term hides within-year cash-crunch failures entirely.
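The withdrawal formula translates almost directly into code. In the sketch below, the shape of `guard_rail` is an assumption: the formula only specifies a 0.85–1.15 band, so this version simply clamps the portfolio-to-target ratio to that band. The dollar figures are illustrative.

```python
def guard_rail(funding_ratio: float, lower: float = 0.85, upper: float = 1.15) -> float:
    """Spending multiplier: the portfolio/target ratio clamped to [lower, upper].

    Assumed functional form; real guard-rail rules are often stepwise.
    """
    return max(lower, min(upper, funding_ratio))

def withdrawal(base, inflation_adj, portfolio, target, lumpy=0.0, tax=0.0):
    """W_t = base_t * inflation_adj_t * guard_rail(portfolio_t / target_t) + lumpy_t + tax_t"""
    return base * inflation_adj * guard_rail(portfolio / target) + lumpy + tax

# A year in which the portfolio sits 20% below target after 10% cumulative inflation.
w = withdrawal(
    base=40_000,          # 4% of a 1,000,000 starting balance
    inflation_adj=1.10,
    portfolio=800_000,
    target=1_000_000,
    lumpy=6_000,          # e.g. a property tax bill
    tax=5_000,
)
print(f"withdrawal this year: {w:,.0f}")  # about 48,400
```

The guard rail trims discretionary spending to 85% of its inflation-adjusted baseline, but the lumpy and tax terms are added in full, which is why bad-market years can still carry large withdrawals.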

Failure 4: probability of success as the wrong metric

The headline output of retirement Monte Carlo is usually "probability of success" — the percentage of paths in which the household never runs out of money. This is the wrong metric for almost every real decision.

A 90% success probability means 10% of paths end in failure, but it does not say how they fail. A path that runs out at age 92 is a different failure than one that runs out at age 75; a path where the household ate cat food for the last decade is a different failure than one where they spent down to a normal level and then needed minor lifestyle adjustments. The probability metric collapses all these into a single binary.

The metrics that actually drive sensible decisions are conditional ones: median real consumption in the worst-decile path; probability of a >20% real spending cut at any age; expected age at portfolio depletion conditional on depletion occurring. These metrics are not harder to compute — they are computed from the same simulation paths — but they are not the default in standard libraries, so most products don't show them.
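As a rough illustration, these conditional metrics can be read off the same per-path records that produce the headline number. The `PathResult` structure and the function names are hypothetical, and "worst-decile" is interpreted here as the path ranked at the 10th percentile by median real spending, which is one of several reasonable readings.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import List, Optional

@dataclass
class PathResult:
    real_spending: List[float]      # real (inflation-adjusted) spending per year
    depletion_age: Optional[int]    # None if the portfolio never ran out

def probability_of_success(paths):
    return sum(p.depletion_age is None for p in paths) / len(paths)

def expected_depletion_age(paths):
    """Mean age at depletion, conditional on depletion occurring."""
    ages = [p.depletion_age for p in paths if p.depletion_age is not None]
    return mean(ages) if ages else None

def probability_of_large_cut(paths, cut=0.20):
    """Share of paths where real spending ever falls more than `cut` below year one."""
    hits = sum(min(p.real_spending) < (1.0 - cut) * p.real_spending[0] for p in paths)
    return hits / len(paths)

def worst_decile_median_consumption(paths):
    """Median real spending of the path at the 10th percentile of per-path medians."""
    medians = sorted(median(p.real_spending) for p in paths)
    return medians[int(0.10 * (len(medians) - 1))]

# Tiny hand-made example: eight steady paths and two that fail in different ways.
paths = [PathResult([40.0, 40.0, 40.0], None) for _ in range(8)]
paths.append(PathResult([40.0, 30.0, 30.0], 88))
paths.append(PathResult([40.0, 38.0, 20.0], 92))

print(probability_of_success(paths))           # 0.8
print(expected_depletion_age(paths))           # 90
print(probability_of_large_cut(paths))         # 0.2
print(worst_decile_median_consumption(paths))  # 30.0
```

The two failing paths count identically toward the 80% success figure, yet they deplete at different ages and involve different spending cuts, which the conditional metrics make visible.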

What a production-grade simulator does

A production-grade retirement Monte Carlo, in our view as of 2026, has the following non-negotiable properties:

  1. Regime-switching return model. Markov-chain regime states with transition matrix calibrated to long-horizon historical regime persistence. At minimum, low-vol vs. high-vol equity regime + low-rate vs. high-rate regime.
  2. Fat-tailed conditional distributions. Within-regime returns from a Student-t or skew-t distribution, not Gaussian. Tail parameters calibrated to historical empirical kurtosis.
  3. Co-varying inflation. Inflation drawn jointly with returns and rates, not independently. The 1970s pattern (high inflation + mediocre real equity returns) must be reachable in simulation.
  4. Adaptive withdrawal model. Guard-rail spending rule + lumpy expenses + explicit tax modeling. Pure-flat withdrawals are for spec demos, not for real product output.
  5. Decision-relevant output. Conditional metrics on top of probability of success: worst-decile real consumption, probability of large real cuts, expected age at depletion. The dashboard tells the planner what they need to know.
  6. Sensitivity transparency. User-controllable inputs explicitly call out their effect on the output. A 10 bp change in equity premium assumption shifts success probability by N points; the user sees this.
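The fat-tailed draw in the list above (Student-t rather than Gaussian) can be sketched with the standard library alone. The helper names, the choice of 5 degrees of freedom, and the 6% mean and 16% volatility are assumptions for illustration; a real model would calibrate the tail parameter to data.

```python
import math
import random

def student_t(df: float, rng: random.Random) -> float:
    """One Student-t draw: Z / sqrt(chi2 / df), with chi2 ~ Gamma(shape=df/2, scale=2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(chi2 / df)

def fat_tailed_return(mean, vol, df, rng):
    """Return with the requested mean and volatility but Student-t tails (needs df > 2)."""
    # A raw t variate has variance df / (df - 2); rescale it to unit variance first.
    unit_variance_t = student_t(df, rng) * math.sqrt((df - 2.0) / df)
    return mean + vol * unit_variance_t

rng = random.Random(3)
draws = [fat_tailed_return(0.06, 0.16, df=5, rng=rng) for _ in range(50_000)]
tail_events = sum(abs(x - 0.06) > 4 * 0.16 for x in draws)
print(f"moves beyond 4 sigma in 50,000 draws: {tail_events}")
```

A Gaussian with the same volatility would be expected to produce roughly 3 such moves in 50,000 draws, so the count printed here gives a quick feel for how much probability mass the t distribution shifts into the tails.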

The cost of doing this right is roughly an order of magnitude more engineering than the IID-normal version — but the IID-normal version is producing wrong answers. The right comparison is not "simple vs complex" but "calibrated vs uncalibrated."

Why this connects to test data

A wealth-tech product running a retirement Monte Carlo needs test data that exercises the simulator's edge cases. A test corpus of mid-career households who all retire at 65 with normal asset allocations will pass any simulator's smoke test. The simulator that breaks in production is the one tested only against this corpus. See longitudinal synthetic financial data design for the time-series structure that exposes these bugs.

The corpus that catches simulator bugs has explicit edge cases: households retiring into a crash year, households with concentrated equity positions, households with significant Roth conversion windows, households with lumpy late-life expenses, households facing IRMAA bracket transitions during their projection. Add HSA-as-stealth-retirement households to that list — see HSA investment and triple-tax-advantage modeling for the account class most simulators routinely fold into "checking" and miss. A simulator validated only on a 65-year-old retiring with 60/40 in two accounts has approximately zero coverage of the cases that actually break in production; the corpus has to carry the regime states, the withdrawal rules, and the edge demographics that exercise every branch the simulator can take.

Key takeaways

  • Standard Monte Carlo libraries assume IID-normal returns. Real returns are neither independent nor normal — fat tails and volatility clustering matter for retirement projection.
  • Stationary distributions assume the next 30 years look like the last 12. They will not. Regime-switching is mandatory for any honest 30-year projection.
  • Withdrawals are lumpy, tax-aware, and adaptive. Constant inflation-adjusted withdrawals are a simulation convenience, not a model of real households.
  • Probability of success is the wrong headline metric. Conditional metrics — worst-decile real consumption, probability of large real cuts — drive better decisions.
  • A production-grade simulator costs roughly 10× the engineering of the textbook version. The textbook version is producing wrong answers; the cost is real and worth it.

Frequently asked questions

Doesn't bootstrapping historical returns avoid the fat-tail problem?
Partially. Bootstrap simulation samples from the empirical historical distribution and inherits its fat tails by construction. It does not, however, capture regime persistence within draws — pure bootstrap with replacement scrambles the regime structure. Block bootstrap (sampling consecutive 12- or 24-month windows) preserves some of the structure but at the cost of paths that are concatenations of historical regimes rather than coherent regime-switching paths. We've found regime-switching parametric simulation more transparent and easier to validate, but bootstrap remains a defensible alternative.
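For reference, a block bootstrap of the kind described here fits in a few lines. The function name is illustrative, and the `history` series below is a randomly generated stand-in for real monthly returns, not actual data.

```python
import random

def block_bootstrap(history, n_periods, block_len, rng):
    """Build one path by concatenating randomly chosen consecutive blocks of history."""
    if block_len > len(history):
        raise ValueError("block_len cannot exceed the length of history")
    path = []
    while len(path) < n_periods:
        start = rng.randrange(len(history) - block_len + 1)
        path.extend(history[start:start + block_len])
    return path[:n_periods]

rng = random.Random(0)
# Placeholder for 80 years of monthly returns; substitute a real series in practice.
history = [rng.gauss(0.006, 0.04) for _ in range(960)]
path = block_bootstrap(history, n_periods=360, block_len=12, rng=rng)
print(len(path))  # 360
```

Within each 12-month block the historical ordering survives, but the seams between blocks are arbitrary, which is the "concatenation of historical regimes" limitation noted above.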
How important is correlation between asset classes?
Critically important and routinely under-modeled. Equity-bond correlations are not constant — they are negative in some regimes (the 'flight to quality' regime) and positive in others (the 'simultaneous selloff' regime, as in 2022). A simulator that uses a constant correlation matrix will produce diversification benefits in scenarios where they evaporate in real life. Regime-switching covariance matrices are an extension of the regime-switching return models and should be tied to the same regime states.
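A two-asset sketch shows how the correlation can be tied to the regime state. The regime names follow the answer above, but the correlation values and volatilities are illustrative assumptions, and returns are zero-mean Gaussian purely to keep the example short.

```python
import math
import random

# Hypothetical regime-specific equity/bond correlations; illustrative, not calibrated.
CORRELATION_BY_REGIME = {"flight_to_quality": -0.4, "simultaneous_selloff": 0.6}

def correlated_returns(regime, equity_vol, bond_vol, rng):
    """Draw zero-mean (equity, bond) returns with the regime's correlation.

    Uses the 2x2 Cholesky factor: bond shock = rho * z1 + sqrt(1 - rho^2) * z2.
    """
    rho = CORRELATION_BY_REGIME[regime]
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    equity = equity_vol * z1
    bond = bond_vol * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return equity, bond

def sample_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(11)
for regime in CORRELATION_BY_REGIME:
    pairs = [correlated_returns(regime, 0.16, 0.06, rng) for _ in range(20_000)]
    equity, bond = zip(*pairs)
    print(regime, round(sample_correlation(equity, bond), 2))
```

In a full simulator the `regime` argument would come from the same Markov chain that drives the return means and volatilities, so diversification weakens in exactly the paths where equities are already under stress.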
What about Social Security and annuity income: are those handled the same way?
No, and they shouldn't be. Social Security and lifetime annuities are deterministic real cash-flow streams (subject to cost-of-living adjustments and the small risk of legislative reform). Modeling them as draws from a return distribution is wrong. The right approach is to subtract guaranteed real income from gross required spending, run the Monte Carlo only on the residual, and report results in terms of portfolio coverage of the residual. See [SECURE 2.0 RMD engineering](/articles/rmd-secure-act-2-engineering-rebuild) and [Social Security claiming optimization](/articles/social-security-claiming-optimization) for the underlying mechanics.
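The residual calculation itself is simple arithmetic. In this sketch the function name and the dollar amounts are illustrative, and flooring the residual at zero (so surplus guaranteed income is not reinvested) is a simplifying assumption.

```python
def residual_spending(gross_spending, guaranteed_income):
    """Real spending the portfolio must cover each year, floored at zero."""
    return [max(0.0, s - g) for s, g in zip(gross_spending, guaranteed_income)]

gross = [80_000.0] * 5
social_security = [0.0, 0.0, 30_000.0, 30_000.0, 30_000.0]  # benefits start in year 3
annuity = [12_000.0] * 5
guaranteed = [ss + a for ss, a in zip(social_security, annuity)]

residual = residual_spending(gross, guaranteed)
print(residual)  # [68000.0, 68000.0, 38000.0, 38000.0, 38000.0]
```

Only the `residual` series is handed to the Monte Carlo engine as the withdrawal requirement; the guaranteed streams never pass through the return distribution.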
How do we test our retirement Monte Carlo against synthetic data?
Three test patterns: (1) calibration test — generate synthetic households with known retirement outcomes and verify the simulator reproduces the outcome distribution; (2) edge-case test — synthetic households at retirement timing edges (just-retired-into-crash, late retiree with depleted balance, early retiree with high equity), verify simulator handles each; (3) regression test — historical-vintage simulations against actual outcomes for households retiring in each decade since 1950. Each pattern needs different synthetic data; the test corpus should support all three.

For the methodology of the regime-switching returns generation referenced above, see Generating synthetic historical returns: random walk, regime-based, replay — and the methodology comparison Synthetic time series vs. historical replay for when each approach belongs in a Monte Carlo workflow. The umbrella view of the time-series concerns this article touches is at Time-Series Fidelity in Synthetic Wealth Data.