Building a robo-advisor on synthetic households — what your test corpus has to do

WealthSchema Staff · Pipeline architecture · May 9, 2026 · 5 min read

A robo-advisor looks, from the outside, like a single product: deposit money, get an asset allocation, watch the dashboard. Inside, it is six entangled modeling problems (onboarding suitability scoring, asset allocation, lot-level tax-loss harvesting, retirement projection, withdrawal sequencing, behavioral-finance nudging), each with its own correctness criteria. The robo-advisors that ship correct answers on all six are the ones whose test corpus exercises all six. The robo-advisors that ship surprises in production are usually the ones whose corpus only exercised the easy three. See also: edge cases at every constraint-binding scenario.

This article is the working note we hand fintech engineering teams building robo-advisors. It covers the six modeling problems, the synthetic-data shape each one needs, and the validation gates that catch the bugs customers would otherwise find.

What a robo-advisor actually has to model

The minimum-viable robo-advisor, the kind that is registered with the SEC, meets its Reg BI suitability obligations, holds custody, and can take real customers, has six modeling problems. Each is independently solvable; the engineering complexity comes from making them work together.

Module	Common bug class	Synthetic data needed
Onboarding suitability	Scoring drift across edge-case demographics; Reg BI documentation gaps	Demographic spread, household-composition variations
Asset allocation	Drift detection; rebalancing thresholds; tax-aware vs naive choice	Multi-account households with realistic balance ratios
Tax-loss harvesting	Lot-level wash-sale across accounts; specific-ID vs FIFO; QSBS handling	Lot-level resolution with 30–300 lots/position
Retirement projection	Stationary IID-normal returns; missed sequence risk; withdrawal seasonality	Monthly longitudinal with 96 snapshots per household
Withdrawal sequencing	Bracket-fill recommendations that miss IRMAA / NIIT / ACA constraints	Constraint-binding edge cases at every threshold
Behavioral nudging	Recommendations that ignore the household's actual cash position	Within-month cash-flow data, not just monthly aggregates

Module 1: onboarding suitability

The first thing a robo does at signup is collect the inputs that drive a Reg BI suitability assessment: investment objectives, risk tolerance, time horizon, financial situation, tax status. The output is a recommended account structure (taxable / IRA / Roth / 529) and a target asset allocation.

The bugs ship in the edge cases. A 28-year-old with a $40K salary and a $2M inheritance is plausible but rare; a 65-year-old planning to retire in 6 months but still rolling over a 401(k) is a case the onboarding flow has to handle correctly, because the marketing funnel sends exactly that customer to it. Test corpora that don't include these edge cases produce engines that throw "out of risk band" exceptions at real customers.
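One way to pin these cases down is a small harness that runs the suitability scorer over named edge-case applicants and records any that raise. The sketch below is illustrative only: `Applicant`, `toy_risk_band`, and the band names are hypothetical stand-ins, and the retiree's salary and asset figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Applicant:
    age: int
    salary: float
    investable_assets: float
    years_to_retirement: float


def toy_risk_band(a: Applicant) -> str:
    """Stand-in scorer (hypothetical); a real engine replaces this function."""
    if a.years_to_retirement < 2:
        return "conservative"
    if a.investable_assets > 20 * a.salary:
        # Wealth far above income: do not infer risk capacity from salary alone.
        return "moderate"
    return "aggressive" if a.age < 40 else "moderate"


# The two edge cases named in the text; unspecified figures are illustrative.
EDGE_CASES = {
    "young_inheritor": Applicant(28, 40_000, 2_000_000, 37),
    "imminent_retiree_rollover": Applicant(65, 90_000, 600_000, 0.5),
}


def run_onboarding_cases(scorer, cases):
    """Return {case name: band or error string}; the scorer should never raise."""
    results = {}
    for name, applicant in cases.items():
        try:
            results[name] = scorer(applicant)
        except Exception as exc:  # record the failure instead of aborting the run
            results[name] = f"ERROR: {exc}"
    return results
```

Calling `run_onboarding_cases(toy_risk_band, EDGE_CASES)` gives one entry per case, so a regression suite can assert that no value starts with `ERROR`.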

Module 2: asset allocation

Asset-allocation engines look simple on paper — risk profile in, allocation out — and produce real bugs at scale. The bug classes:

Drift detection. Once the household holds the allocation, it drifts as markets move. The engine has to detect when drift exceeds threshold and rebalance. Engines with poor synthetic test data routinely fail to detect drift in unusual market regimes.
Tax-aware rebalancing vs naive. A naive engine sells whatever's overweight; a tax-aware engine sells the lots with the smallest tax impact, prefers losses over gains, and respects long-term holding-period thresholds. See tax-aware portfolio rebalancer for the lot-selection logic. The two engines produce dramatically different after-tax outcomes.
Multi-account allocation. A household has $300K split across taxable, Roth, and traditional IRA. The engine has to allocate by household totals while placing assets in the most tax-efficient account class — bonds in tax-deferred, equities in taxable, etc. The asset-location decision is the cleanest robo win, and it's only solvable with multi-account data.

The synthetic data this needs: multi-account households with realistic balance ratios across taxable / tax-deferred / Roth, lot-level resolution within each account, and explicit current-value vs cost-basis tracking.
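As a minimal sketch of household-level drift detection, the function below aggregates holdings across accounts before comparing weights to the target. The account names, asset-class labels, and the 5-point threshold are assumptions chosen for illustration, not values from any particular engine.

```python
def household_totals(accounts):
    """Sum each asset class across all accounts in the household."""
    totals = {}
    for holdings in accounts.values():
        for asset, value in holdings.items():
            totals[asset] = totals.get(asset, 0.0) + value
    return totals


def drifted_assets(accounts, target, threshold=0.05):
    """Return {asset: drift} where |current weight - target weight| > threshold."""
    totals = household_totals(accounts)
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("household has no invested balance")
    drifts = {}
    for asset in set(totals) | set(target):
        drift = totals.get(asset, 0.0) / grand_total - target.get(asset, 0.0)
        if abs(drift) > threshold:
            drifts[asset] = drift
    return drifts


# A $300K household split across three account types (illustrative balances).
accounts = {
    "taxable": {"equity": 150_000.0, "bond": 10_000.0},
    "traditional_ira": {"bond": 80_000.0},
    "roth": {"equity": 60_000.0},
}
flagged = drifted_assets(accounts, {"equity": 0.60, "bond": 0.40})
```

Measuring drift on household totals rather than per account is the design choice that matters here: an account that is 100% bonds by design (asset location) should not trigger a rebalance on its own.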

Module 3: tax-loss harvesting

TLH is one of the highest-value robo features and one of the most error-prone in production. The bugs we see most often:

TLH bug inventory

Cross-account wash-sale: harvesting a loss in taxable while buying a substantially-identical security in IRA the next day. The IRS rule is taxpayer-wide, not account-local.
Specific-identification not honored: the engine sells lots in FIFO order despite specific-ID being more tax-efficient.
Holding-period miscalculation: long-term threshold is one year + one day; engines that get this wrong by a day produce short-term gains the customer didn't expect.
QSBS lots accidentally harvested: Section 1202 stock has a 5-year hold to qualify; harvesting it for a small loss can vaporize a much larger future tax exclusion.
Rebalancing-induced wash sales: rebalancing during the 30-day window can trigger wash-sale on a recent harvest, undoing the tax benefit silently.
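The holding-period item is easy to get wrong by exactly one day, so it is worth a boundary test. The sketch below assumes the simple reading stated above (a lot is long-term only when sold strictly after the one-year anniversary of acquisition); the function names are hypothetical, and month-end or leap-day acquisitions may need special handling that this sketch does not model.

```python
from datetime import date


def one_year_anniversary(acquired: date) -> date:
    """Same calendar date one year later (Feb 29 falls back to Feb 28)."""
    try:
        return acquired.replace(year=acquired.year + 1)
    except ValueError:
        # Only reachable for a Feb 29 acquisition; simplification, see lead-in.
        return acquired.replace(year=acquired.year + 1, day=28)


def is_long_term(acquired: date, sold: date) -> bool:
    """True only if the sale falls strictly after the one-year anniversary."""
    return sold > one_year_anniversary(acquired)


acquired = date(2023, 3, 15)
on_anniversary = is_long_term(acquired, date(2024, 3, 15))   # one year exactly
day_after = is_long_term(acquired, date(2024, 3, 16))        # one year + one day
```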

The synthetic data needed: lot-level resolution at 30–300 lots/position, multiple linked accounts (including non-trading IRAs and HSAs that can still trigger cross-account wash-sales), special-status flags including QSBS, and time-series that span the wash-sale 30+30 day window.
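A cross-account wash-sale check can be sketched as a scan over every linked account's trades. In the example below, `Trade` and `security_group` are hypothetical: the group label stands in for whatever the engine uses to decide that two securities are substantially identical. The sketch also ignores one subtlety, namely that the purchase of the very lot being sold should not count as a replacement purchase.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Trade:
    account: str
    security_group: str        # hypothetical "substantially identical" label
    side: str                  # "buy" or "sell"
    trade_date: date
    realized_gain: float = 0.0  # negative for a loss; only meaningful on sells


WASH_WINDOW = timedelta(days=30)  # 30 days before and 30 days after the sale


def wash_sale_conflicts(trades):
    """Pair each loss sale with buys of the same group within the window, in ANY account."""
    loss_sales = [t for t in trades if t.side == "sell" and t.realized_gain < 0]
    buys = [t for t in trades if t.side == "buy"]
    conflicts = []
    for sale in loss_sales:
        for buy in buys:
            same_group = buy.security_group == sale.security_group
            in_window = abs(buy.trade_date - sale.trade_date) <= WASH_WINDOW
            if same_group and in_window:
                conflicts.append((sale, buy))
    return conflicts


trades = [
    Trade("taxable", "us_large_cap", "sell", date(2025, 3, 10), -500.0),
    Trade("ira", "us_large_cap", "buy", date(2025, 3, 11)),      # next day, other account
    Trade("ira", "us_large_cap", "buy", date(2025, 5, 1)),       # outside the window
    Trade("taxable", "intl_developed", "buy", date(2025, 3, 12)),  # different group
]
conflicts = wash_sale_conflicts(trades)
```

The check deliberately does not filter on `account`, which is the point of the first bug in the inventory: an account-local check would pass this trade list.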

Module 4: retirement projection

The retirement projection module produces the headline output most robo customers actually look at — the "how am I tracking" dashboard. The math behind it is, in honest implementations, a regime-switching Monte Carlo simulation with adaptive withdrawals; in many implementations, it's IID-normal returns with constant inflation-adjusted withdrawals.

The bugs that ship from the simpler version:

Confidence bands that are too narrow (don't capture regime risk)
Sequence-of-returns scenarios that aren't worst-case enough
Withdrawal projections that ignore within-year cash-flow seasonality
"Probability of success" headlines that hide the path-dependent failures

A robo's test corpus needs households with monthly cash-flow patterns, realistic income seasonality (SS month, RMD month, K-1 timing), and a projection window spanning multiple market regimes. The corpus has to support backtest replays of the engine's recommendations against real historical paths.
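To make the contrast with IID-normal returns concrete, here is a minimal two-regime Monte Carlo sketch using only the standard library. The regime parameters are illustrative assumptions, not calibrated estimates, and withdrawals are held constant for brevity (an adaptive-withdrawal rule would replace the fixed `monthly_withdrawal`).

```python
import random

# Illustrative monthly parameters (assumptions): mean return, volatility,
# and the probability of remaining in the same regime next month.
REGIMES = {
    "calm": {"mean": 0.007, "stdev": 0.03, "stay": 0.95},
    "crisis": {"mean": -0.015, "stdev": 0.08, "stay": 0.80},
}


def simulate_path(rng, months, start_balance, monthly_withdrawal):
    """Return True if the balance stays positive for the whole horizon."""
    regime = "calm"
    balance = start_balance
    for _ in range(months):
        params = REGIMES[regime]
        balance *= 1.0 + rng.gauss(params["mean"], params["stdev"])
        balance -= monthly_withdrawal
        if balance <= 0:
            return False  # path-dependent failure: ruin before the horizon ends
        if rng.random() > params["stay"]:
            regime = "crisis" if regime == "calm" else "calm"
    return True


def probability_of_success(n_paths, months, start_balance, monthly_withdrawal, seed=0):
    """Share of simulated paths that survive; seeded so test runs are repeatable."""
    rng = random.Random(seed)
    survived = sum(
        simulate_path(rng, months, start_balance, monthly_withdrawal)
        for _ in range(n_paths)
    )
    return survived / n_paths


p = probability_of_success(200, 96, 1_000_000.0, 4_000.0, seed=1)
```

Because crisis months cluster instead of arriving independently, early-retirement drawdowns are deeper than an IID model with the same long-run mean would produce, which is the sequence risk the simpler version misses.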

Module 5: withdrawal sequencing

Once the household enters retirement, the robo's withdrawal-sequencing module decides which accounts to draw from in which order. The textbook rule (taxable first, then tax-deferred, then Roth) is wrong for the households who benefit most from optimization: pre-Medicare retirees with ACA subsidies, post-Medicare retirees near IRMAA bracket boundaries, and households with significant capital-loss carryforwards. Related: annuity modeling retirement income.

The synthetic data this needs: households at every constraint-binding scenario. A test corpus where every retiree is post-Medicare with no NIIT exposure tests one branch of the optimizer. The corpus that catches optimizer bugs has households at IRMAA bracket boundaries, at ACA premium-subsidy cliffs, at NIIT thresholds, and with capital gains stacking.
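Generating constraint-binding households can be as simple as placing incomes just below, at, and just above each threshold. In this sketch the threshold table is supplied by the caller; the NIIT figures are the statutory $200,000 (single) and $250,000 (married filing jointly) amounts, while IRMAA and ACA figures change annually and should be loaded from current-year tables rather than hard-coded. The function name is hypothetical.

```python
def boundary_incomes(thresholds, offsets=(-1, 0, 1)):
    """Return (constraint name, income) pairs straddling each threshold."""
    cases = []
    for name, threshold in sorted(thresholds.items()):
        for offset in offsets:
            cases.append((name, threshold + offset))
    return cases


# NIIT thresholds are fixed by statute; add IRMAA tiers and ACA cliffs
# from the current year's published tables before generating the corpus.
THRESHOLDS = {
    "niit_single": 200_000,
    "niit_married_joint": 250_000,
}

cases = boundary_incomes(THRESHOLDS)
```

Each resulting income becomes the seed for one synthetic household, so every branch of the optimizer sees a case on both sides of the boundary.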

Module 6: behavioral nudging

The robo's behavioral-finance module is what differentiates a "calculator" from a "coach." It nudges the customer toward higher savings, away from market-timing reactions during downturns, and toward contributing the IRA limit by the deadline. The nudges can be wrong in two directions:

Tone-deaf to actual cash position. Nudging a customer to increase contributions while they're in a Q1 cash crunch produces churn, not behavior change.
Blind to tax interactions. A "contribute to your IRA" nudge sent to a customer whose AGI is already in the IRA contribution phase-out is wrong advice.

The synthetic data this needs: within-month cash-flow tracking, AGI projection, and contribution-deadline calendar awareness.
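Both failure modes can be expressed as a gate in front of the nudge. The sketch below is a hedged illustration: `HouseholdMonth`, the one-month cash buffer, and the caller-supplied `phaseout_start` are all assumptions, and the real phase-out figure depends on filing status and tax year.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HouseholdMonth:
    daily_cash_balances: tuple      # within-month balances, not a month-end total
    monthly_essential_spend: float
    projected_agi: float


def should_send_contribution_nudge(h, phaseout_start, buffer_months=1.0):
    """Suppress the nudge in a cash crunch or once projected AGI reaches the phase-out."""
    lowest_cash = min(h.daily_cash_balances)
    in_cash_crunch = lowest_cash < buffer_months * h.monthly_essential_spend
    in_phaseout = h.projected_agi >= phaseout_start
    return not (in_cash_crunch or in_phaseout)


PHASEOUT_START = 150_000.0  # placeholder; look up the current-year figure

healthy = HouseholdMonth((5_000.0, 4_200.0, 6_100.0), 3_000.0, 90_000.0)
crunch = HouseholdMonth((5_000.0, 800.0, 6_100.0), 3_000.0, 90_000.0)
phased_out = HouseholdMonth((5_000.0, 4_200.0, 6_100.0), 3_000.0, 160_000.0)
```

Note that `crunch` and `healthy` have identical month-end balances; only the within-month low point differs, which is why monthly aggregates cannot test this gate.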

The cross-module integration tests

The harder bugs aren't in any single module — they're in module interactions. The integration tests we run on every robo we audit:

Test 1: Onboarding → Allocation consistency
The allocation produced by the engine matches the suitability profile from onboarding. A 'conservative' profile that gets a 70/30 equity/bond allocation is a bug.

Test 2: Rebalancing → TLH coordination
Rebalancing trades don't trigger wash sales on recently harvested losses. Engines that schedule rebalancing without considering harvest history fail this routinely.

Test 3: Projection → Withdrawal sequencing alignment
The retirement projection assumes a withdrawal sequence; the actual sequencing module has to match. Mismatches produce projections the engine cannot achieve.

Test 4: Cash-flow → Nudging alignment
Nudges are conditioned on the household's actual within-month cash position, not the year-end total. Test that nudges don't fire during cash crunches.

Test 5: Multi-year regression
Run the full module stack across a 96-month projection. Validate that no module's output contradicts another's. The most common cross-module bug surfaces in this test.
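Tests 1 and 3 reduce to small consistency predicates that a regression suite can call on every household in the corpus. The equity caps per profile below are hypothetical placeholders; each firm defines its own bands.

```python
# Hypothetical maximum equity share per suitability profile.
MAX_EQUITY = {"conservative": 0.40, "moderate": 0.70, "aggressive": 1.00}


def allocation_matches_profile(profile, allocation):
    """Test 1: the equity share must not exceed the cap for the profile."""
    return allocation.get("equity", 0.0) <= MAX_EQUITY[profile] + 1e-9


def sequencing_matches_projection(assumed_order, actual_order):
    """Test 3: the projection's assumed draw order equals the sequencer's output."""
    return list(assumed_order) == list(actual_order)


seventy_thirty = {"equity": 0.70, "bond": 0.30}
conservative_ok = allocation_matches_profile("conservative", seventy_thirty)
moderate_ok = allocation_matches_profile("moderate", seventy_thirty)
aligned = sequencing_matches_projection(
    ["taxable", "tax_deferred", "roth"],
    ["tax_deferred", "taxable", "roth"],
)
```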

What a production-grade test corpus looks like

A robo-advisor's test corpus, at minimum:

Robo test corpus essentials

200+ households spanning age, income, net worth, and life-stage diversity
Multi-account structure: every household has at least taxable + IRA; many have additional HSA, 401(k), 529, Roth
Lot-level resolution within each account (30–300 lots typical)
96-month longitudinal with monthly cash-flow seasonality
Edge cases: pre-Medicare ACA, post-Medicare IRMAA, NIIT, QSBS, multi-state
Regime-spanning historical-return paths for backtesting
Behavioral-event scenarios: market drops, customer panic events, deadline-driven decisions
Reg BI documentation requirements: every onboarding produces a structured suitability record

A test corpus missing any of these is a corpus that exercises a fraction of the engine. Robos shipping to production from incomplete corpora are the robos that ship surprises. The engineering investment in a complete corpus is real, typically $20K–$50K of synthetic data spend for a corpus of meaningful coverage, but a single production incident on a real customer eclipses that cost in regulatory exposure alone.
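Several of the essentials above are mechanically checkable, so a coverage gate can run before any engine test does. The sketch below assumes a hypothetical household record with `account_types`, `snapshot_months`, and `edge_tags` keys; the tag names mirror the edge-case list but are otherwise invented.

```python
REQUIRED_EDGE_TAGS = {
    "pre_medicare_aca", "post_medicare_irmaa", "niit", "qsbs", "multi_state",
}


def corpus_gaps(households, min_households=200, months=96):
    """Return human-readable coverage gaps; an empty list means the gate passes."""
    gaps = []
    if len(households) < min_households:
        gaps.append(f"only {len(households)} households, need {min_households}")
    seen_tags = set()
    for i, h in enumerate(households):
        if not {"taxable", "ira"} <= set(h["account_types"]):
            gaps.append(f"household {i}: missing taxable or IRA account")
        if h["snapshot_months"] < months:
            gaps.append(f"household {i}: {h['snapshot_months']} monthly snapshots, need {months}")
        seen_tags.update(h["edge_tags"])
    for tag in sorted(REQUIRED_EDGE_TAGS - seen_tags):
        gaps.append(f"no household tagged {tag}")
    return gaps
```

Lot counts, regime coverage, and behavioral-event scenarios would need their own checks; this gate covers only the structural items.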

Key takeaways

A robo-advisor is six modeling problems entangled. The test corpus has to exercise all six, with their interactions, not just the prettiest module. Companion piece: [HNW family office platforms](/articles/building-hnw-family-office-platform).
Onboarding, allocation, TLH, retirement projection, withdrawal sequencing, and behavioral nudging are the six. Each has its own bug class and its own synthetic-data resolution requirement.
Module integration is where the hardest bugs live. Most of the production incidents we see come from rebalancing-TLH coordination failures and projection-vs-actual sequencing mismatches.
Test corpus essentials: multi-account households at lot-level resolution, 96-month monthly longitudinal, regime-spanning return paths, edge cases at every constraint-binding scenario.
Synthetic data spend for a production-grade robo corpus runs $20K–$50K. Single production incidents on real customers eclipse the cost in regulatory exposure alone.

Frequently asked questions

How much synthetic data do we need vs real customer data?

For pre-launch testing, all synthetic. The engine has to be correct before any customer touches it, and synthetic data is the only ethical and compliant way to exercise the full edge-case spectrum. Post-launch, real customer data exists but should not be used in development environments without explicit pseudonymization and access controls. Most production robos run a 95/5 synthetic / real-data split for ongoing testing — synthetic for the bulk, real for adversarial-signal coverage.

Should the test corpus reflect the demographic of our target customer base?

Yes for distribution-realism testing. No for edge-case coverage. The corpus should have two layers: a distributional layer that matches the target customer base for performance testing, and an edge-case layer that over-represents the hard cases. Engines tested only on distributional data ship bugs in the long tail; engines tested only on edge cases produce wrong distributional outputs. Both layers are needed.

How do we test the robo's recommendations against actual outcomes?

Backtesting against historical return paths is the standard. The corpus has to include households with realistic starting positions and the engine has to produce recommendations that can be replayed against the historical paths. Most robos do this as a post-hoc analysis; the better pattern is to bake backtesting into the regression test suite so every engine version is validated against the same historical paths.
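Baking backtests into the regression suite can look like the sketch below: each engine version is replayed over the same stored paths and compared with a recorded baseline. The engine here is reduced to a function returning an equity weight, and the return path is a toy placeholder, not real historical data; both are assumptions made for illustration.

```python
def replay(engine, start_balance, return_path):
    """Apply the engine's equity weight each period over (equity, bond) returns."""
    balance = start_balance
    for equity_return, bond_return in return_path:
        weight = engine(balance)
        balance *= 1.0 + weight * equity_return + (1.0 - weight) * bond_return
    return balance


def regression_check(engine, baseline, paths, start_balance=100_000.0, tolerance=1e-6):
    """Return names of paths whose final balance moved away from the stored baseline."""
    failures = []
    for name, path in paths.items():
        final = replay(engine, start_balance, path)
        if abs(final - baseline[name]) > tolerance * max(1.0, abs(baseline[name])):
            failures.append(name)
    return failures


def engine_v1(balance):
    return 0.6  # constant 60% equity, stand-in for a real recommendation


def engine_v2(balance):
    return 0.8  # a changed engine version


# Toy placeholder path; a real suite would load stored historical return series.
PATHS = {"toy_drawdown": [(-0.10, 0.01), (0.05, 0.00)]}
BASELINE = {name: replay(engine_v1, 100_000.0, path) for name, path in PATHS.items()}
```

A non-empty result from `regression_check` does not necessarily mean the new version is wrong, only that its behavior on a known path changed and someone should sign off on the difference.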

What about ongoing monitoring once the robo is in production?

Production monitoring is a separate scope from pre-launch testing but uses similar synthetic-data infrastructure. The pattern: keep the synthetic test corpus current (refresh annually or after material engine changes) and use it as a 'shadow' production environment to detect regressions before they hit real customers. A new engine version is validated against the synthetic corpus before being routed to any percentage of real traffic.