How to Backtest a Tax-Loss Harvesting Algorithm with Synthetic Households
Tax-loss harvesting (TLH) algorithms are increasingly the differentiator in the direct-indexing wars. Differences in TLH yield between implementations can compound to 100+ basis points of after-tax return, the difference between a competitive product and an also-ran. But the algorithms can't be backtested on cartoon data. Backtesting requires test fixtures with the structural properties real taxable accounts exhibit: lot-level cost basis, realistic acquisition-date distributions, wash-sale conflicts that fire at a calibrated frequency, and QSBS-eligible positions whose holding period and gross-asset test interact non-trivially with TLH decisions. This guide walks through the methodology.
What 'realistic' means for a TLH backtest
There are three dimensions of realism that matter for TLH backtesting and that most synthetic fixtures fail to capture.
First, lot-level structure. A real taxable account isn't a single position with an average cost basis; it's a stack of lots, each with its own acquisition date and cost basis, that the TLH algorithm picks among using specific-ID accounting. Test fixtures that aggregate to position-level averages produce TLH yields that are systematically wrong because the harvest opportunity is concentrated in specific underwater lots, not in the position average.
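A minimal sketch of why position-level averaging misleads, assuming a hypothetical lot shape with `shares`, `cost_basis_per_share`, and `current_price` fields (only `current_price` appears elsewhere in this guide; the other names are illustrative). The position as a whole shows a gain, yet one lot carries a harvestable loss:

```javascript
// Hypothetical lot shape; field names other than current_price are assumptions.
const lots = [
  { shares: 100, cost_basis_per_share: 50, current_price: 80 }, // +3000
  { shares: 100, cost_basis_per_share: 95, current_price: 80 }, // -1500
  { shares: 100, cost_basis_per_share: 70, current_price: 80 }, // +1000
];

// Unrealised profit or loss of a single lot.
function lotPnl(lot) {
  return lot.shares * (lot.current_price - lot.cost_basis_per_share);
}

// Position-level view: all lots netted into one number.
function positionUnrealised(lots) {
  return lots.reduce((sum, lot) => sum + lotPnl(lot), 0);
}

// Lot-level view: total loss available to specific-ID harvesting.
function harvestableLoss(lots) {
  return lots.reduce((sum, lot) => {
    const pnl = lotPnl(lot);
    return pnl < 0 ? sum - pnl : sum;
  }, 0);
}

console.log(positionUnrealised(lots)); // 2500: the position shows a net gain
console.log(harvestableLoss(lots));    // 1500: the loss sits in one lot
```

A fixture that stored only the position average would report nothing to harvest here, which is the systematic error described above.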
Second, cross-account wash-sale dynamics. The IRS wash-sale rule disallows a loss when substantially identical securities are acquired within 30 days before or after the sale, including in the spouse's IRA, the same investor's other accounts, or even certain options positions. About 6% of would-be-harvestable lots in a real-world investor's account are blocked by cross-account conflicts. Test fixtures that ignore this dimension overestimate TLH yield by a similar amount.
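A household-wide wash-sale check might look like the sketch below. The wash-sale window covers 30 days on either side of the sale, so the check looks in both directions. The trade record shape (`account_id`, `ticker`, `side`, `date`) is hypothetical, and "substantially identical" is simplified to "same ticker", which a production check would need to widen.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole-day distance between two ISO dates (YYYY-MM-DD), parsed as UTC.
function daysBetween(a, b) {
  return Math.abs(Date.parse(a) - Date.parse(b)) / MS_PER_DAY;
}

// True if any account in the household buys the same ticker within
// 30 days before or after the proposed sale date.
// Assumptions: householdTrades excludes the acquisition of the lot being
// sold; options and look-alike funds are not modelled.
function isWashSaleBlocked(sale, householdTrades) {
  return householdTrades.some(t =>
    t.side === "buy" &&
    t.ticker === sale.ticker &&
    daysBetween(t.date, sale.date) <= 30);
}

// Hypothetical household trades across two accounts.
const trades = [
  { account_id: "spouse_ira", ticker: "AAA", side: "buy", date: "2025-03-20" },
  { account_id: "taxable_1", ticker: "BBB", side: "buy", date: "2025-03-05" },
];

console.log(isWashSaleBlocked({ ticker: "AAA", date: "2025-03-01" }, trades)); // true (19 days)
console.log(isWashSaleBlocked({ ticker: "AAA", date: "2025-05-01" }, trades)); // false (42 days)
```

Passing the whole household's trades, rather than one account's, is what lets the spouse's IRA purchase block the harvest in the first call.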
Third, QSBS interactions. For founder and early-employee clients, the Section 1202 holding period and gross-asset test interact with the TLH decision: selling a QSBS-eligible position before the 5-year mark forfeits the federal capital-gains exclusion. A TLH algorithm that doesn't recognise QSBS positions can recommend a harvest worth 20-25% of the realised loss in tax savings while triggering the forfeiture of an exclusion worth 23.8% of the gain.
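One way to guard against this is a holding-period check that runs before any sale is recommended. The sketch below assumes hypothetical lot fields `qsbs_eligible` and `acquired_on`; the corpus's actual QSBS field layout is not shown in this guide. It uses simplified date arithmetic and ignores the gross-asset test entirely.

```javascript
// Date of the 5-year anniversary of acquisition.
// Date-only ISO strings (YYYY-MM-DD) are parsed as UTC midnight.
function fiveYearMark(acquiredOn) {
  const d = new Date(acquiredOn);
  d.setUTCFullYear(d.getUTCFullYear() + 5);
  return d;
}

// True if selling on saleDate would fall short of the holding period.
// Section 1202 requires holding for MORE than 5 years, so a sale on the
// anniversary itself still counts as too early in this sketch.
// qsbs_eligible and acquired_on are hypothetical field names.
function wouldForfeitQsbs(lot, saleDate) {
  if (!lot.qsbs_eligible) return false;
  return new Date(saleDate) <= fiveYearMark(lot.acquired_on);
}

const founderLot = { qsbs_eligible: true, acquired_on: "2021-06-15" };
console.log(wouldForfeitQsbs(founderLot, "2025-01-10")); // true
console.log(wouldForfeitQsbs(founderLot, "2026-06-16")); // false
```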
The backtest pipeline
A defensible TLH backtest pipeline has three stages. Each stage produces an artifact you'll show to a stakeholder skeptical that the result is real.
- Stage 1: corpus snapshot. Apply the TLH algorithm to a fixed snapshot of the test corpus and record the harvest schedule produced. The harvest schedule is the algorithm's recommendation, not yet validated against constraints.
- Stage 2: constraint check. Run each recommended harvest through the wash-sale, holding-period, and QSBS-disqualification checks. Filter out the recommendations that violate constraints. The remaining schedule is the actual harvest set the algorithm would execute.
- Stage 3: yield calculation. For each executed harvest, compute the federal and state tax savings. Harvested losses first offset the year's realised gains; any excess offsets at most $3,000 of ordinary income, and the remainder carries forward. Sum across the corpus to produce the aggregate TLH yield, expressed in basis points of the corpus's total taxable-account assets.
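The three stages can be sketched as three small functions. Everything here is illustrative: the harvest shape (`household_id`, `loss`), the household fields (`id`, `realised_gains`, `taxable_assets`), and the single blended tax `rate` are assumptions, and a real yield calculation would distinguish short-term from long-term losses and federal from state rates.

```javascript
// Annual limit on net capital losses deductible against ordinary income.
const ORDINARY_INCOME_CAP = 3000;

// Stage 1: apply the algorithm to a fixed snapshot. Output is unvalidated.
function proposeHarvests(corpus, algorithm) {
  return corpus.flatMap(household => algorithm(household));
}

// Stage 2: keep only proposals that pass every constraint check
// (wash-sale, holding-period, QSBS, and so on).
function filterHarvests(proposed, checks) {
  return proposed.filter(h => checks.every(check => check(h)));
}

// Stage 3: cap each household's usable loss at realised gains plus the
// ordinary-income limit, then express total savings in basis points.
function yieldBps(corpus, executed, rate) {
  let savings = 0;
  let assets = 0;
  for (const hh of corpus) {
    const harvested = executed
      .filter(h => h.household_id === hh.id)
      .reduce((sum, h) => sum + h.loss, 0);
    const usable = Math.min(harvested, hh.realised_gains + ORDINARY_INCOME_CAP);
    savings += usable * rate;
    assets += hh.taxable_assets;
  }
  return (savings * 10000) / assets;
}
```

Keeping the stages as separate functions means the proposed schedule, the filtered schedule, and the yield figure can each be saved and reviewed on its own.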
Calibration: why the corpus's deviation rating matters
For backtesting to produce reproducible results, the test corpus must be deterministic. Two backtest runs of the same algorithm against the same corpus snapshot must produce the same harvest schedule, the same constraint filter, and the same yield calculation. This is the rationale for the 'Minimal' deviation rating on the WealthSchema TLH bundle — lots and balances are deterministic given the seed, so regression tests work.
This property is what lets you make claims like 'the new algorithm produces 12% higher TLH yield than the prior version' and have them hold up. With a deterministic corpus, any change in yield is attributable to the algorithm change; without determinism, the comparison drowns in run-to-run noise.
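A regression check built on this property could be as small as the sketch below. It assumes each harvest record carries hypothetical `household_id`, `lot_id`, and `loss` fields, and that `runOnce` is a caller-supplied function that runs the backtest against the same corpus snapshot and returns the harvest schedule.

```javascript
// Canonical string form of a harvest schedule. Entries are sorted so that
// ordering differences alone do not register as a change (a design choice;
// drop the sort if execution order matters to you).
function fingerprint(schedule) {
  return JSON.stringify(
    schedule.map(h => `${h.household_id}|${h.lot_id}|${h.loss}`).sort());
}

// Runs the backtest twice and throws if the two schedules differ.
function assertReproducible(runOnce) {
  const first = fingerprint(runOnce());
  const second = fingerprint(runOnce());
  if (first !== second) {
    throw new Error("Harvest schedule differs between identical runs");
  }
  return first;
}

// Relative change in yield between two algorithm versions, e.g. 0.12 for 12%.
function relativeUplift(baselineBps, candidateBps) {
  return (candidateBps - baselineBps) / baselineBps;
}
```

Storing the fingerprint alongside each released algorithm version gives later runs a fixed value to compare against.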
Stress regimes: backtesting beyond the snapshot
The corpus is calibrated to a 'normal' market environment, with about 18% of lots showing net unrealised losses at snapshot. For backtesting under stress regimes (a 2022-style downturn, a 2008-style crisis), apply a parameterised re-pricing transformation to the corpus before running the backtest. The Methodology PDF describes the stress-regime re-pricing procedure; the corpus structure preserves lot-level integrity through the transformation.
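
    // Stress-regime re-pricing sketch.
    // The regime name and factor values below are illustrative placeholders,
    // not calibrated figures.
    const STRESS_FACTORS = {
      downturn_2022: { equity: 0.80, bond: 0.87, market_excess: -0.05 },
    };

    function applyStressRegime(corpus, regimeName) {
      const factors = STRESS_FACTORS[regimeName];
      if (!factors) throw new Error(`Unknown stress regime: ${regimeName}`);
      return corpus.map(household => ({
        ...household,
        accounts: {
          ...household.accounts,
          taxable: household.accounts.taxable.map(account => ({
            ...account,
            lots: account.lots.map(lot => ({
              ...lot,
              // Asset-class shock, then a beta-scaled market-excess adjustment.
              // A missing asset-class factor or beta leaves that term neutral.
              current_price: lot.current_price *
                (factors[lot.asset_class] ?? 1) *
                (1 + (lot.beta ?? 0) * factors.market_excess)
            }))
          }))
        }
      }));
    }

Reporting: what the backtest result actually says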
The backtest output isn't 'the algorithm produces X bps of TLH yield.' It's 'the algorithm produces X bps of TLH yield against Y corpus under Z assumptions.' The corpus citation matters because the answer changes meaningfully across corpora — a corpus weighted toward ETF-heavy holdings produces lower TLH yield than a corpus weighted toward single-stock holdings, even with the same algorithm. The assumption citation matters because tax-rate assumptions, state-tax assumptions, and short-vs-long-term gain composition all affect the dollar yield even with the same harvest schedule.
Report the backtest with the full provenance: corpus version (e.g. 'WealthSchema TLH bundle, 2026 annual refresh, 350 households'), constraint set (e.g. 'wash-sale within 30 days across all accounts in household graph'), and assumption set (e.g. 'federal marginal rate 37%, state marginal rate 9.3% for CA residents, 0% for FL residents'). The provenance is what makes the backtest defensible to investment committees, regulators, and customers.
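One way to keep a result from circulating without its provenance is to make the report object refuse to exist without it. The sketch below is an assumption about how such a record could be shaped; the key names (`corpus`, `constraints`, `assumptions`, `yield_bps`) are hypothetical, and the example strings are taken from the paragraph above.

```javascript
// Builds an immutable backtest report, rejecting any result that lacks
// one of the three provenance components.
function buildReport(yieldBps, provenance) {
  const required = ["corpus", "constraints", "assumptions"];
  const missing = required.filter(key => !provenance[key]);
  if (missing.length > 0) {
    throw new Error(`Missing provenance: ${missing.join(", ")}`);
  }
  return Object.freeze({ yield_bps: yieldBps, ...provenance });
}

// Example usage; the yield figure is a made-up placeholder.
const report = buildReport(42, {
  corpus: "WealthSchema TLH bundle, 2026 annual refresh, 350 households",
  constraints: "wash-sale within 30 days across all accounts in household graph",
  assumptions: "federal marginal rate 37%, state marginal rate 9.3% for CA residents, 0% for FL residents",
});
```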
Key takeaways
- TLH backtests fail when the test fixture lacks lot-level structure, cross-account wash-sale dynamics, or QSBS interactions — three dimensions most synthetic fixtures skip.
- A defensible backtest pipeline has three stages: snapshot run, constraint filter, yield calculation. Each stage produces a reviewable artifact.
- Determinism (the 'Minimal' deviation rating) is what makes regression testing work — same corpus + same algorithm always produces same yield.
- Stress-regime backtesting works by parameterised re-pricing applied to the corpus snapshot, preserving lot-level integrity through the transformation.
- Backtest results should always cite corpus version, constraint set, and assumption set — the provenance is what makes the result defensible.
FAQ
What corpus size is sufficient for TLH backtesting?
350 households (the size of the WealthSchema TLH bundle) is sufficient for both algorithm validation and aggregate-yield estimation with reasonable statistical confidence. For specific sub-populations (e.g. backtesting performance specifically on RSU-vesting tech workers), the bundle's archetype mix supports filtering down to ~50-100 households per sub-population, which is enough for coarse comparisons.
Can I backtest direct-indexing strategies that involve customisation?
Yes. The corpus's structural depth supports backtesting customisation strategies — values-based exclusions, factor tilts, sector preferences. For each customisation, define the deviation from a baseline portfolio and apply the same TLH algorithm to both. The differential is the impact of customisation on TLH yield.
How does the corpus handle different cost-basis methods?
Specific-ID is dominant in the corpus (~75% of taxable accounts) because it's the method TLH algorithms typically require. FIFO accounts are present (~20%) and average-cost accounts are represented (~5%) — testing your algorithm's handling of multiple methods is supported.
What about corporate actions (splits, mergers, spin-offs)?
The corpus snapshot captures the post-corporate-action state of each lot. To backtest through a corporate-action event, apply the appropriate transformation to the relevant lots before running the algorithm (for example, a 1-for-2 reverse split halves the share count and doubles the cost basis per share). The Methodology PDF includes a corporate-action playbook.
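A split adjustment of this kind might be sketched as follows. The lot fields `shares` and `cost_basis_per_share` are hypothetical names, and this is not the playbook's own procedure, only an illustration of the arithmetic. The invariant is that total cost basis and total market value are unchanged by the split.

```javascript
// Apply a split to one lot. ratio = new shares per old share:
// 2 for a 2-for-1 split, 0.5 for a 1-for-2 reverse split.
// Other lot fields (such as acquisition date) are carried over unchanged.
function applySplit(lot, ratio) {
  return {
    ...lot,
    shares: lot.shares * ratio,
    cost_basis_per_share: lot.cost_basis_per_share / ratio,
    current_price: lot.current_price / ratio,
  };
}

const before = { shares: 100, cost_basis_per_share: 40, current_price: 30 };
const after = applySplit(before, 0.5); // 1-for-2 reverse split
console.log(after.shares, after.cost_basis_per_share, after.current_price); // 50 80 60
```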
Can I export the corpus to my data warehouse for analytical work?
Yes. The Parquet format is recommended for analytical work: it's columnar and compresses the lot-level data efficiently. The CSV format works for SQL-warehouse ingestion in long format. JSON nests the lots inside each household and is best for record-by-record processing.
How does the corpus's QSBS handling work with my algorithm?
QSBS-eligible positions carry a structured `qsbs_attestation` field with the Section 1202 attestation chain. Your TLH algorithm can read this field and avoid harvesting a QSBS-eligible position before its 5-year holding period is met. The field's structure is documented in the Methodology PDF.
Are there households where TLH math goes negative (more harm than benefit)?
Yes — the corpus includes households where naive TLH would produce negative yield (e.g. harvesting a position that's underwater but where the wash-sale conflict would defer the loss recognition past expiration). About 8% of the corpus has at least one such structural trap, useful for testing your algorithm's ability to avoid these cases.