Tax-loss harvesting algorithms look elegant on a whiteboard. They turn into a maze the moment they meet real taxable accounts, where lots have specific acquisition dates, multiple cost-basis methods are in play, wash-sale windows span account boundaries, and QSBS attestations interact with non-grantor trust structures in non-obvious ways. The Tax-Loss Harvesting Simulator is a corpus of 350 households built for the engineers writing the algorithm and the platforms productising it. It provides lot-level cost basis with specific-ID tracking, cross-account wash-sale flagging, and the QSBS Section 1202 attestation chain that direct-indexing platforms increasingly need to support.
Direct indexing has gone from boutique service to default offering at every major wealth platform in three years. The algorithms underneath are not commodities. The differences in TLH yield from one implementation to another can compound to 100+ basis points of after-tax return — the difference between a competitive product and an also-ran. But validating the algorithms requires test data with structural properties most synthetic data products skip: lot-level cost basis with realistic acquisition-date distributions, holdings that mathematically produce wash-sale conflicts at the right frequency, QSBS-eligible positions with actual gross-asset-test history, and household-level coordination across taxable and tax-deferred accounts.
Without this kind of fixture data, TLH algorithm development happens in a vacuum. You can't backtest. You can't reproduce a customer's reported issue. You can't run regression tests against a known-good baseline. The Tax-Loss Harvesting Simulator is the canonical fixture for this work.
Backtests TLH algorithms against 350 households with realistic lot-level cost basis, validating that the algorithm's wash-sale avoidance, holding-period optimization, and rebalancing trade-off logic produce expected after-tax return improvements vs. a no-TLH baseline.
Demos the platform's TLH capability to advisor and RIA prospects using a corpus that produces realistic harvest opportunities — neither artificially over-loaded with losses (unrealistic yield) nor under-loaded (boring demo).
Researches TLH yield sensitivity to portfolio characteristics (turnover, factor tilts, fund vs. stock holdings, household-level rebalancing constraints) using a corpus structured for parametric study rather than ad-hoc anecdote.
Validates wash-sale rule compliance across the household's full account graph — taxable, IRA, spouse's accounts — ensuring the firm's automated trading systems don't generate disallowed losses that surface only at year-end on the 1099.
Tests the firm's QSBS tracking and Section 1202 attestation workflow against households whose founder-stock positions span multiple trusts, demonstrating to clients that the firm can handle the complexity their accountants currently miss.
The 350 households span ten archetypes selected for taxable-account complexity, including tech workers with vesting RSUs, dual-income professionals with brokerage accounts, peak earners with concentrated positions, HNW investors with multi-firm engagements, FIRE achievers with tax-aware withdrawal sequencing, real estate investors, and crypto-heavy investors whose digital-asset cost basis interacts with traditional brokerage holdings.
Every household has lot-level taxable-account holdings: each lot carries its acquisition date, original cost basis, current market value, holding-period status (short-term vs. long-term), and unrealized gain/loss. The cost-basis method is recorded per account (specific-ID is the dominant choice in this corpus, since it's the method required for sophisticated TLH). Wash-sale flags are pre-populated where applicable, both intra-account and cross-account (for example, the disallowed-loss case where an investor sells at a loss in a taxable account and triggers a wash sale by repurchasing in their IRA). Holding-period bucketing is structured for fast filter access. QSBS-eligible positions include the Section 1202 attestation chain: original issuance date, gross-asset-test history at issuance, holding-period satisfaction, and any stacking via non-grantor trusts.
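The derived lot fields described above (holding-period status and unrealized gain/loss) follow from the acquisition date, cost basis, and market value. The sketch below shows one way to compute them. The function name describeLot, the market_value field name, and the "short_term"/"long_term" labels are assumptions for illustration, not the corpus schema, and the leap-day acquisition case is not handled specially.

```javascript
// Illustrative only: describeLot, market_value and the status labels are
// assumed names, not confirmed fields of the shipped schema.
const MS_PER_DAY = 86_400_000;

function describeLot(lot, asOf) {
  // Parse ISO dates (YYYY-MM-DD) as UTC midnight to avoid timezone drift.
  const acquired = new Date(lot.acquisition_date + "T00:00:00Z");
  const snapshot = new Date(asOf + "T00:00:00Z");
  const daysHeld = Math.round((snapshot - acquired) / MS_PER_DAY);

  // Long-term treatment requires holding for MORE than one year, so the
  // snapshot date must fall strictly after the first anniversary.
  const anniversary = new Date(acquired);
  anniversary.setUTCFullYear(anniversary.getUTCFullYear() + 1);

  return {
    days_held: daysHeld,
    holding_period: snapshot > anniversary ? "long_term" : "short_term",
    unrealized_pnl: lot.market_value - lot.cost_basis,
  };
}

const lot = { acquisition_date: "2024-03-15", cost_basis: 5000, market_value: 4200 };
console.log(describeLot(lot, "2025-03-15")); // still short-term on the anniversary itself
console.log(describeLot(lot, "2025-03-16")); // long-term one day later
```

Comparing against the anniversary date rather than a fixed 365-day count keeps the check correct across leap years, which matters for lots sitting right at the boundary.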
The Data Set ships as JSON, CSV, and Parquet (Parquet is recommended for analytical work over the lot-level data — there are roughly 8,000 lots across the corpus, so Parquet's columnar layout makes ad-hoc queries fast). The WealthSynth Methodology PDF documents the lot-generation methodology, the calibration of unrealized-loss frequency against realistic market conditions, and the specific structural patterns each archetype is designed to exercise.
Below is a redacted summary of one household from this Data Set. Names, employers, exact balances, and metro area are stripped; ages are bucketed; income and net worth are reported as bands. The full record (and all 350 like it) ships in the ZIP.
{
  "accounts.taxable.lots[].acquisition_date": <value>,
  "accounts.taxable.lots[].cost_basis": <value>,
  "accounts.taxable.lots[].unrealized_pnl": <value>,
  "taxes.wash_sale_flags": <value>,
  "taxes.qsbs_attestation": <value>
}
Returns lots with negative unrealized P&L and no purchase of the same (or a substantially identical) security in the household's account graph within the 30-day wash-sale window: the canonical TLH eligibility check.
households.flatMap(h =>
  h.accounts.taxable.lots.filter(lot =>
    lot.unrealized_pnl < 0 &&
    !h.taxes.wash_sale_flags.some(wf =>
      wf.symbol === lot.symbol &&
      Math.abs(wf.window_days) <= 30)
  )
)
Returns Section 1202 QSBS-eligible positions whose holding period is between 4 and 5 years: the window where holding-period planning matters most, because the QSBS exclusion vests at year 5.
households.flatMap(h =>
  h.equity_comp.qsbs_attestation
    .filter(p => p.years_held >= 4 && p.years_held < 5)
)
Returns lots whose holding period crosses 365 days within the next 60 days: a planning window where deferring a sale converts short-term gain to long-term gain at significant tax-rate benefit.
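// today and daysBetween are helpers the query assumes; these minimal versions expect ISO date strings.
const today = () => new Date();
const daysBetween = (from, to) =>
  Math.floor((new Date(to) - new Date(from)) / 86_400_000);
households.flatMap(h =>
  h.accounts.taxable.lots.filter(lot => {
    const days = daysBetween(lot.acquisition_date, today());
    return days >= 305 && days < 365 && lot.unrealized_pnl > 0;
  })
)
Returns the sum of unrealized losses available for harvest in each household, capped at the $3,000 annual ordinary-income offset plus realized gains for the year: the TLH yield actually usable this year, not the theoretical maximum.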
// isWashSaleConflicted is a helper the query assumes; this version reuses the flag check from the eligibility query.
const isWashSaleConflicted = (lot, h) =>
  h.taxes.wash_sale_flags.some(wf =>
    wf.symbol === lot.symbol && Math.abs(wf.window_days) <= 30);
households.map(h => {
  const harvestable = h.accounts.taxable.lots
    .filter(l => l.unrealized_pnl < 0 &&
      !isWashSaleConflicted(l, h))
    .reduce((s, l) => s + Math.abs(l.unrealized_pnl), 0);
  const realizedGains = h.taxes.realized_gains_ytd;
  const cap = realizedGains + 3000;
  return { id: h.id, tlh_yield: Math.min(harvestable, cap) };
})
Tax-loss-harvesting algorithms break on edge cases, not happy-path lots. The corpus is built to exercise the specific structural traps that show up in production: the ones that surface only at year-end on the 1099 and that engineering teams report as urgent bugs in January. Below is the enumeration: every case is grep-able in the household record at the field path shown, and the corpus-rate column reports the calibrated frequency across the 350 households.
A taxable account sells a security at a loss; within the 30-day window the household's IRA, the spouse's IRA, or a joint brokerage account buys the same security. The loss is disallowed, but few platforms detect it cross-account because the trades clear at different custodians. The corpus seeds both same-custodian and cross-custodian cases so detection logic gets tested against the harder one.
Field path: taxes.wash_sale_flags[]

The household sells one S&P 500 ETF at a loss and buys a different-ticker S&P 500 ETF the same day, expecting to dodge the wash-sale rule on a literal-symbol match. The IRS substantially-identical test catches it; naive symbol-match logic in the harvester does not. The corpus includes ETF→ETF, mutual-fund→ETF, and ETF→index-future swap cases.
Field path: taxes.wash_sale_flags[].substantially_identical

TLH algorithms typically require specific-ID cost basis to select the highest-basis lot for the sale. The corpus records the elected cost-basis method per account; algorithm regression tests should exercise all three modes (specific-ID, FIFO, and average cost) and confirm the algorithm degrades gracefully (or errors loudly) on non-specific-ID accounts.
Field path: accounts.taxable[].cost_basis_method

A lot's holding period crosses 365 days and the gain/loss tax treatment changes overnight. Lots in the 305–365-day band are the planning window where deferring a sale can materially shift after-tax outcomes. The corpus distributes acquisition dates to land a meaningful share of lots inside the boundary window so the algorithm's timing logic actually gets tested.
Field path: accounts.taxable[].lots[].acquisition_date

QSBS-eligible founder stock carries an attestation chain: original issuance date, gross-asset-test history at issuance, and (for stacked positions) the trust transfer events that multiplied the $10M-or-10x exclusion. Direct-indexing platforms are increasingly asked to track this; the corpus includes the full chain so the tracking logic can be validated end-to-end.
Field path: equity_comp.qsbs_attestation[]

A founder gifts QSBS-eligible stock into one or more non-grantor trusts to multiply the Section 1202 exclusion. Each trust is its own taxpayer; the attestation chain has to record the original issuance, the transfer event, and the trust's grantor status. The corpus includes single-trust and multi-trust stacking cases, the structure tax advisors actually use.
Field path: equity_comp.qsbs_attestation[].stacking_trusts

The Section 1202 exclusion vests at year 5. Positions in the 4-to-5-year holding-period band are the planning window where holding to the 5-year cliff matters most. The corpus seeds positions across the full holding-period range so the planning surface gets exercised, not just the trivially-eligible (>5 years) case.
Field path: equity_comp.qsbs_attestation[].years_held

The wash-sale rule does not currently apply to crypto, though pending legislation would change that. Households with both traditional brokerage and crypto positions need the boundary handled correctly today and need the logic ready to flip on a regulatory change. The corpus tags dual-asset households so cross-asset analyses are tractable.
Field path: assets.crypto[] + accounts.taxable[]

Buying call options on a stock the household just sold at a loss can trigger a wash sale on the underlying stock, a case that detection logic frequently misses because the symbol-level join doesn't catch the derivative relationship. The corpus seeds these cases (calls, puts, LEAPS) so the harder cross-instrument check gets exercised.
Field path: accounts.taxable[].option_positions[]

The $3,000 annual cap on ordinary-income offset interacts with realized capital gains: harvested losses first offset realized gains for the year (no cap), then offset up to $3,000 of ordinary income, then carry forward. Algorithms that compute TLH yield without modeling the cap-and-carryforward shape over-promise yield to the user. The corpus tracks realized gains YTD so the cap logic can be validated.
Field path: taxes.realized_gains_ytd

Frequencies are calibrated against a 'normal' equity-market environment, not a stress year. The Methodology PDF documents the parameterized re-pricing approach for shifting the corpus into a stress regime (2008, 2022, etc.) and the corresponding edge-case frequency shifts. For each row, the field path is grep-able in the household JSON; regression-test fixtures should exercise every case at least once.
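The cap-and-carryforward ordering described in the last edge case (realized gains first, then up to $3,000 of ordinary income, then carry forward) can be sketched as a small function. This is a simplified illustration, not the corpus's own logic: applyHarvestedLosses is a hypothetical name, and the sketch ignores short-term versus long-term netting, prior-year carryforwards, and filing-status variations of the cap.

```javascript
// Simplified sketch; applyHarvestedLosses is a hypothetical helper name.
// Assumes a single pool of losses and a $3,000 ordinary-income cap by default.
function applyHarvestedLosses(harvestedLosses, realizedGainsYtd, ordinaryCap = 3000) {
  // Step 1: losses offset realized gains for the year, with no cap.
  const offsetGains = Math.min(harvestedLosses, Math.max(realizedGainsYtd, 0));
  const remaining = harvestedLosses - offsetGains;

  // Step 2: up to ordinaryCap of what is left offsets ordinary income.
  const offsetOrdinary = Math.min(remaining, ordinaryCap);

  // Step 3: anything still unused carries forward to future years.
  const carryforward = remaining - offsetOrdinary;

  return { offsetGains, offsetOrdinary, carryforward };
}

console.log(applyHarvestedLosses(20000, 12000));
// → { offsetGains: 12000, offsetOrdinary: 3000, carryforward: 5000 }
```

Reporting the three buckets separately, rather than a single yield number, makes it easier to see when an algorithm is counting carryforward losses as current-year benefit.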
Each household's taxable-account holdings are generated against archetype-specific portfolio profiles (RSU-heavy tech workers carry concentrated single-stock positions; HNW investors carry diversified multi-fund portfolios; crypto-heavy households carry significant digital-asset cost-basis structures). Lot acquisition dates are sampled to produce realistic holding-period distributions, including the tail of long-held positions with embedded gains. Unrealized loss frequency is calibrated to a 'normal' market environment — about 18% of lots show net unrealized losses at the corpus snapshot date, in line with typical equity volatility. Wash-sale conflicts are seeded at archetype-realistic rates: dual-account households (taxable + IRA at same firm) produce more cross-account conflicts than single-account households. QSBS attestation chains are generated with full Section 1202 traceability — original issuance, gross-asset-test history, and any stacking trust structures. The corpus passes the WealthSynth consistency validator (cost basis reconciles, wash-sale flags are mathematically consistent with the lot graph, QSBS attestation is internally complete) and the LLM-as-judge gate. Annual refresh re-runs against current-year market levels and any updated tax-rate parameters.
About 18% of lots in the corpus show net unrealized losses at snapshot — calibrated to a 'normal' equity-market environment. The corpus is not designed to replicate a specific historical year (e.g. 2008 or 2022); it's a steady-state corpus for algorithmic backtesting. For market-stress backtesting, the Methodology PDF includes a parameterised re-pricing methodology you can apply to shift the corpus to a stress regime.
Yes — the corpus includes the meaningful cases: spouse's IRA holding the same security a taxable account just sold at a loss; substantially-identical fund swaps that fail the IRS test; option positions that trigger wash-sale on the underlying. About 6% of would-be-harvestable lots in the corpus are blocked by cross-account wash-sale conflicts, in line with empirical broker data.
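A cross-account check of this kind has to compare exposure, not just ticker, and has to look at purchases from every account in the household. The sketch below is one minimal way to do that. The EXPOSURE_GROUP mapping, the placeholder tickers, and the trade_date and account field names are all assumptions for illustration; which securities count as substantially identical is a policy decision the sketch does not settle.

```javascript
// Hypothetical exposure mapping: ETF_A and ETF_B stand in for two funds that
// a firm's policy treats as substantially identical. Not a legal determination.
const EXPOSURE_GROUP = { ETF_A: "index_x", ETF_B: "index_x" };
const groupOf = (symbol) => EXPOSURE_GROUP[symbol] ?? symbol;

// Convert an ISO date (YYYY-MM-DD) to a whole-day count since the epoch.
const dayNumber = (isoDate) =>
  Math.round(Date.parse(isoDate + "T00:00:00Z") / 86_400_000);

// Return every purchase, in any household account, that falls within
// windowDays before or after the loss sale and shares its exposure group.
function findWashSaleConflicts(lossSale, purchases, windowDays = 30) {
  const saleDay = dayNumber(lossSale.trade_date);
  return purchases.filter(p =>
    groupOf(p.symbol) === groupOf(lossSale.symbol) &&
    Math.abs(dayNumber(p.trade_date) - saleDay) <= windowDays
  );
}

const lossSale = { symbol: "ETF_A", trade_date: "2025-03-10", account: "taxable" };
const purchases = [
  { symbol: "ETF_B", trade_date: "2025-03-25", account: "spouse_ira" }, // different ticker, same exposure
  { symbol: "ETF_A", trade_date: "2025-04-20", account: "ira" },        // 41 days later, outside the window
  { symbol: "XYZ",   trade_date: "2025-03-11", account: "joint" },      // unrelated security
];
console.log(findWashSaleConflicts(lossSale, purchases).map(p => p.account));
// → [ 'spouse_ira' ]
```

Passing the household's full purchase list, rather than one account's, is what lets the same function cover the same-custodian and cross-custodian cases.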
Specific-ID is dominant (~75% of taxable accounts) because that's the method TLH algorithms typically require. FIFO accounts are present (~20%) and average-cost accounts (mostly mutual-fund-only households) are represented (~5%). The household record indicates the cost-basis method on each account.
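Because the elected method changes which lot a sale relieves, a harvester needs a branch per method. The sketch below illustrates the idea under stated assumptions: the method strings ("specific_id", "fifo", "average_cost"), the quantity field, and the function name pickLotToSell are hypothetical, and the specific-ID branch simply picks the highest basis per share.

```javascript
// Hypothetical sketch: method strings and the quantity field are assumed,
// not taken from the corpus schema.
function pickLotToSell(lots, method) {
  if (lots.length === 0) throw new Error("no lots to sell");
  switch (method) {
    case "specific_id":
      // Highest cost basis per share yields the largest loss (or smallest gain).
      return lots.reduce((best, l) =>
        l.cost_basis / l.quantity > best.cost_basis / best.quantity ? l : best);
    case "fifo":
      // Oldest lot first; ISO date strings compare correctly as text.
      return lots.reduce((oldest, l) =>
        l.acquisition_date < oldest.acquisition_date ? l : oldest);
    case "average_cost":
      // Fail loudly instead of silently harvesting the wrong basis.
      throw new Error("average-cost account: lot-level selection is not available");
    default:
      throw new Error(`unknown cost-basis method: ${method}`);
  }
}

const lots = [
  { id: "A", acquisition_date: "2021-01-10", cost_basis: 1000, quantity: 10 },
  { id: "B", acquisition_date: "2023-06-01", cost_basis: 1500, quantity: 10 },
];
console.log(pickLotToSell(lots, "specific_id").id); // → B
console.log(pickLotToSell(lots, "fifo").id);        // → A
```

Throwing on average-cost accounts is one way to "error loudly"; a production harvester might instead skip the account and log the reason.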
Yes — about 35% of QSBS-eligible positions in the corpus are stacked across non-grantor trusts to multiply the $10M-or-10x exclusion. The structured attestation chain includes the trust's grantor status, the trust's original-issuance date, and the gift / sale event that transferred the QSBS-eligible stock into the trust. This matches the planning structure tax advisors actually use.
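The arithmetic behind stacking can be illustrated with the $10M-or-10x figure quoted above: each separate taxpayer gets its own cap, so the household's combined capacity is the sum across holders. The sketch below is a rough illustration only. The function names and the adjusted_basis field are hypothetical, the cap parameters are taken from the text rather than from current law, and the sketch ignores prior-year exclusions, holding-period status, and exclusion percentages.

```javascript
// Rough illustration; names and fields are hypothetical. The 10,000,000 and
// 10x defaults are the figures quoted in the text, passed as parameters so
// they can be changed.
function perTaxpayerCap(adjustedBasis, flatCap = 10_000_000, basisMultiple = 10) {
  // Each taxpayer's cap is the greater of the flat amount or a multiple of basis.
  return Math.max(flatCap, basisMultiple * adjustedBasis);
}

function stackedExclusionCapacity(holders) {
  // Every holder (founder or non-grantor trust) is treated as its own taxpayer.
  return holders.reduce((sum, h) => sum + perTaxpayerCap(h.adjusted_basis), 0);
}

const holders = [
  { name: "founder", adjusted_basis: 200_000 },   // flat cap applies
  { name: "trust_1", adjusted_basis: 50_000 },    // flat cap applies
  { name: "trust_2", adjusted_basis: 1_500_000 }, // 10x basis exceeds the flat cap
];
console.log(stackedExclusionCapacity(holders)); // → 35000000
```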
Crypto positions are present but live in `assets.crypto.*` rather than mixed into the brokerage taxable-account schema, since the wash-sale rule does not currently apply to crypto (noting that this could change). Households with both traditional brokerage and crypto are tagged so you can run cross-asset analyses.
Parquet is recommended. The corpus has roughly 8,000 lots; analytical queries (filter to losses, group by holding period, aggregate by symbol) run in milliseconds against Parquet versus seconds against JSON. CSV is provided for SQL-warehouse ingestion (long-format with one row per lot).
Yes — the corpus is intentionally minimal-deviation (lots and balances are deterministic given the seed), so the same TLH algorithm run against the same corpus snapshot produces the same harvest schedule. This makes it suitable for regression testing across algorithm changes. The deviation rating ('Minimal') reflects this design intent.
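Given a deterministic corpus, a regression test can compare the harvest schedule from a new algorithm version against a stored baseline. The sketch below shows one simple approach: put both schedules in a canonical order and compare them as strings. The schedule entry shape (household_id, lot_id, action) and the function names are assumptions for illustration.

```javascript
// Hypothetical schedule shape: { household_id, lot_id, action }.
const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Sort entries and fix the key order so that equal schedules serialize identically.
function canonicalize(schedule) {
  const ordered = [...schedule]
    .sort((a, b) => cmp(a.household_id, b.household_id) || cmp(a.lot_id, b.lot_id))
    .map(({ household_id, lot_id, action }) => ({ household_id, lot_id, action }));
  return JSON.stringify(ordered);
}

function matchesBaseline(schedule, baseline) {
  return canonicalize(schedule) === canonicalize(baseline);
}

const baseline = [
  { household_id: "H001", lot_id: "L2", action: "sell" },
  { household_id: "H001", lot_id: "L1", action: "hold" },
];
const rerun = [
  { lot_id: "L1", action: "hold", household_id: "H001" },
  { lot_id: "L2", action: "sell", household_id: "H001" },
];
console.log(matchesBaseline(rerun, baseline)); // → true
```

Canonicalizing before comparing keeps the test from failing on harmless differences in entry order or key order while still catching any change in the decisions themselves.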
B16 focuses on the equity-comp side: RSU vesting calendars, ISO/NQSO exercise mechanics, AMT exposure, 83(b) elections. B02 focuses on the post-vesting taxable-account view — once shares are in the brokerage account, what does the lot graph look like for TLH purposes? Tax-tech buyers often purchase both: B16 for the comp-side, B02 for the harvesting-side.
150 households with detailed equity compensation: RSU vesting calendars, ISO/NQSO grants, ESPP with lookback, 83(b) elections, AMT exposure, and exercise window expirations. Each grant has a structured grant_type, vesting schedule, and vested-to-date calculation.
60 founder and early-employee households with QSBS Section 1202 attestation chains, 5-year holding period satisfaction, gross-asset tests at issuance, and stacking via non-grantor trusts. Includes IPO and secondary-sale liquidity events.
320 households with multi-state tax exposure: HCOL-to-LCOL relocations, dual residency, MA millionaires tax, NY/CA convenience-of-employer rules, and digital nomads with no fixed domicile. Each carries source-state allocation history and residency change events.
Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.