B02 · Tax Planning · Most Popular

Tax-Loss Harvesting Simulator

Tax-loss harvesting algorithms look elegant on a whiteboard. They turn into a maze the moment they meet real taxable accounts, where lots have specific acquisition dates, multiple cost-basis methods are in play, wash-sale windows span account boundaries, and QSBS attestations interact with non-grantor trust structures in non-obvious ways. The Tax-Loss Harvesting Simulator is a corpus of 350 households built for the engineers writing the algorithm and the platforms productising it: lot-level cost basis with specific-ID tracking, cross-account wash-sale flagging, and the QSBS Section 1202 attestation chain that direct-indexing platforms increasingly need to support.

Households: 350
Archetypes: 10
Formats: JSON, CSV, Parquet
Deviation: Minimal

Why this Data Set exists

Direct indexing has gone from boutique service to default offering at every major wealth platform in three years. The algorithms underneath are not commodities. The differences in TLH yield from one implementation to another can compound to 100+ basis points of after-tax return — the difference between a competitive product and an also-ran. But validating the algorithms requires test data with structural properties most synthetic data products skip: lot-level cost basis with realistic acquisition-date distributions, holdings that mathematically produce wash-sale conflicts at the right frequency, QSBS-eligible positions with actual gross-asset-test history, and household-level coordination across taxable and tax-deferred accounts.

Without this kind of fixture data, TLH algorithm development happens in a vacuum. You can't backtest. You can't reproduce a customer's reported issue. You can't run regression tests against a known-good baseline. The Tax-Loss Harvesting Simulator is the canonical fixture for this work.

Use Cases

Tax-loss harvesting algorithm backtesting
Direct-indexing platform validation
Wash-sale rule compliance
After-tax return optimization

Who uses this Data Set

Direct-Indexing Platform Engineer

Backtests TLH algorithms against 350 households with realistic lot-level cost basis, validating that the algorithm's wash-sale avoidance, holding-period optimization, and rebalancing trade-off logic produce expected after-tax return improvements vs. a no-TLH baseline.

Tax-Optimization SaaS Builder

Demos the platform's TLH capability to advisor and RIA prospects using a corpus that produces realistic harvest opportunities — neither artificially over-loaded with losses (unrealistic yield) nor under-loaded (boring demo).

Quantitative Researcher at a Robo-Advisor

Researches TLH yield sensitivity to portfolio characteristics (turnover, factor tilts, fund vs. stock holdings, household-level rebalancing constraints) using a corpus structured for parametric study rather than ad-hoc anecdote.

RegTech Engineer

Validates wash-sale rule compliance across the household's full account graph — taxable, IRA, spouse's accounts — ensuring the firm's automated trading systems don't generate disallowed losses that surface only at year-end on the 1099.

Founder Advisor with QSBS-Heavy Clients

Tests the firm's QSBS tracking and Section 1202 attestation workflow against households whose founder-stock positions span multiple trusts, demonstrating to clients that the firm can handle the complexity their accountants currently miss.

What's inside

The 350 households span ten archetypes selected for taxable-account complexity, including tech workers with vesting RSUs, dual-income professionals with brokerage accounts, peak earners with concentrated positions, HNW investors with multi-firm engagements, FIRE achievers with tax-aware withdrawal sequencing, real estate investors, and crypto-heavy investors whose digital-asset cost basis interacts with traditional brokerage holdings.

Every household has lot-level taxable-account holdings: each lot carries its acquisition date, original cost basis, current market value, holding-period status (short-term vs. long-term), and unrealized gain/loss. The cost-basis method is recorded per account (specific-ID is the dominant choice in this corpus, since it's the method sophisticated TLH requires). Wash-sale flags are pre-populated where applicable, both intra-account and cross-account (the disallowed-loss case where the household sells at a loss in a taxable account and repurchases in an IRA). Holding-period bucketing is structured for fast filter access. QSBS-eligible positions include the Section 1202 attestation chain: original issuance date, gross-asset-test history at issuance, holding-period satisfaction, and any stacking via non-grantor trusts.

The Data Set ships as JSON, CSV, and Parquet (Parquet is recommended for analytical work over the lot-level data — there are roughly 8,000 lots across the corpus, so Parquet's columnar layout makes ad-hoc queries fast). The WealthSynth Methodology PDF documents the lot-generation methodology, the calibration of unrealized-loss frequency against realistic market conditions, and the specific structural patterns each archetype is designed to exercise.

Preview a sample household

A redacted summary of one household from this Data Set — names, employers, exact balances, and metro area are stripped. Ages are bucketed, income and net worth are reported as bands. The full record (and all 350 like it) ships in the ZIP.

A-01·Young Family — First Home
representative archetype household
Household: Married Joint
State: FL
Gross income (band): $100k–$200k
Net worth (band):
Dependents: 1
Income source types: w2 salary, w2 bonus
Members (3):
  primary: age 35–39, healthcare
  spouse: age 40–44, healthcare
  dependent: age 10–14

Technical Highlights

Lot-level cost basis (specific-ID method)
Cross-account wash-sale window flags
QSBS Section 1202 attestation chain
Minimal deviation — deterministic backtests

Sample Schema Fields

sample_record.json
{
  "accounts.taxable.lots[].acquisition_date": <value>,
  "accounts.taxable.lots[].cost_basis": <value>,
  "accounts.taxable.lots[].unrealized_pnl": <value>,
  "taxes.wash_sale_flags": <value>,
  "taxes.qsbs_attestation": <value>
}

Sample queries

Find harvestable losses with no wash-sale conflict

Returns lots with negative unrealized P&L and no purchase of the same (or a substantially identical) security anywhere in the household's account graph within 30 days before or after the sale. This is the canonical TLH eligibility check.

households.flatMap(h =>
  h.accounts.taxable.lots.filter(lot =>
    lot.unrealized_pnl < 0 &&
    !h.taxes.wash_sale_flags.some(wf =>
      wf.symbol === lot.symbol &&
      Math.abs(wf.window_days) <= 30)
  )
)
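
The query above joins on the literal symbol, so it only covers the "same security" half of the check. A minimal sketch of one way to widen the match, assuming a hand-maintained lookup from ticker to an equivalence key. The `EQUIVALENCE_CLASS` table, its entries, and the `washSaleKey` and `harvestableLots` names are illustrative assumptions, not fields or functions in the Data Set, and whether two funds really are substantially identical is a tax judgment this code does not make:

```javascript
// Hypothetical lookup: tickers tracking the same index share one key.
// Entries are illustrative only; a production table needs tax review.
const EQUIVALENCE_CLASS = {
  VOO: "SP500",
  IVV: "SP500",
  SPY: "SP500",
};

// Fall back to the ticker itself when no equivalence class is known.
function washSaleKey(symbol) {
  return EQUIVALENCE_CLASS[symbol] ?? symbol;
}

// Same shape as the query above, but compares equivalence keys
// instead of raw symbols.
function harvestableLots(households) {
  return households.flatMap(h =>
    h.accounts.taxable.lots.filter(lot =>
      lot.unrealized_pnl < 0 &&
      !h.taxes.wash_sale_flags.some(wf =>
        washSaleKey(wf.symbol) === washSaleKey(lot.symbol) &&
        Math.abs(wf.window_days) <= 30)
    )
  );
}
```

With this variant, a loss lot in one S&P 500 ETF is excluded when a flag exists for a different-ticker S&P 500 ETF inside the window, while unrelated loss lots still pass.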

Identify QSBS positions approaching 5-year holding

Returns Section 1202 QSBS-eligible positions whose holding period is between 4 and 5 years — the window where holding-period planning matters most because the QSBS exclusion vests at year 5.

households.flatMap(h =>
  h.equity_comp.qsbs_attestation
    .filter(p => p.years_held >= 4 && p.years_held < 5)
)

Surface short-term to long-term transition opportunities

Returns lots with unrealized gains whose holding period crosses 365 days within the next 60 days, a planning window where deferring a sale converts a short-term gain into a long-term gain at a significantly lower tax rate.

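// daysBetween(start, end) and today() are date helpers supplied by the caller.
households.flatMap(h =>
  h.accounts.taxable.lots.filter(lot => {
    const days = daysBetween(lot.acquisition_date, today());
    // Held 305 to 364 days: within 60 days of the 365-day boundary, gains only.
    return days >= 305 && days < 365 && lot.unrealized_pnl > 0;
  })
)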
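
The holding-period query relies on two date helpers, `daysBetween` and `today`, that the page does not define. One possible sketch follows, assuming `acquisition_date` is stored as a `YYYY-MM-DD` string (an assumption; the actual encoding is documented with the Data Set). The `toUtcDay` helper and `MS_PER_DAY` constant are likewise hypothetical:

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Parse "YYYY-MM-DD" as UTC midnight so time zones and DST
// cannot shift the day count.
function toUtcDay(isoDate) {
  const [y, m, d] = isoDate.split("-").map(Number);
  return Date.UTC(y, m - 1, d);
}

// Whole days from `from` to `to` (both "YYYY-MM-DD" strings).
function daysBetween(from, to) {
  return Math.round((toUtcDay(to) - toUtcDay(from)) / MS_PER_DAY);
}

// Today's date as "YYYY-MM-DD" in UTC.
function today() {
  return new Date().toISOString().slice(0, 10);
}
```

For reproducible backtests it may be preferable to pass a fixed as-of date (for example the corpus snapshot date) in place of `today()`, so the same run always selects the same lots.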

Compute estimated TLH yield for a household

Returns, for each household, the unrealized losses available for harvest, capped at realized gains for the year plus the $3,000 annual ordinary-income offset. This is the loss the household can actually use this year, not the theoretical maximum; anything above the cap carries forward.

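// isWashSaleConflicted(lot, household) is a predicate supplied by the caller.
households.map(h => {
  // Sum losses on lots that are not blocked by a wash-sale conflict.
  const harvestable = h.accounts.taxable.lots
    .filter(l => l.unrealized_pnl < 0 &&
                 !isWashSaleConflicted(l, h))
    .reduce((s, l) => s + Math.abs(l.unrealized_pnl), 0);
  const realizedGains = h.taxes.realized_gains_ytd;
  // Usable this year: realized gains plus the $3,000 ordinary-income offset.
  const cap = realizedGains + 3000;
  return { id: h.id, tlh_yield: Math.min(harvestable, cap) };
})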
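
The yield query calls `isWashSaleConflicted`, which is not defined on this page. A minimal sketch of what such a predicate might look like, assuming it reuses the symbol and `window_days` test from the first sample query. This is one plausible definition, not the one the Data Set authors necessarily use:

```javascript
// Hypothetical predicate: a lot is conflicted when any wash-sale flag in
// the household names the same symbol within 30 days either side of the sale.
function isWashSaleConflicted(lot, household) {
  return household.taxes.wash_sale_flags.some(wf =>
    wf.symbol === lot.symbol && Math.abs(wf.window_days) <= 30);
}
```

Keeping the predicate separate from the yield computation lets a test suite swap in a stricter version (for example one that also checks substantially identical securities) without touching the cap logic.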

Edge cases modeled

Tax-loss-harvesting algorithms break on edge cases, not happy-path lots. The corpus is built to exercise the specific structural traps that show up in production — the ones that surface only at year-end on the 1099 and that engineering teams report as urgent bugs in January. Below is the enumeration: every case is grep-able in the household record at the field path shown, and the corpus-rate column reports the calibrated frequency across the 350 households.

Edge case: Cross-account wash-sale conflict
A taxable account sells a security at a loss; within the 30-day window the household's IRA, the spouse's IRA, or a joint brokerage account buys the same security. The loss is disallowed but few platforms detect it cross-account because the trades clear at different custodians. The corpus seeds both same-custodian and cross-custodian cases so detection logic gets tested against the harder one.
Field path: taxes.wash_sale_flags[]
Corpus rate: ~6% of harvestable lots

Edge case: Substantially-identical fund swap
The household sells one S&P 500 ETF at a loss and buys a different-ticker S&P 500 ETF the same day, expecting to dodge the wash-sale rule because the symbols differ. The loss can still be disallowed under the IRS substantially-identical test; naive symbol-match logic in the harvester misses it. The corpus includes ETF→ETF, mutual-fund→ETF, and ETF→index-future swap cases.
Field path: taxes.wash_sale_flags[].substantially_identical
Corpus rate: ~2% of would-be harvests

Edge case: Specific-ID lot election (vs FIFO / avg-cost)
TLH algorithms typically require specific-ID cost basis to select the highest-basis lot for the sale. The corpus records the elected cost-basis method per account; algorithm regression tests should exercise all three modes and confirm the algorithm degrades gracefully (or errors loudly) on non-specific-ID accounts.
Field path: accounts.taxable[].cost_basis_method
Corpus rate: ~75% specific-ID / 20% FIFO / 5% avg-cost

Edge case: Short-term to long-term holding-period boundary
A lot's holding period crosses 365 days and the gain/loss tax treatment changes overnight. Lots in the 305–365-day band are the planning window where deferring a sale can materially shift after-tax outcomes. The corpus distributes acquisition dates to land a meaningful share of lots inside the boundary window so the algorithm's timing logic actually gets tested.
Field path: accounts.taxable[].lots[].acquisition_date
Corpus rate: ~8% of lots within 60 days of crossing

Edge case: QSBS Section 1202 attestation chain
QSBS-eligible founder stock carries an attestation chain: original issuance date, gross-asset-test history at issuance, and (for stacked positions) the trust transfer events that multiplied the $10M-or-10x exclusion. Direct-indexing platforms are increasingly asked to track this; the corpus includes the full chain so the tracking logic can be validated end-to-end.
Field path: equity_comp.qsbs_attestation[]
Corpus rate: QSBS present in 22% of households

Edge case: QSBS stacking across non-grantor trusts
A founder gifts QSBS-eligible stock into one or more non-grantor trusts to multiply the Section 1202 exclusion. Each trust is its own taxpayer; the attestation chain has to record the original issuance, the transfer event, and the trust's grantor status. The corpus includes single-trust and multi-trust stacking cases, the structure tax advisors actually use.
Field path: equity_comp.qsbs_attestation[].stacking_trusts
Corpus rate: ~35% of QSBS positions are stacked

Edge case: QSBS approaching the 5-year vest
The Section 1202 exclusion vests at year 5. Positions in the 4-to-5-year holding-period band are the planning window where holding to the 5-year cliff matters most. The corpus seeds positions across the full holding-period range so the planning surface gets exercised, not just the trivially-eligible (>5 years) case.
Field path: equity_comp.qsbs_attestation[].years_held
Corpus rate: ~18% of QSBS positions in years 4–5

Edge case: Crypto-vs-brokerage wash-sale boundary
The wash-sale rule does not currently apply to crypto, though pending legislation would change that. Households with both traditional brokerage and crypto positions need the boundary handled correctly today and need the logic ready to flip on a regulatory change. The corpus tags dual-asset households so cross-asset analyses are tractable.
Field path: assets.crypto[] + accounts.taxable[]
Corpus rate: ~14% of households hold both

Edge case: Option-position wash-sale on the underlying
Buying call options on a stock the household just sold at a loss can trigger wash-sale on the underlying stock, a case detection logic frequently misses because the symbol-level join doesn't catch the derivative relationship. The corpus seeds these cases (calls, puts, LEAPS) so the harder cross-instrument check gets exercised.
Field path: accounts.taxable[].option_positions[]
Corpus rate: ~3% of would-be harvests blocked

Edge case: Realized-gains cap on the $3,000 ordinary offset
The $3,000 annual cap on ordinary-income offset interacts with realized capital gains: harvested losses first offset realized gains for the year (no cap), then offset up to $3,000 of ordinary income, then carry forward. Algorithms that compute TLH yield without modeling the cap-and-carryforward shape over-promise yield to the user. The corpus tracks realized gains YTD so the cap logic can be validated.
Field path: taxes.realized_gains_ytd
Corpus rate: Every household (cap logic universal)
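
The cap-and-carryforward ordering in the last edge case can be sketched as a small function. This is a simplified illustration, not the Data Set's validator: `applyHarvestedLosses` and `ORDINARY_OFFSET_CAP` are hypothetical names, negative year-to-date gains are clamped to zero, and short-term versus long-term netting and filing-status differences are ignored:

```javascript
// Annual ordinary-income offset described above.
const ORDINARY_OFFSET_CAP = 3000;

// Apply harvested losses in order: realized gains first (uncapped),
// then up to $3,000 of ordinary income, then carry the rest forward.
// Assumption: a negative realizedGainsYtd is treated as zero, which
// ignores losses already realized earlier in the year.
function applyHarvestedLosses(harvestedLosses, realizedGainsYtd) {
  const gains = Math.max(realizedGainsYtd, 0);
  const againstGains = Math.min(harvestedLosses, gains);
  const remaining = harvestedLosses - againstGains;
  const againstOrdinary = Math.min(remaining, ORDINARY_OFFSET_CAP);
  const carryforward = remaining - againstOrdinary;
  return { againstGains, againstOrdinary, carryforward };
}
```

For example, $12,000 of harvested losses against $5,000 of realized gains splits into $5,000 against gains, $3,000 against ordinary income, and $4,000 carried forward. A yield figure that reports the full $12,000 as usable this year is the over-promise the edge case warns about.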

Frequencies are calibrated against a 'normal' equity-market environment, not a stress year. The Methodology PDF documents the parameterized re-pricing approach for shifting the corpus into a stress regime (2008, 2022, etc.) and the corresponding edge-case frequency shifts. For each row, the field path is grep-able in the household JSON; regression-test fixtures should exercise every case at least once.

Methodology

Each household's taxable-account holdings are generated against archetype-specific portfolio profiles (RSU-heavy tech workers carry concentrated single-stock positions; HNW investors carry diversified multi-fund portfolios; crypto-heavy households carry significant digital-asset cost-basis structures). Lot acquisition dates are sampled to produce realistic holding-period distributions, including the tail of long-held positions with embedded gains. Unrealized loss frequency is calibrated to a 'normal' market environment — about 18% of lots show net unrealized losses at the corpus snapshot date, in line with typical equity volatility. Wash-sale conflicts are seeded at archetype-realistic rates: dual-account households (taxable + IRA at same firm) produce more cross-account conflicts than single-account households. QSBS attestation chains are generated with full Section 1202 traceability — original issuance, gross-asset-test history, and any stacking trust structures. The corpus passes the WealthSynth consistency validator (cost basis reconciles, wash-sale flags are mathematically consistent with the lot graph, QSBS attestation is internally complete) and the LLM-as-judge gate. Annual refresh re-runs against current-year market levels and any updated tax-rate parameters.

Included Archetypes (10)

Frequently asked questions

How is unrealized loss frequency calibrated?

About 18% of lots in the corpus show net unrealized losses at snapshot — calibrated to a 'normal' equity-market environment. The corpus is not designed to replicate a specific historical year (e.g. 2008 or 2022); it's a steady-state corpus for algorithmic backtesting. For market-stress backtesting, the Methodology PDF includes a parameterised re-pricing methodology you can apply to shift the corpus to a stress regime.

Are cross-account wash-sale conflicts realistic?

Yes — the corpus includes the meaningful cases: spouse's IRA holding the same security a taxable account just sold at a loss; substantially-identical fund swaps that fail the IRS test; option positions that trigger wash-sale on the underlying. About 6% of would-be-harvestable lots in the corpus are blocked by cross-account wash-sale conflicts, in line with empirical broker data.
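
A detector that a team might validate against the pre-populated flags could look roughly like the sketch below. It works on a flat list of household trades; the trade record shape (`account_id`, `symbol`, `side`, `date`, `realized_pnl`) is a hypothetical structure for illustration, not the Data Set's schema, and only literal-symbol matches are checked:

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Convert "YYYY-MM-DD" to a day count since the epoch (UTC).
const dayNumber = iso => Date.parse(iso + "T00:00:00Z") / MS_PER_DAY;

// For each loss sale, find buys of the same symbol in any OTHER account
// of the household within 30 days before or after the sale date.
function crossAccountConflicts(trades) {
  const lossSales = trades.filter(t => t.side === "sell" && t.realized_pnl < 0);
  const buys = trades.filter(t => t.side === "buy");
  return lossSales.flatMap(sale =>
    buys
      .filter(buy =>
        buy.symbol === sale.symbol &&
        buy.account_id !== sale.account_id &&
        Math.abs(dayNumber(buy.date) - dayNumber(sale.date)) <= 30)
      .map(buy => ({ sale, buy }))
  );
}
```

Because the join key is the household rather than the account, a repurchase in a spouse's IRA is caught the same way as one in the seller's own IRA. Same-account repurchases are deliberately left out here and would need a separate intra-account check.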

What cost-basis methods are represented?

Specific-ID is dominant (~75% of taxable accounts) because that's the method TLH algorithms typically require. FIFO accounts are present (~20%) and average-cost accounts (mostly mutual-fund-only households) are represented (~5%). The household record indicates the cost-basis method on each account.
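
A harvester has to behave differently under each of these methods. The sketch below shows one way to select lots for a sale under each mode and to fail loudly where lot selection is not possible. The method strings (`"specific_id"`, `"fifo"`, `"avg_cost"`), the `lot_id` and `quantity` fields, and the reading of `cost_basis` as a lot total are all assumptions for illustration; check the shipped schema for the real encodings:

```javascript
// Choose which lots to sell for `qty` shares under a cost-basis method.
//   "specific_id": highest cost per share first (largest loss per share sold)
//   "fifo":        oldest acquisition date first
//   "avg_cost":    lots are not individually selectable, so error loudly
function selectLots(lots, qty, method) {
  if (method !== "specific_id" && method !== "fifo") {
    throw new Error(`Lot selection not supported for method: ${method}`);
  }
  const perShare = lot => lot.cost_basis / lot.quantity;
  const ordered = [...lots].sort(method === "specific_id"
    ? (a, b) => perShare(b) - perShare(a)
    : (a, b) => a.acquisition_date.localeCompare(b.acquisition_date));

  const picks = [];
  let remaining = qty;
  for (const lot of ordered) {
    if (remaining <= 0) break;
    const take = Math.min(lot.quantity, remaining);
    picks.push({ lot_id: lot.lot_id, quantity: take });
    remaining -= take;
  }
  return picks;
}
```

Running the same sale request through both supported modes on a fixture account is a quick way to confirm that the harvester's realized loss changes with the elected method rather than silently assuming specific-ID.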

Does the QSBS attestation chain cover stacking?

Yes — about 35% of QSBS-eligible positions in the corpus are stacked across non-grantor trusts to multiply the $10M-or-10x exclusion. The structured attestation chain includes the trust's grantor status, the trust's original-issuance date, and the gift / sale event that transferred the QSBS-eligible stock into the trust. This matches the planning structure tax advisors actually use.
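
The arithmetic behind stacking can be sketched as follows, using the $10M-or-10x figure cited above. This is a rough illustration under stated assumptions: the `holders` array, its `adjusted_basis` and `is_grantor_trust` fields, and both function names are hypothetical, and the sketch ignores prior-year exclusions and any differences in the applicable limits by issuance date:

```javascript
const FLAT_CAP = 10 * 1000 * 1000; // the "$10M" side of the rule

// Per-taxpayer cap: the greater of $10M or 10x adjusted basis.
function exclusionCap(adjustedBasis) {
  return Math.max(FLAT_CAP, 10 * adjustedBasis);
}

// Each separate taxpayer gets its own cap. A grantor trust is treated as
// the grantor for income-tax purposes, so it adds no separate cap here.
function stackedCap(holders) {
  return holders
    .filter(h => !h.is_grantor_trust)
    .reduce((sum, h) => sum + exclusionCap(h.adjusted_basis), 0);
}
```

A founder plus two non-grantor trusts, each with basis well under $1M, yields a combined cap of $30M in this sketch, which is why the attestation chain needs to record each trust's grantor status.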

Are crypto holdings integrated with traditional brokerage cost basis?

Crypto positions are present but live in `assets.crypto.*` rather than mixed into the brokerage taxable-account schema, since the wash-sale rule does not currently apply to crypto (noting that this could change). Households with both traditional brokerage and crypto are tagged so you can run cross-asset analyses.

What's the right format for analytical work?

Parquet is recommended. The corpus has roughly 8,000 lots; analytical queries (filter to losses, group by holding period, aggregate by symbol) run in milliseconds against Parquet versus seconds against JSON. CSV is provided for SQL-warehouse ingestion (long-format with one row per lot).

Does this Data Set work as a regression-test fixture?

Yes — the corpus is intentionally minimal-deviation (lots and balances are deterministic given the seed), so the same TLH algorithm run against the same corpus snapshot produces the same harvest schedule. This makes it suitable for regression testing across algorithm changes. The deviation rating ('Minimal') reflects this design intent.
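
One way to use that determinism in a test suite is to reduce each harvest schedule to a canonical string and compare it with a stored baseline. The sketch below assumes a schedule is an array of entries with `household_id`, `lot_id`, and `sell_date` fields; that shape and the function names are hypothetical and depend on what your algorithm emits:

```javascript
// Canonical string for a harvest schedule. Entries are sorted so that
// output ordering alone cannot cause a false regression.
function scheduleFingerprint(schedule) {
  const rows = schedule
    .map(e => [e.household_id, e.lot_id, e.sell_date].join("|"))
    .sort();
  return JSON.stringify(rows);
}

// Throws when the current run differs from the known-good baseline.
function assertNoRegression(baseline, current) {
  if (scheduleFingerprint(baseline) !== scheduleFingerprint(current)) {
    throw new Error("Harvest schedule differs from baseline");
  }
}
```

Storing the baseline fingerprint alongside the corpus snapshot identifier keeps a changed schedule attributable to an algorithm change rather than to a data refresh.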

How does this differ from B16 (Equity Compensation)?

B16 focuses on the equity-comp side: RSU vesting calendars, ISO/NQSO exercise mechanics, AMT exposure, 83(b) elections. B02 focuses on the post-vesting taxable-account view — once shares are in the brokerage account, what does the lot graph look like for TLH purposes? Tax-tech buyers often purchase both: B16 for the comp-side, B02 for the harvesting-side.

$4,500
one-time purchase
350 households (ZIP)
Methodology PDF
JSON, CSV, Parquet formats
Account required to purchase

Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.
