
Cash Flow Stress Test Dataset

Most cash-flow planning tools demo well on a textbook salaried family with stable monthly income and predictable expenses. They fall over the moment they encounter the household types that actually need cash-flow planning: the freelancer with 6× income volatility, the single parent with a quarterly childcare bill, the gig worker whose Q1 income is half of Q3. The Cash Flow Stress Test Dataset is 270 synthetic households built specifically for those messy cases, with 96 monthly snapshots each, so your engine sees a full eight-year cash-flow trajectory through realistic shocks: job loss, medical emergency, expense spikes, divorce, distressed mortgages, and recovery.

Households: 270
Archetypes: 10
Formats: JSON, CSV, Parquet
Deviation: Minimal

Why this Data Set exists

If you're scoring liquidity adequacy or modeling income shocks, the populations that produce the most useful test signals are also the populations underrepresented in any sanitized prod dataset. Salaried W-2 workers dominate book-of-business data; variable-income earners, distressed borrowers, and post-divorce rebuilders are the long tail. Your model never gets exercised on them.

The second problem is temporal granularity. A point-in-time household snapshot can't stress-test a cash-flow engine — you need months of history with realistic seasonal patterns, irregular bills, and shock events that propagate forward through savings depletion, credit draw-down, and expense compression. Most synthetic data products give you the snapshot and stop there.

This Data Set solves both. The 270-household population is weighted toward variable-income, low-liquidity, and shock-exposed cohorts. Every household carries 96 monthly snapshots — the same longitudinal contract used by the WealthSynth Master Corpus — so your engine sees full trajectories, not just balance-sheet stills.

Use Cases

Cash-flow planning engine testing
Emergency fund adequacy scoring
Income shock scenario modeling
Liquidity & default risk assessment

Who uses this Data Set

Cash-Flow Planning Engine Engineer

Validates that the planning engine's monthly projection logic handles seasonal income, irregular expenses, and shock events without silently smoothing them away. The 96-month trajectories let regression tests catch numerical drift across long horizons.

Emergency Fund Adequacy Modeler

Trains and validates 'months-of-expenses' liquidity scoring against synthetic households whose liquidity buffer is actually tested by a labeled shock event in the 96-month series, instead of deriving expected behaviour from a steady-state assumption.

Default Risk / Underwriting Data Scientist

Uses pre-labeled income-shock and liquidity-stress events as supervised training data for default-risk models, including the rare combinations (variable income + distressed mortgage + thin emergency fund) that drive most actual defaults.

Fintech PM building a budgeting product

Demos the product's value proposition to investors and prospects using households that look like the target market (gig workers, single parents, freelancers) without using real customer data, which removes the 'we need to wait for production data' chicken-and-egg problem.

Compliance Analyst at a Lender

Tests the firm's ability-to-repay assessment process against household types regulators have flagged in fair-lending reviews, ensuring variable-income borrowers aren't systematically scored worse than their actual repayment capacity warrants.

What's inside

Each of the 270 households is drawn from one of ten cash-flow-relevant archetypes spanning gig workers, single parents, distressed mortgages, post-divorce rebuilders, and artists with royalty income. The mix is intentional — 40% variable-income, 25% currently in financial stress (stress flags pre-set), 20% underwater on a major liability, and 15% transitioning out of a recent shock. The blended population is calibrated to surface the cash-flow patterns that drive real liquidity events.

Every household carries the full 96-month longitudinal track that's standard across the WealthSynth corpus: monthly net cash flow, savings rate, account balances, credit utilization, and an event log of shocks (job loss, medical, expense spike, income drop). Stress scenarios are pre-labeled — you can filter for the post-divorce rebuild trajectories, the variable-income recovery curves after gig-platform deactivation, or the medical-debt crisis populations specifically. Income volatility percentage is computed against the household's own 96-month series so it's calibrated, not assumed.

The Data Set ships as JSON (one file per household with embedded longitudinal array, plus a manifest), CSV (long-format with one row per household-month so it's join-friendly with your warehouse), and Parquet (columnar; recommended for analytical queries over the 25,920-row month grid). The WealthSynth Methodology PDF documents the longitudinal generation methodology, the shock-event taxonomy, and the field-by-field derivation of stress scenarios.
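
As a rough illustration of how the nested JSON layout relates to the long-format CSV and Parquet layout, the sketch below flattens household records into one row per household-month. The `household_id` and `month` keys and the sample values are assumptions made for illustration; the manifest and Methodology PDF are the authority on actual field names.

```javascript
// Flatten nested household records into one row per household-month.
// Assumed (hypothetical) shape:
//   { household_id, longitudinal: { monthly: [{ month, net_cash_flow, savings_rate }] } }
function toLongFormat(households) {
  return households.flatMap(h =>
    h.longitudinal.monthly.map(m => ({
      household_id: h.household_id, // repeat the household key on every row
      ...m                          // copy the monthly fields as columns
    }))
  )
}

// Tiny stand-in data: 2 households x 3 months (the shipped files carry 96 months each).
const sample = [
  { household_id: 'H1', longitudinal: { monthly: [
    { month: 0, net_cash_flow: 120, savings_rate: 0.04 },
    { month: 1, net_cash_flow: -300, savings_rate: -0.10 },
    { month: 2, net_cash_flow: 80, savings_rate: 0.03 }
  ] } },
  { household_id: 'H2', longitudinal: { monthly: [
    { month: 0, net_cash_flow: 900, savings_rate: 0.15 },
    { month: 1, net_cash_flow: 40, savings_rate: 0.01 },
    { month: 2, net_cash_flow: -60, savings_rate: -0.02 }
  ] } }
]

const rows = toLongFormat(sample)
console.log(rows.length) // 6
```

Applied to the full Data Set, the same flattening yields the 270 × 96 = 25,920 rows of the month grid.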

Preview a sample household

A redacted summary of one household from this Data Set — names, employers, exact balances, and metro area are stripped. Ages are bucketed, income and net worth are reported as bands. The full record (and all 270 like it) ships in the ZIP.

F-02 · Gig Economy Starter (representative archetype household)
Household: Single
State: NJ
Gross income (band): $50k–$100k
Net worth (band):
Dependents: 0
Income source types: self employment, w2 bonus
Members (1): primary, age 20–24, professional services

Technical Highlights

96 monthly snapshots per household
Pre-built income shock scenarios
Seasonal pattern realism
Includes 1099 / variable-income households

Sample Schema Fields

sample_record.json
{
  "longitudinal.monthly[].net_cash_flow": <value>,
  "longitudinal.monthly[].savings_rate": <value>,
  "stress.scenarios[]": <value>,
  "liquidity.months_of_expenses": <value>,
  "cash_flow.income_volatility_pct": <value>
}
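
The dotted paths above can be read as lookups into a nested record, with `[]` meaning "every element of this array". The helper below is a hypothetical reading aid, not part of any WealthSchema tooling, and the record values are invented stand-ins.

```javascript
// Resolve a dotted schema path such as "longitudinal.monthly[].net_cash_flow".
// A segment ending in "[]" applies the rest of the path to every array element.
function resolvePath(record, path) {
  const walk = (value, segments) => {
    if (segments.length === 0 || value == null) return value
    const [head, ...rest] = segments
    if (head.endsWith('[]')) {
      const items = value[head.slice(0, -2)] ?? []
      return items.map(item => walk(item, rest))
    }
    return walk(value[head], rest)
  }
  return walk(record, path.split('.'))
}

// Invented stand-in record; real records carry 96 monthly entries.
const record = {
  liquidity: { months_of_expenses: 1.4 },
  longitudinal: { monthly: [{ net_cash_flow: 250 }, { net_cash_flow: -900 }] },
  stress: { scenarios: ['job_loss'] } // assumes scenarios are plain strings
}

const months = resolvePath(record, 'liquidity.months_of_expenses')           // 1.4
const flows = resolvePath(record, 'longitudinal.monthly[].net_cash_flow')    // [250, -900]
const scenarios = resolvePath(record, 'stress.scenarios[]')                  // ['job_loss']
```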

Sample queries

Find variable-income households with thin emergency funds

Surfaces households whose income volatility is above 30% AND who have less than two months of expenses in liquid reserves — the canonical 'one bad month from default' profile.

households.filter(h =>
  h.cash_flow.income_volatility_pct > 0.30 &&
  h.liquidity.months_of_expenses < 2
)

Identify post-shock recovery trajectories

Returns households where a labeled shock event (job loss, medical, divorce) occurred in the first 24 months of the longitudinal series, so you can see the recovery curve in months 25–96.

households.filter(h =>
  h.events.life_events.some(e =>
    e.month <= 24 &&
    ['job_loss', 'medical', 'divorce'].includes(e.type))
)

Surface seasonal income patterns

Computes the coefficient of variation of monthly income for each household, then returns the top decile — useful for testing your engine's handling of artists, royalty earners, and seasonal-business owners.

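const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length
// population standard deviation
const stddev = xs => { const mu = mean(xs); return Math.sqrt(mean(xs.map(x => (x - mu) ** 2))) }

households
  .map(h => {
    const income = h.longitudinal.monthly.map(m => m.income)
    return { h, cv: stddev(income) / mean(income) }
  })
  .sort((a, b) => b.cv - a.cv)                  // highest volatility first
  .slice(0, Math.ceil(households.length / 10))  // keep the top decile

Detect savings-depletion → credit-draw transitions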

Returns months where liquid savings dropped below one month of expenses AND credit-card balance increased — the inflection point where a household pivots from saving to borrowing.

households.flatMap(h =>
  h.longitudinal.monthly.filter((m, i, arr) =>
    m.savings_balance < m.monthly_expenses &&
    i > 0 &&
    m.credit_card_balance > arr[i-1].credit_card_balance)
)

Methodology

The 270-household population is generated from the WealthSynth pipeline against ten cash-flow-stress-relevant archetypes. Each household begins with a baseline financial snapshot, then runs through a longitudinal generator that produces 96 monthly entries respecting income seasonality, expense irregularity, and shock-event triggers. Income volatility is sampled from archetype-specific distributions (gig workers ~40% CV, salaried W-2 ~5% CV) so the realism is calibrated rather than uniform. Shock events are seeded at probabilistic intervals consistent with each archetype's empirically observed life-event base rates. Every record passes the WealthSynth consistency validator (monthly cash flow reconciles with balance-sheet deltas; shock events propagate forward through savings, credit, and behavioral fields) and the LLM-as-judge quality gate. Annual refresh re-runs against current minimum wage, gig-platform fee schedules, and unemployment benefit levels.
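
The reconciliation rule mentioned above (monthly cash flow reconciles with balance-sheet deltas) can be pictured with a check like the one below. This is a deliberately simplified sketch, not the WealthSynth validator: it assumes all net cash flow lands in a single `savings_balance` field, and the tolerance is an arbitrary placeholder.

```javascript
// Simplified reconciliation check (an illustration, not the WealthSynth validator).
// Assumption: each month's change in savings_balance should equal that month's net_cash_flow.
function findReconciliationBreaks(monthly, tolerance = 0.01) {
  const breaks = []
  for (let i = 1; i < monthly.length; i++) {
    const delta = monthly[i].savings_balance - monthly[i - 1].savings_balance
    if (Math.abs(delta - monthly[i].net_cash_flow) > tolerance) {
      breaks.push({ index: i, delta, net_cash_flow: monthly[i].net_cash_flow })
    }
  }
  return breaks
}

// Invented four-month series with one deliberate mismatch.
const series = [
  { savings_balance: 5000, net_cash_flow: 0 },
  { savings_balance: 5400, net_cash_flow: 400 },   // reconciles
  { savings_balance: 4100, net_cash_flow: -1300 }, // reconciles
  { savings_balance: 4300, net_cash_flow: 500 }    // balance rose 200, flow says 500
]

const breaks = findReconciliationBreaks(series)
console.log(breaks.length) // 1
```

A production check would also need to account for flows into credit lines and other accounts, which this sketch ignores.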

Included Archetypes (10)

Frequently asked questions

How is income volatility calculated in this Data Set?

Each household's `cash_flow.income_volatility_pct` is computed as the coefficient of variation (standard deviation ÷ mean) of monthly gross income over the household's 96-month longitudinal series. It is calibrated against the household's own history, not assumed from archetype defaults. Because the coefficient of variation is unitless, the figure is comparable across households with very different income levels.

What counts as a 'shock event' in the labeled events?

Six categories are pre-labeled: job_loss, medical (defined as a single-month medical expense exceeding 50% of monthly income), divorce, expense_spike (a single-month expense exceeding 200% of trailing 12-month average), income_drop (a 6-month moving average drop greater than 20%), and family_caregiver_onset. Other life events (relocation, marriage, child) are also tracked but not classified as 'shocks' — they're in `events.life_events` with their own taxonomy.
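
The three threshold-based definitions can be sketched as simple predicates over a monthly series. The thresholds come from the answer above; the window conventions (whether the trailing average excludes the current month, and which two six-month windows are compared) are assumptions, and the Methodology PDF should be treated as the reference.

```javascript
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length

// expense_spike: this month's expenses exceed 200% of the trailing 12-month average.
// Assumption: the trailing window excludes the current month and needs 12 full months.
function isExpenseSpike(expenses, i) {
  if (i < 12) return false
  return expenses[i] > 2.0 * mean(expenses.slice(i - 12, i))
}

// income_drop: the 6-month moving average fell by more than 20%.
// Assumption: compare the 6 months ending at i with the 6 months before those.
function isIncomeDrop(income, i) {
  if (i < 11) return false
  const recent = mean(income.slice(i - 5, i + 1))
  const prior = mean(income.slice(i - 11, i - 5))
  return prior > 0 && (prior - recent) / prior > 0.20
}

// medical: a single month's medical expense exceeds 50% of that month's income.
function isMedicalShock(medicalExpense, monthlyIncome) {
  return medicalExpense > 0.5 * monthlyIncome
}

// Invented series: twelve flat months of expenses, then one large month.
const expenses = [...Array(12).fill(2000), 4500]
// Invented series: six months at 5000, then six months at 3800 (a 24% drop).
const income = [...Array(6).fill(5000), ...Array(6).fill(3800)]

console.log(isExpenseSpike(expenses, 12)) // true
console.log(isIncomeDrop(income, 11))     // true
console.log(isMedicalShock(1800, 4000))   // false
```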

Can I use this for a credit-decisioning model?

Yes — that's a primary use case. The Data License explicitly permits training and validation of credit and underwriting models. Note that the synthetic population is designed to over-represent variable-income and shock-exposed households, so model performance metrics computed on this Data Set should not be extrapolated to a general lending portfolio without re-weighting.
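
One simple way to picture the re-weighting step is to blend a per-cohort metric under two different cohort mixes. In the sketch below the Data Set mix comes from the population breakdown in "What's inside" and is assumed to describe mutually exclusive cohorts; the cohort key names, the portfolio mix, and the metric values are invented placeholders.

```javascript
// Cohort mix of this Data Set (from the population breakdown in "What's inside").
// Assumption: the four cohorts are mutually exclusive. Key names are hypothetical.
const datasetShare = { variable_income: 0.40, in_stress: 0.25, underwater: 0.20, post_shock: 0.15 }

// Placeholders: the cohort mix of your own portfolio and a per-cohort metric (e.g. recall).
const portfolioShare = { variable_income: 0.60, in_stress: 0.10, underwater: 0.10, post_shock: 0.20 }
const metricByCohort = { variable_income: 0.80, in_stress: 0.70, underwater: 0.60, post_shock: 0.90 }

// Weighted average of a per-cohort metric under a given cohort mix.
function blendedMetric(metric, share) {
  const cohorts = Object.keys(metric)
  const totalShare = cohorts.reduce((sum, c) => sum + share[c], 0)
  return cohorts.reduce((sum, c) => sum + metric[c] * share[c], 0) / totalShare
}

const onDataset = blendedMetric(metricByCohort, datasetShare)     // about 0.75
const onPortfolio = blendedMetric(metricByCohort, portfolioShare) // about 0.79
```

Re-weighting of this kind only covers cohorts that appear in the Data Set; segments of a lending portfolio with no counterpart here need their own evaluation data.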

Does the longitudinal data start from a fixed calendar date?

No. Each household's 96-month series is anchored to a relative month_0 rather than a calendar date, so the corpus doesn't bake in COVID-era assumptions or any specific economic regime. Refreshed corpus versions retain this design — calendar-anchored data is available on request for backtesting against specific historical periods.
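
If you want to line the relative series up against a calendar for your own reporting, a small mapping helper is enough. The sketch assumes month indices are zero-based, non-negative integer offsets from month_0; the anchor dates are arbitrary examples. Pinning the synthetic series to dates this way does not add any period-specific economic effects.

```javascript
// Map a relative month index (0 = month_0) onto a calendar month for a chosen anchor.
// Assumption: offsets are zero-based, non-negative integers. anchorMonth is 1-12.
function toCalendarMonth(anchorYear, anchorMonth, offset) {
  const total = (anchorMonth - 1) + offset // months since January of anchorYear
  const year = anchorYear + Math.floor(total / 12)
  const month = (total % 12) + 1
  return `${year}-${String(month).padStart(2, '0')}`
}

console.log(toCalendarMonth(2016, 1, 0))  // "2016-01"
console.log(toCalendarMonth(2016, 1, 95)) // "2023-12"
console.log(toCalendarMonth(2015, 11, 3)) // "2016-02"
```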

How does this differ from B13 (Mortgage Stress Test)?

B13 focuses on mortgage-specific stress (DTI, LTV, modification eligibility, forbearance) on a smaller 90-household corpus. This Data Set (B04) is broader: it covers household-level cash flow across multiple liability types and uses the longitudinal contract to expose engines to multi-month dynamics. Many buyers purchase both: B04 first to validate cash-flow logic, then B13 to focus on mortgage decisioning specifically.

What's the right format for warehouse ingestion?

For analytical queries spanning the full 25,920-row month grid (270 households × 96 months), Parquet is recommended — it's columnar and compresses the longitudinal time series efficiently. CSV is also long-format (one row per household-month) and works well in Snowflake or BigQuery. JSON nests the longitudinal array inside each household record, which is best for record-by-record processing but worse for analytical scans.

How often do shock events occur in the corpus?

Approximately 70% of households experience at least one labeled shock event in the 96-month window; about 30% experience two or more. This is intentionally elevated relative to a general population (where multi-shock households are rarer) because the Data Set is designed to stress-test handling of these events. Shock-frequency calibration is documented in the Methodology PDF.

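Are there households with NO shocks?

Yes, about 30% of the corpus. These provide the negative-class examples needed for shock-detection models and serve as a control group for behavioural comparisons. Because `events.life_events` also holds non-shock events such as relocation or marriage, filtering on `events.life_events.length === 0` alone would miss shock-free households that have other life events. To retrieve them, filter for households with no entry of a shock type in `events.life_events`.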
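
The shock-frequency figures quoted in the last two answers can be checked against a downloaded copy with a short tally. The sketch assumes shock events sit in `events.life_events` next to non-shock life events and carry one of the six shock type labels listed earlier in this FAQ; the four sample households are invented.

```javascript
// The six pre-labeled shock categories from the FAQ.
const SHOCK_TYPES = ['job_loss', 'medical', 'divorce', 'expense_spike', 'income_drop', 'family_caregiver_onset']

// Share of households with no shock, at least one shock, and two or more shocks.
// Assumption: shocks and other life events share the events.life_events array.
function shockFrequency(households) {
  const counts = households.map(h =>
    h.events.life_events.filter(e => SHOCK_TYPES.includes(e.type)).length)
  const share = pred => counts.filter(pred).length / counts.length
  return {
    none: share(c => c === 0),
    atLeastOne: share(c => c >= 1),
    twoOrMore: share(c => c >= 2)
  }
}

// Invented stand-in households.
const demo = [
  { events: { life_events: [] } },
  { events: { life_events: [{ type: 'job_loss', month: 14 }] } },
  { events: { life_events: [{ type: 'medical', month: 30 }, { type: 'divorce', month: 52 }] } },
  { events: { life_events: [{ type: 'relocation', month: 8 }] } } // life event, but not a shock
]

console.log(shockFrequency(demo)) // { none: 0.5, atLeastOne: 0.5, twoOrMore: 0.25 }
```

Note that the fourth household counts as shock-free even though its event list is not empty.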


$4,000
one-time purchase
270 households (ZIP)
Methodology PDF
JSON, CSV, Parquet formats
Account required to purchase

Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.
