CRA / Underserved Lending Pack

Inclusive lending sounds like an aspiration until you sit down to build a model and discover that your historical training data systematically excludes the populations you're trying to serve. The customers who could have repaid you aren't in the loan tape, because you never extended credit to them. The ones you did approve are the modal prospects who fit the existing scorecard, so every iteration of model training reinforces the same selection bias. The CRA / Underserved Lending Pack is 100 households deliberately built to break that loop: thin-file profiles, post-bankruptcy rebuilders, and ITIN filers, each carrying the alternative data signals and structural data your model needs to score them on actual repayment capacity rather than on whether they look like your existing customer base.

Households: 100
Archetypes: 10
Formats: JSON, CSV
Deviation: Moderate

Why this Data Set exists

Community Reinvestment Act compliance is one driver of inclusive-lending investment, but it's not the dominant one. The dominant driver is that the underserved segments are growth markets — first-generation wealth-building families, recent immigrants, gig-economy workers, post-bankruptcy rebuilders. Together they represent tens of millions of households that traditional credit scoring under-prices. The lender who solves the data problem captures the segment.

The data problem is specific. Traditional bureau data produces a very thin file (or no file) for these segments: tradelines are short, balances are low, history is sparse. The signal-to-noise ratio is too low for a logistic-regression credit model to learn from. What predicts repayment in these segments is alternative data: rent payment history, utility payment patterns, remittance flow, mobile-phone payment regularity, employer continuity, banking-relationship age. Most lenders don't have any of this in test fixtures, so model development is starved before it starts.

This Data Set provides the alternative data: 100 households across the canonical underserved profiles (thin-file ITIN filers, post-bankruptcy rebuilders, recent immigrants, low-income working families, disabled veterans, cannabis-industry workers), each with the structured alternative-data signals (utility payment history, rent payment history, cash-income estimates, remittance flows) that actually predict repayment.

Use Cases

CRA-aligned lending model training
Thin-credit underwriting alternatives
ITIN-filer mortgage products
Second-chance lending programs

Who uses this Data Set

Underwriting Data Scientist at a CDFI

Trains and validates an alternative-data credit model on a corpus structured for the segment, ensuring the model exhibits acceptable AUC on the populations the institution is mandated to serve, not just on the modal mainstream borrowers.

CRA Compliance Lead at a Bank

Demonstrates to examiners that the bank's lending products are designed and tested specifically for low-to-moderate-income census tracts, with realistic test cases that exercise the inclusive-lending decision logic.

Fintech Builder Targeting Immigrant Households

Validates the product's onboarding and credit logic against ITIN filers and recent immigrants, ensuring KYC paths work and credit decisions don't silently fail for the prospects the product is positioned for.

Fair Lending Audit Analyst

Tests the firm's lending decisions for disparate outcomes across the pre-labelled archetypes and evaluates the decision logic against ECOA-compliant outputs. Protected-class indicators are not part of the default household record; they appear only in the conditional privacy overlay described in the FAQ.

Second-Chance Lending Product Manager

Builds the post-bankruptcy rebuild lending product with realistic 24-month-out-of-discharge profiles, validating that the credit model rewards the rebuild trajectory rather than punishing the historical bankruptcy indefinitely.

What's inside

The 100 households span ten archetypes covering the canonical underserved profiles: F-04 First-Generation Wealth Builders, A-02 Single Parents, S-02 Bankruptcy Recovery, U-01 Unbanked / Recently Banked, U-02 Low-Income Working Families, U-03 Recent Immigrants, B-01 Financial Anxiety / Avoiders, MV-03 Disabled Veterans, N-04 Cannabis-Industry Workers, and X-04 Neurodiverse / Disability Households.

Every household carries the alternative data signals that drive repayment prediction in these segments: 24+ months of rent payment history, utility payment regularity, mobile-phone billing relationships, banking-relationship age (some are recently banked from prepaid-card-only history), remittance outflow patterns where applicable, and cash-income estimation methodology with confidence intervals. Credit data uses thin-file flagging where appropriate; about 35% of the corpus is intentionally thin-file or no-file. Post-bankruptcy households carry a structured rebuild trajectory with re-establishment milestones (secured card, first new tradeline, mortgage re-qualification eligibility).
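The thin-file share described above can be sanity-checked after loading the corpus. The following is a minimal sketch, assuming the households are already parsed into an array of nested objects and that `credit.thin_file_flag` is a boolean; the `thinFileShare` helper and the sample values are illustrative, not part of the Data Set.

```javascript
// Hypothetical helper: fraction of households flagged as thin-file.
// Assumes each household exposes credit.thin_file_flag as a boolean.
function thinFileShare(prospects) {
  if (prospects.length === 0) return 0;
  const thin = prospects.filter(p => p.credit?.thin_file_flag === true).length;
  return thin / prospects.length;
}

// Illustrative records only; not taken from the Data Set.
const sample = [
  { credit: { thin_file_flag: true } },
  { credit: { thin_file_flag: false } },
  { credit: { thin_file_flag: false } },
  { credit: { thin_file_flag: true } },
];

console.log(thinFileShare(sample)); // 0.5
```

Run against the full corpus, the result should land near the 35% figure quoted above if the flag is populated as described.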

The Data Set ships as JSON and CSV. The WealthSynth Methodology PDF documents the alternative-data taxonomy, the cash-income estimation methodology, the calibration sources for the underserved-segment profiles (CFPB consumer studies, FDIC unbanked surveys, Pew immigration data), and the specific use cases each archetype is designed to exercise.

Preview a sample household

A redacted summary of one household from this Data Set — names, employers, exact balances, and metro area are stripped. Ages are bucketed, income and net worth are reported as bands. The full record (and all 100 like it) ships in the ZIP.

F-04 · First-Generation Wealth Builder (representative archetype household)
Household: Single
State: WI
Gross income (band): <$50k
Net worth (band):
Dependents: 0
Income source types: w2 salary, w2 bonus
Members (1): primary, age 25–29, technology

Technical Highlights

Alternative data signal taxonomy
ITIN/SSN/no-documentation flags
Post-bankruptcy rebuild trajectory
Cash-income estimation methodology

Sample Schema Fields

sample_record.json
{
  "credit.alternative_data_signals": <value>,
  "credit.thin_file_flag": <value>,
  "demographics.itin_filer": <value>,
  "income.cash_income_estimate": <value>,
  "credit.post_bankruptcy_recovery_score": <value>
}
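
The sample above lists fields as dotted paths, while the sample queries below read nested objects (for example `p.credit.thin_file_flag`). If a flat export such as a CSV row uses dotted column names (an assumption, not something this page confirms), a small helper can convert each row into the nested shape. The `unflatten` function and the row values here are hypothetical.

```javascript
// Hypothetical helper: turn dotted keys ("credit.thin_file_flag")
// into nested objects ({ credit: { thin_file_flag: ... } }).
function unflatten(flat) {
  const nested = {};
  for (const [path, value] of Object.entries(flat)) {
    const keys = path.split(".");
    let node = nested;
    // Walk (and create) every intermediate object along the path.
    for (const key of keys.slice(0, -1)) {
      if (typeof node[key] !== "object" || node[key] === null) node[key] = {};
      node = node[key];
    }
    node[keys[keys.length - 1]] = value;
  }
  return nested;
}

// Illustrative values only; not taken from the Data Set.
const row = {
  "credit.thin_file_flag": true,
  "demographics.itin_filer": false,
  "credit.post_bankruptcy_recovery_score": 0,
};

const household = unflatten(row);
console.log(household.credit.thin_file_flag); // true
```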

Sample queries

Identify thin-file prospects with strong alternative signals

Returns prospects whose traditional credit file is thin (fewer than 4 tradelines or no FICO) but whose alternative data shows reliable rent + utility + employer continuity for 24+ months — the highest-conversion underwriting opportunity in the segment.

prospects.filter(p =>
  p.credit.thin_file_flag &&
  p.credit.alternative_data_signals.rent_on_time_24mo >= 22 &&
  p.credit.alternative_data_signals.utility_on_time_24mo >= 22 &&
  p.income.employer_tenure_months >= 24
)

Surface ITIN-filer mortgage-product candidates

Returns ITIN-filing households with sufficient documented income and savings to qualify for an ITIN mortgage product, including the alternative data needed to underwrite without traditional credit history.

prospects.filter(p =>
  p.demographics.itin_filer &&
  p.income.documented_annual >= 40000 &&
  p.assets.liquid_savings >= p.income.documented_annual * 0.05
)

Track post-bankruptcy rebuild trajectories

Returns households 18+ months past bankruptcy discharge with a non-zero rebuild score, sorted by months-since-discharge — a queue for second-chance lending product offers.

prospects.filter(p =>
  p.credit.bankruptcy_history?.months_since_discharge >= 18 &&
  p.credit.post_bankruptcy_recovery_score > 0
).sort((a, b) =>
  a.credit.bankruptcy_history.months_since_discharge -
  b.credit.bankruptcy_history.months_since_discharge
)

Compute cash-income proxy quality

Returns households where documented income is supplemented by an estimated cash-income component, with the confidence interval on the estimate — useful for cash-economy underwriting where formal documentation is incomplete.

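// Filter first so households without a cash-income estimate are skipped
// instead of throwing on the property access.
prospects
  .filter(p => p.income.cash_income_estimate?.value > 0)
  .map(p => ({
    id: p.id,
    documented: p.income.documented_annual,
    cash_estimate: p.income.cash_income_estimate.value,
    ci_low: p.income.cash_income_estimate.ci_low,
    ci_high: p.income.cash_income_estimate.ci_high
  }))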
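
One way to turn the query above into a quality measure is to rank estimates by the relative width of their confidence interval, `(ci_high - ci_low) / value`. This is a sketch under the assumption that narrower relative intervals indicate more usable estimates; the `cashEstimateQuality` helper and the demo records are hypothetical.

```javascript
// Hypothetical helper: rank cash-income estimates by relative CI width,
// narrowest (most precise) first. Households without an estimate are skipped.
function cashEstimateQuality(prospects) {
  return prospects
    .filter(p => (p.income?.cash_income_estimate?.value ?? 0) > 0)
    .map(p => {
      const { value, ci_low, ci_high } = p.income.cash_income_estimate;
      return {
        id: p.id,
        cash_estimate: value,
        relative_width: (ci_high - ci_low) / value,
      };
    })
    .sort((a, b) => a.relative_width - b.relative_width);
}

// Illustrative records only; not taken from the Data Set.
const demo = [
  { id: "h1", income: { cash_income_estimate: { value: 10000, ci_low: 8000, ci_high: 12000 } } },
  { id: "h2", income: { cash_income_estimate: { value: 5000, ci_low: 4500, ci_high: 5500 } } },
  { id: "h3", income: {} },
];

console.log(cashEstimateQuality(demo).map(x => x.id)); // [ 'h2', 'h1' ]
```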

Methodology

Each household is generated against archetype-specific distributions sourced from CFPB consumer surveys, FDIC unbanked/underbanked data, and Pew demographic studies on immigrant financial behaviour. Alternative data signals are calibrated against published utility-payment and rent-payment behavioural datasets (Experian RentBureau, Equifax NCTUE) so the realism extends to the noise structure, not just the means. Cash-income estimates use a structured methodology (sampling-based with reported confidence intervals) rather than a deterministic guess, since cash-economy income is genuinely uncertain. The corpus passes the WealthSynth consistency validator and the LLM-as-judge quality gate, with additional review by a fair-lending consultant to verify the alternative-data structures align with current CFPB and OCC guidance on inclusive underwriting. Annual refresh tracks regulatory updates and any changes in alternative-data acceptability for credit decisions.

Frequently asked questions

Is this Data Set CRA-compliant for examination purposes?

The Data Set is calibrated to align with CRA examination focus areas (LMI lending, alternative data acceptability, second-chance lending product design), but a regulator's view of CRA compliance depends on the bank's actual lending activity, not the data used to design it. The Data Set serves as a development and testing tool; CRA-credit-eligibility for actual loans depends on borrower facts.

How are protected-class fields handled in the corpus?

Protected-class indicators (race, ethnicity, religion) are not in the default household record. They appear only in the conditional privacy overlay, which is populated for B08 and B26 households per the WealthSynth privacy contract. Households in this Data Set (B23) do not carry these fields by default, so fair-lending testing on this corpus is intentionally agnostic to protected class.

Are alternative data signals from real consumers?

No. All data is synthetic. The signals are calibrated against published distributions from Experian RentBureau, Equifax NCTUE, and CFPB studies, but every individual household record is generated from those distributions — not derived from any real consumer.

Can I use this to validate ECOA fair-lending compliance?

Yes — the corpus is designed for fair-lending testing, including ECOA, the FHA, and CFPB UDAAP focus areas. The pre-labelled archetype taxonomy lets you run the firm's adverse-action and decision-explanation logic against the populations regulators most often cite in fair-lending reviews.
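
A segment-level screen along these lines might compare approval rates across archetypes. The sketch below is an assumption-heavy illustration: `p.archetype`, `p.score`, and the `decide` callback (standing in for the firm's decision logic) are hypothetical names, and because the default records carry no protected-class fields this is an archetype comparison, not a protected-class disparate-impact test.

```javascript
// Hypothetical helper: approval rate per archetype, plus each rate's
// ratio to the highest-rate archetype. `decide` returns true for approval.
function approvalRatesByArchetype(prospects, decide) {
  const counts = new Map();
  for (const p of prospects) {
    const c = counts.get(p.archetype) ?? { approved: 0, total: 0 };
    c.total += 1;
    if (decide(p)) c.approved += 1;
    counts.set(p.archetype, c);
  }
  const rates = [...counts].map(([archetype, c]) => ({
    archetype,
    rate: c.approved / c.total,
  }));
  const best = Math.max(...rates.map(r => r.rate));
  return rates.map(r => ({
    ...r,
    ratio_to_best: best > 0 ? r.rate / best : null,
  }));
}

// Illustrative records and decision rule only; not taken from the Data Set.
const households = [
  { archetype: "U-03", score: 640 },
  { archetype: "U-03", score: 560 },
  { archetype: "S-02", score: 650 },
  { archetype: "S-02", score: 700 },
];

console.log(approvalRatesByArchetype(households, p => p.score >= 600));
```

An archetype whose ratio falls well below 1 is a candidate for a closer look at the decision and adverse-action logic; what threshold counts as material is a policy choice this sketch does not settle.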

How does this differ from B29 (CDFI / Underbanked)?

B23 (this Data Set) is lending-focused: alternative data for credit underwriting, second-chance loan products, ITIN mortgage, post-bankruptcy rebuild. B29 is transactional / banking-focused: prepaid-to-checking transitions, remittance corridors, cash-economy banking-relationship onboarding. Many CDFIs and inclusive-finance fintechs purchase both, since they cover adjacent but distinct product surfaces.

Does the cash-income estimate have a confidence interval?

Yes — `income.cash_income_estimate` is a structured object with `value`, `ci_low`, `ci_high`, and `methodology` fields. Cash-economy income is genuinely uncertain; pretending otherwise would produce a model that's confident on noise. The confidence interval is calibrated against the IRS's tax-gap analysis methodology for unreported cash income.

Are remittance flows in the data?

For households where remittances are typical (recent immigrants who send money to their countries of origin), yes: outflow amount, frequency, destination corridor, and channel (bank wire, money transfer operator, crypto). The methodology uses World Bank remittance corridor data plus CFPB studies on US-side sender behaviour.

Is this enough data to actually train a production credit model?

100 households is a development and validation corpus, not a production training set. For production credit modelling, you'd combine this with your firm's own lending-decision history (where available) or extend the population through additional generation. Reach out about custom 1,000+ household generation if a larger underserved-segment corpus is needed for ML training.

$4,000, one-time purchase
Includes: 100 households (ZIP), Methodology PDF, JSON and CSV formats
Account required to purchase

Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.
