Inclusive lending sounds like an aspiration until you sit down to build a model and discover that your historical training data systematically excludes the populations you're trying to serve. The customers who could pay you back if you extended credit aren't on the loan tape, because you never lent to them. The ones you did approve are the modal prospects who fit the existing scorecard, so every iteration of model training reinforces the same selection bias. The CRA / Underserved Lending Pack is 100 households deliberately built to break that loop: alternative data signals, thin-file profiles, post-bankruptcy rebuilders, and ITIN filers, each with the structured data your model needs to score them on actual repayment capacity rather than on whether they look like your existing customer base.
Community Reinvestment Act compliance is one driver of inclusive-lending investment, but it's not the dominant one. The dominant driver is that the underserved segments are growth markets — first-generation wealth-building families, recent immigrants, gig-economy workers, post-bankruptcy rebuilders. Together they represent tens of millions of households that traditional credit scoring under-prices. The lender who solves the data problem captures the segment.
The data problem is specific. Traditional credit data from the bureaus produces a very thin file (or no file) for these segments: tradelines are short, balances are low, history is sparse. The signal-to-noise ratio is too low for a conventional logistic-regression credit model to separate good borrowers from bad ones. What predicts repayment in these segments is alternative data: rent payment history, utility payment patterns, remittance flow, mobile-phone payment regularity, employer continuity, banking-relationship age. Most lenders don't have any of this in test fixtures, so model development is starved before it starts.
This Data Set provides the alternative data. 100 households across the canonical underserved profiles — thin-file ITIN filers, post-bankruptcy rebuilders, recent immigrants, low-income working families, disabled veterans, cannabis-industry workers — each with the structured alternative-data signals (utility payment history, rent payment history, cash-income estimates, remittance flows) that actually predict repayment.
Trains and validates an alternative-data credit model on a corpus structured for the segment, ensuring the model exhibits acceptable AUC on the populations the institution is mandated to serve, not just on the modal mainstream borrowers.
Demonstrates to examiners that the bank's lending products are designed and tested specifically for low-to-moderate-income census tracts, with realistic test cases that exercise the inclusive-lending decision logic.
Validates the product's onboarding and credit logic against ITIN filers and recent immigrants, ensuring KYC paths work and credit decisions don't silently fail for the prospects the product is positioned for.
Tests the firm's lending decisions for disparate impact using a labeled test set where protected-class indicators are documented (where consent was given) and the decision logic can be evaluated against ECOA-compliant outputs.
Builds the post-bankruptcy rebuild lending product with realistic 24-month-out-of-discharge profiles, validating that the credit model rewards the rebuild trajectory rather than punishing the historical bankruptcy indefinitely.
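The model-validation use case above depends on measuring AUC within each segment rather than on the pooled population. A minimal sketch of that check follows, assuming each scored record carries a hypothetical `archetype` label, a model `score`, and a binary `repaid` outcome. None of these field names come from the Data Set schema, and the sample rows are made up.

```javascript
// Rank-based AUC: the probability that a randomly chosen repaid loan
// scores higher than a randomly chosen defaulted loan (ties count 0.5).
function auc(records) {
  const pos = records.filter(r => r.repaid === 1);
  const neg = records.filter(r => r.repaid === 0);
  if (pos.length === 0 || neg.length === 0) return null; // undefined without both classes
  let wins = 0;
  for (const p of pos) {
    for (const n of neg) {
      if (p.score > n.score) wins += 1;
      else if (p.score === n.score) wins += 0.5;
    }
  }
  return wins / (pos.length * neg.length);
}

// Group scored records by archetype and compute AUC within each group.
function aucByArchetype(records) {
  const groups = new Map();
  for (const r of records) {
    if (!groups.has(r.archetype)) groups.set(r.archetype, []);
    groups.get(r.archetype).push(r);
  }
  const out = {};
  for (const [name, rows] of groups) out[name] = auc(rows);
  return out;
}

// Hypothetical scored output, not drawn from the Data Set.
const scored = [
  { archetype: "U-03", score: 0.9, repaid: 1 },
  { archetype: "U-03", score: 0.4, repaid: 0 },
  { archetype: "U-03", score: 0.6, repaid: 1 },
  { archetype: "S-02", score: 0.3, repaid: 1 },
  { archetype: "S-02", score: 0.7, repaid: 0 },
];
const segmentAuc = aucByArchetype(scored);
console.log(segmentAuc);
```

With 100 households spread over ten archetypes, each per-segment figure rests on a small sample, so it is better read as a smoke test for segments where the model ranks badly than as a precise performance estimate.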
The 100 households span ten archetypes covering the canonical underserved profiles: F-04 First-Generation Wealth Builders, A-02 Single Parents, S-02 Bankruptcy Recovery, U-01 Unbanked / Recently Banked, U-02 Low-Income Working Families, U-03 Recent Immigrants, B-01 Financial Anxiety / Avoiders, MV-03 Disabled Veterans, N-04 Cannabis-Industry Workers, and X-04 Neurodiverse / Disability Households.
Every household carries the alternative data signals that drive repayment prediction in these segments: 24+ months of rent payment history, utility payment regularity, mobile-phone billing relationships, banking-relationship age (some are recently banked from prepaid-card-only history), remittance outflow patterns where applicable, and cash-income estimation methodology with confidence intervals. Credit data uses thin-file flagging where appropriate; about 35% of the corpus is intentionally thin-file or no-file. Post-bankruptcy households carry a structured rebuild trajectory with re-establishment milestones (secured card, first new tradeline, mortgage re-qualification eligibility).
The Data Set ships as JSON and CSV. The WealthSynth Methodology PDF documents the alternative-data taxonomy, the cash-income estimation methodology, the calibration sources for the underserved-segment profiles (CFPB consumer studies, FDIC unbanked surveys, Pew immigration data), and the specific use cases each archetype is designed to exercise.
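The redacted summary further down writes field paths with dots, such as `credit.thin_file_flag`, while the query examples read nested objects. If the CSV export uses dotted column names in the same way (an assumption; the shipped column layout is not stated here), a small helper can rebuild the nested shape after parsing. The `unflatten` name and the sample row are illustrative only.

```javascript
// Turn { "credit.thin_file_flag": true } into { credit: { thin_file_flag: true } }.
function unflatten(row) {
  const out = {};
  for (const [path, value] of Object.entries(row)) {
    const keys = path.split(".");
    let node = out;
    // Walk (and create) intermediate objects for every key except the last.
    for (const key of keys.slice(0, -1)) {
      if (typeof node[key] !== "object" || node[key] === null) node[key] = {};
      node = node[key];
    }
    node[keys[keys.length - 1]] = value;
  }
  return out;
}

// Hypothetical parsed CSV row; values are made up for illustration.
const flatRow = {
  "id": "HH-001",
  "credit.thin_file_flag": true,
  "demographics.itin_filer": false,
  "income.cash_income_estimate.value": 4200,
};
const household = unflatten(flatRow);
console.log(household.credit.thin_file_flag);
console.log(household.income.cash_income_estimate.value);
```

After this step a record parsed from CSV can be passed to the same `prospects.filter(...)` expressions as one loaded from JSON.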
A redacted summary of one household from this Data Set. Names, employers, exact balances, and metro area are stripped; ages are bucketed; income and net worth are reported as bands. The full record (and all 100 like it) ships in the ZIP.
{
"credit.alternative_data_signals": <value>,
"credit.thin_file_flag": <value>,
"demographics.itin_filer": <value>,
"income.cash_income_estimate": <value>,
"credit.post_bankruptcy_recovery_score": <value>
}
Returns prospects whose traditional credit file is thin (fewer than 4 tradelines or no FICO) but whose alternative data shows reliable rent + utility + employer continuity for 24+ months, the highest-conversion underwriting opportunity in the segment.
// Thin file, at least 22 of 24 on-time rent and utility payments, 24+ months with the same employer
prospects.filter(p =>
  p.credit.thin_file_flag &&
  p.credit.alternative_data_signals.rent_on_time_24mo >= 22 &&
  p.credit.alternative_data_signals.utility_on_time_24mo >= 22 &&
  p.income.employer_tenure_months >= 24
)
Returns ITIN-filing households with sufficient documented income and savings to qualify for an ITIN mortgage product, including the alternative data needed to underwrite without traditional credit history.
// ITIN filer, documented income of at least 40,000, liquid savings of at least 5% of that income
prospects.filter(p =>
  p.demographics.itin_filer &&
  p.income.documented_annual >= 40000 &&
  p.assets.liquid_savings >= p.income.documented_annual * 0.05
)
Returns households 18+ months past bankruptcy discharge with a non-zero rebuild score, sorted by months-since-discharge — a queue for second-chance lending product offers.
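// Households without a bankruptcy_history fail the first test and are filtered out
prospects
  .filter(p =>
    p.credit.bankruptcy_history?.months_since_discharge >= 18 &&
    p.credit.post_bankruptcy_recovery_score > 0
  )
  // Ascending: most recently discharged first
  .sort((a, b) =>
    a.credit.bankruptcy_history.months_since_discharge -
    b.credit.bankruptcy_history.months_since_discharge
  )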
Returns households where documented income is supplemented by an estimated cash-income component, with the confidence interval on the estimate — useful for cash-economy underwriting where formal documentation is incomplete.
prospects.map(p => ({
id: p.id,
documented: p.income.documented_annual,
cash_estimate: p.income.cash_income_estimate.value,
ci_low: p.income.cash_income_estimate.ci_low,
ci_high: p.income.cash_income_estimate.ci_high
})).filter(x => x.cash_estimate > 0)
Each household is generated against archetype-specific distributions sourced from CFPB consumer surveys, FDIC unbanked/underbanked data, and Pew demographic studies on immigrant financial behaviour. Alternative data signals are calibrated against published utility-payment and rent-payment behavioural datasets (Experian RentBureau, Equifax NCTUE) so the realism extends to the noise structure, not just the means. Cash-income estimates use a structured methodology (sampling-based with reported confidence intervals) rather than a deterministic guess, since cash-economy income is genuinely uncertain. The corpus passes the WealthSynth consistency validator and the LLM-as-judge quality gate, with additional review by a fair-lending consultant to verify the alternative-data structures align with current CFPB and OCC guidance on inclusive underwriting. Annual refresh tracks regulatory updates and any changes in alternative-data acceptability for credit decisions.
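The Methodology PDF is the authority on how the cash-income estimate is actually produced. As one illustration of what a sampling-based estimate with a confidence interval can look like, the sketch below runs a percentile bootstrap over hypothetical monthly cash-deposit observations. The choice of bootstrap, the seeded generator, the 90% level, and the sample figures are all assumptions made for this example.

```javascript
// Small seeded PRNG (mulberry32-style) so the sketch is reproducible.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Percentile bootstrap: resample months with replacement, annualize the mean,
// and read the interval off the sorted resampled estimates.
function bootstrapAnnualCash(monthly, { draws = 2000, level = 0.9, seed = 1 } = {}) {
  const rand = seededRandom(seed);
  const mean = xs => xs.reduce((s, x) => s + x, 0) / xs.length;
  const estimates = [];
  for (let i = 0; i < draws; i++) {
    const sample = monthly.map(() => monthly[Math.floor(rand() * monthly.length)]);
    estimates.push(mean(sample) * 12);
  }
  estimates.sort((a, b) => a - b);
  const lo = Math.floor(((1 - level) / 2) * draws);
  const hi = Math.ceil((1 - (1 - level) / 2) * draws) - 1;
  return {
    value: mean(monthly) * 12,
    ci_low: estimates[lo],
    ci_high: estimates[hi],
    methodology: "percentile_bootstrap", // illustrative label only
  };
}

// Hypothetical monthly cash income observations for one household.
const monthlyCash = [300, 450, 0, 520, 380, 410, 290, 600, 350, 0, 480, 420];
const cashEstimate = bootstrapAnnualCash(monthlyCash);
console.log(cashEstimate);
```

The returned object mirrors the `value`, `ci_low`, and `ci_high` fields read by the query above, so the width of the interval travels with the point estimate instead of being discarded.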
The Data Set is calibrated to align with CRA examination focus areas (LMI lending, alternative data acceptability, second-chance lending product design), but a regulator's view of CRA compliance depends on the bank's actual lending activity, not the data used to design it. The Data Set serves as a development and testing tool; CRA-credit-eligibility for actual loans depends on borrower facts.
Protected-class indicators (race, ethnicity, religion) are NOT in the default household record. They appear only in the conditional privacy overlay, which is populated for B08 and B26 households per the WealthSynth privacy contract. Households in this Data Set (B23) do not carry these fields by default, so fair-lending testing on this corpus is intentionally agnostic to protected class.
No. All data is synthetic. The signals are calibrated against published distributions from Experian RentBureau, Equifax NCTUE, and CFPB studies, but every individual household record is generated from those distributions — not derived from any real consumer.
Yes — the corpus is designed for fair-lending testing, including ECOA, the FHA, and CFPB UDAAP focus areas. The pre-labelled archetype taxonomy lets you run the firm's adverse-action and decision-explanation logic against the populations regulators most often cite in fair-lending reviews.
B23 is lending-focused: alternative data for credit underwriting, second-chance loan products, ITIN mortgage, post-bankruptcy rebuild. B29 is transactional / banking-focused: prepaid-to-checking transitions, remittance corridors, cash-economy banking-relationship onboarding. Many CDFIs and inclusive-finance fintechs purchase both — they cover adjacent but distinct product surfaces.
Yes — `income.cash_income_estimate` is a structured object with `value`, `ci_low`, `ci_high`, and `methodology` fields. Cash-economy income is genuinely uncertain; pretending otherwise would produce a model that's confident on noise. The confidence interval is calibrated against the IRS's tax-gap analysis methodology for unreported cash income.
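One way to put the interval to work is to underwrite on the lower bound rather than on the point estimate. The sketch below assumes that policy and a hypothetical `qualifyingIncome` helper; neither is prescribed by the Data Set, and the sample household is invented.

```javascript
// Count documented income in full, but only the lower confidence bound
// of the estimated cash income.
function qualifyingIncome(p) {
  const est = p.income.cash_income_estimate;
  const cashFloor = est ? Math.max(0, est.ci_low) : 0; // no estimate means no cash credit
  return p.income.documented_annual + cashFloor;
}

// Hypothetical household; values are illustrative only.
const sampleHousehold = {
  income: {
    documented_annual: 28000,
    cash_income_estimate: { value: 6000, ci_low: 3500, ci_high: 9000, methodology: "illustrative" },
  },
};
console.log(qualifyingIncome(sampleHousehold)); // 31500
```

A wide interval then lowers the income a household qualifies on without any extra rule, which is the behaviour a single-number estimate cannot provide.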
For households where remittances are typical (recent immigrants from sender-region countries), yes — outflow amount, frequency, destination corridor, and channel (bank wire, MTO, crypto). The methodology uses World Bank remittance corridor data plus CFPB studies on US-side sender behaviour.
100 households is a development and validation corpus, not a production training set. For production credit modelling, you'd combine this with your firm's own lending-decision history (where available) or extend the population through additional generation. Reach out about custom 1,000+ household generation if a larger underserved-segment corpus is needed for ML training.
80 underbanked and underserved households: prepaid-card users, check-cashing customers, ITIN filers, post-bankruptcy unbanked, and cash-economy participants. Companion to B23 (CRA) but focused on transactional / banking inclusion rather than lending.
180 households with detailed student loan data: loan types, servicers, IDR plan enrollment, PSLF qualifying-payment counts, refinancing history, and forgiveness-tax-bomb projections. Includes Parent PLUS borrowers and double-consolidation paths.
400 prospect households covering RIA client variety from formation through retirement. KYC-complete records, goal-based planning fields, initial recommendation outputs, and CRM-compatible field naming. The broadest single bundle by archetype coverage.
Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.