A mortgage origination engine has to evaluate the borrower (income, debt, credit, employment), the property (appraisal, type, location, encumbrances), the loan structure (LTV, DTI, ratio compliance, program eligibility), and the regulatory overlay (TRID, RESPA, ECOA, HMDA, state-specific rules) — usually within a 24-72 hour timeline from application to commitment letter. The engine has to be both accurate and fast, with regulator-defensible decisions.
The typical test corpus for a mortgage engine reflects the typical borrower: W-2 income from a single employer, single-state residence and employment, single-family residence in good condition, conventional conforming loan. Real intake is much messier than that. This article is the working note on the messier scenarios — and the synthetic-data shape needed to exercise them.
What the engine has to do
A mortgage engine's decision pipeline:
- Stage 1: Application intake and pre-qualification. Receive the 1003 (Uniform Residential Loan Application). Validate completeness. Compute initial DTI, LTV, and program eligibility.
- Stage 2: Documentation collection and verification. Pay stubs, W-2s, bank statements, tax returns. Employment verification. Asset verification. Each document type has parsing and reconciliation logic.
- Stage 3: Property evaluation. Appraisal ordering, review, and reconciliation. Title search. Property type classification (SFR / condo / manufactured / multi-family / mixed-use).
- Stage 4: Underwriting decision. Compute final ratios, evaluate program eligibility (conventional / FHA / VA / USDA / non-QM), apply guideline overlays, produce an approve / counter-offer / decline decision.
- Stage 5: Disclosure and closing. TRID-compliant disclosures (Loan Estimate, Closing Disclosure). Tolerance compliance. Final closing instructions.
- Stage 6: Funding and post-closing. Funding instructions to title/escrow. Post-closing audit, HMDA reporting, secondary market sale.
Each stage is its own engine. A complete test corpus has to exercise all of them.
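As a minimal sketch of the Stage 1 ratio calculations, the functions below compute LTV and back-end DTI. The function names and the choice of `Decimal` are illustrative assumptions, not part of any particular engine; the sketch also assumes that purchase loans use the lesser of appraised value and purchase price as the LTV basis, which is a common convention but program-specific in practice.

```python
from decimal import Decimal
from typing import Optional


def ltv(loan_amount: Decimal,
        appraised_value: Decimal,
        purchase_price: Optional[Decimal] = None) -> Decimal:
    """Loan-to-value ratio as a fraction (0.80 means 80%).

    Assumption: for purchases, the basis is the lesser of appraised
    value and purchase price; for refinances, pass no purchase price.
    """
    basis = appraised_value if purchase_price is None else min(appraised_value, purchase_price)
    if basis <= 0:
        raise ValueError("property value basis must be positive")
    return loan_amount / basis


def dti(monthly_debt: Decimal,
        proposed_housing_payment: Decimal,
        monthly_income: Decimal) -> Decimal:
    """Back-end debt-to-income ratio as a fraction."""
    if monthly_income <= 0:
        raise ValueError("monthly income must be positive")
    return (monthly_debt + proposed_housing_payment) / monthly_income


if __name__ == "__main__":
    print(ltv(Decimal("320000"), Decimal("410000"), Decimal("400000")))
    print(dti(Decimal("500"), Decimal("2000"), Decimal("8000")))
```

Using `Decimal` rather than floats keeps ratio comparisons against guideline thresholds free of binary rounding surprises, which matters when a loan sits exactly on a cutoff.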
The borrower edge cases
The borrower scenarios that produce production failures:
- Self-employed / 1099 with two-year averaging: income is highly variable. The engine has to qualify on the lower of the current-year figure and the two-year average. Recent declines should reduce calculated income, not be averaged away.
- Gig-platform income (Uber, DoorDash, Upwork) — 1099-NEC with no employer letter possible. Verification flow differs from W-2.
- Multiple income streams — W-2 plus 1099 plus K-1 plus rental. Each income type has its own qualifying-income calculation. Engines that sum naively over-state qualifying income.
- Bonus / commission income — qualifying income requires 2-year history minimum. Engines that include current-year bonus without history are non-compliant with most agency guidelines.
- ITIN-only filers — apply with Individual Taxpayer Identification Number rather than SSN. Engines that assume SSN-shaped IDs reject these. ITIN-only mortgage products exist but require specific underwriting paths.
- Asset-only qualification — borrowers with substantial assets but limited income (retirees, between jobs). Asset-depletion calculations are program-specific.
- Foreign nationals — non-resident aliens, foreign-employer income, foreign-currency holdings. Specialized programs exist; verification is non-trivial.
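- Recently-bankrupt or foreclosed borrowers: a Chapter 7 discharged 4+ years ago is conventionally lendable; a foreclosure generally requires a 7-year wait under conventional (Fannie Mae / Freddie Mac) guidelines, shortened to roughly 3 years with documented extenuating circumstances or under FHA. Engines that auto-decline on any past bankruptcy or foreclosure, regardless of elapsed time, reject eligible borrowers and invite fair-lending scrutiny.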
- First-time homebuyers in CRA areas — eligible for special programs (down-payment assistance, lower rates). Engine has to identify CRA applicability and route to the appropriate path.
- Cosigner / non-occupant co-borrower structures — common in student-loan-pressed first-time buyer scenarios. Income aggregation rules vary by program.
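The income rules in the list above can be sketched as a per-stream calculation rather than a naive sum. This is a simplified illustration under stated assumptions: the `IncomeStream` type and the `kind` labels are hypothetical, W-2 income is taken at the most recent year, and both self-employed and bonus income require two years of history and qualify at the lower of the most recent year and the two-year average. Real agency guidelines add further conditions (trend analysis, add-backs, documentation) that are not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncomeStream:
    kind: str                    # hypothetical labels: "w2", "self_employed", "bonus"
    annual_history: List[float]  # oldest year first, most recent year last


VARIABLE_KINDS = {"self_employed", "bonus"}


def qualifying_annual_income(stream: IncomeStream) -> float:
    """Qualifying income for one stream under the simplified rules above."""
    history = stream.annual_history
    if not history:
        return 0.0
    if stream.kind == "w2":
        return history[-1]
    if stream.kind in VARIABLE_KINDS:
        if len(history) < 2:
            return 0.0  # no two-year history: the stream does not qualify
        two_year_avg = (history[-1] + history[-2]) / 2
        # A recent decline lowers the figure instead of being averaged away.
        return max(0.0, min(history[-1], two_year_avg))
    raise ValueError(f"no qualifying-income rule for kind {stream.kind!r}")


def total_qualifying_income(streams: List[IncomeStream]) -> float:
    return sum(qualifying_annual_income(s) for s in streams)


if __name__ == "__main__":
    streams = [
        IncomeStream("w2", [90_000, 96_000]),
        IncomeStream("self_employed", [80_000, 50_000]),  # declining
        IncomeStream("bonus", [12_000]),                  # only one year
    ]
    naive = sum(s.annual_history[-1] for s in streams)
    print(naive, total_qualifying_income(streams))
```

In this example the naive sum of latest-year figures is 158,000, while the per-stream rules yield 146,000: the declining self-employment income is capped at its most recent year and the one-year bonus is excluded.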
The property edge cases
Property scenarios that produce production failures:
- Manufactured / modular homes — different appraisal requirements, different program eligibility (some programs exclude these).
- Mixed-use property: owner-occupied with a commercial component (the typical bodega-with-apartment setup). Loan-to-value calculations differ, and some programs do not allow these properties at all.
- Condo with non-warrantable status — if HOA financials, owner-occupancy ratios, or insurance don't meet agency standards, the property may not be saleable to Fannie/Freddie. Engine has to identify before commitment.
- Properties with deed restrictions — affordable-housing covenants, age-restricted communities, ground leases. Each has program implications.
- Properties in flood zones — flood insurance required; affordability impact. Engine has to compute total housing payment including flood insurance.
- Non-arm's-length transactions — purchase from family, employer-mediated transactions. Different documentation requirements.
- Rural USDA-eligible properties: the property has to pass the eligibility map check, and the borrower has to be income-eligible as well.
- Investment properties — different LTV / reserves requirements. Rental income contribution to qualification varies by program.
- Multi-unit (2-4 unit) owner-occupied — eligible for primary-residence programs but requires multi-unit-specific calculations.
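For the flood-zone case above, a minimal sketch of a total housing payment that includes flood insurance might look like the following. The function names and parameters are illustrative assumptions; the principal-and-interest formula is the standard fixed-rate amortization formula, and mortgage insurance is left out for brevity.

```python
def monthly_principal_and_interest(principal: float,
                                   annual_rate: float,
                                   years: int = 30) -> float:
    """Standard fixed-rate amortizing payment."""
    n = years * 12
    r = annual_rate / 12
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def total_housing_payment(principal_and_interest: float,
                          annual_taxes: float,
                          annual_hazard_insurance: float,
                          annual_flood_insurance: float = 0.0,
                          monthly_hoa: float = 0.0) -> float:
    """Monthly housing payment including escrowed items and flood insurance."""
    escrow = (annual_taxes + annual_hazard_insurance + annual_flood_insurance) / 12
    return principal_and_interest + escrow + monthly_hoa


if __name__ == "__main__":
    pi = monthly_principal_and_interest(300_000, 0.06, 30)
    without_flood = total_housing_payment(pi, 3_600, 1_200)
    with_flood = total_housing_payment(pi, 3_600, 1_200, annual_flood_insurance=2_400)
    print(round(pi, 2), round(without_flood, 2), round(with_flood, 2))
```

A test case that places the same borrower in and out of a flood zone should show the housing ratio moving by the flood premium; an engine whose ratio does not change is dropping the premium from the payment.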
The compliance overlay
Beyond underwriting accuracy, the engine has to comply with a thicket of regulations:
| Regulation | What it requires | Common engine failure |
|---|---|---|
| TRID (TILA-RESPA Integrated Disclosure; Reg Z + Reg X) | Loan Estimate within 3 business days of application; Closing Disclosure at least 3 business days before consummation; tolerance compliance on cost categories | Tolerance violations in re-disclosed Loan Estimates |
| ECOA / Reg B | No discrimination on a protected basis; adverse-action notices with principal reasons; collection of applicant monitoring information (which feeds HMDA reporting) | Disparate-impact patterns across geographic / ethnic dimensions |
| TILA / QM rules | General QM status based on APR-over-APOR pricing thresholds (which replaced the earlier 43% DTI cap); safe harbor for QMs that are not higher-priced, rebuttable presumption for higher-priced QMs; ATR (ability to repay) requirements | Non-QM loans extended without proper ATR documentation |
| RESPA Section 8 | No kickbacks for referrals; affiliated business arrangement disclosures | Marketing arrangements that drift into kickback territory |
| HMDA reporting | Per-application reporting of demographic, loan-amount, decision data; LAR (loan-application register) submission | LAR data quality issues that produce HMDA fair-lending audit findings |
| State licensing and rate caps | State-by-state lending license; usury caps; consumer protection statutes (NY DFS, California DFPI (formerly DBO), etc.) | Rate quotes above state caps; product offerings unauthorized in particular states |
A test corpus has to include scenarios that exercise each regulatory path. An engine that underwrites correctly but violates TRID ships disclosures that produce CFPB findings.
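As one example of a compliance gate a corpus should exercise, here is a hedged sketch of a TRID tolerance check comparing Loan Estimate amounts with Closing Disclosure amounts. The `Fee` type and the bucket labels are hypothetical, and bucket assignment is treated as an input. The sketch deliberately ignores valid changed circumstances and revised Loan Estimates, which a real engine has to account for before flagging a violation.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class Fee:
    name: str
    bucket: str                  # hypothetical labels: "zero", "ten_percent", "unlimited"
    loan_estimate: Decimal
    closing_disclosure: Decimal


def tolerance_violations(fees: List[Fee]) -> List[str]:
    """Return human-readable tolerance violations (simplified)."""
    violations = []

    # Zero-tolerance fees: any increase on an individual fee is a violation.
    for fee in fees:
        if fee.bucket == "zero" and fee.closing_disclosure > fee.loan_estimate:
            violations.append(f"zero-tolerance increase: {fee.name}")

    # 10% bucket: compare the aggregate, not each fee individually.
    ten = [f for f in fees if f.bucket == "ten_percent"]
    le_total = sum((f.loan_estimate for f in ten), Decimal("0"))
    cd_total = sum((f.closing_disclosure for f in ten), Decimal("0"))
    if cd_total > le_total * Decimal("1.10"):
        violations.append("10% cumulative tolerance exceeded")

    return violations


if __name__ == "__main__":
    fees = [
        Fee("origination", "zero", Decimal("1000"), Decimal("1000")),
        Fee("appraisal", "zero", Decimal("500"), Decimal("550")),
        Fee("recording", "ten_percent", Decimal("200"), Decimal("210")),
        Fee("title services", "ten_percent", Decimal("800"), Decimal("900")),
    ]
    for v in tolerance_violations(fees):
        print(v)
```

Adversarial corpus entries for this gate are loans whose fees sit just under and just over each threshold, so that an off-by-one comparison or a per-fee check on the 10% bucket shows up as a wrong result.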
What a working test corpus looks like
A 2,000-loan stress-test corpus, distributed roughly:
- 60% nominal applicants spanning the realistic credit, income, and property distribution
- 25% borrower edge cases covering the inventory above
- 10% property edge cases covering the inventory above
- 5% adversarial / compliance test cases specifically engineered to exercise validation gates
Each loan in the corpus has full documentation: 1003, pay stubs, W-2s, tax returns, bank statements, employment verification, appraisal, title commitment, property data. The synthetic data has to be document-grade — engine code parses documents, not abstract records, and document-level synthesis is the surface area where most bugs ship.
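The distribution above translates into per-category loan counts. The sketch below is one way to allocate them; the category keys and the `allocate` function are illustrative names, and the largest-remainder rounding is an assumption chosen so that the counts always sum to the requested corpus size.

```python
from typing import Dict

# Percentages taken from the distribution described above.
CORPUS_MIX_PCT: Dict[str, int] = {
    "nominal": 60,
    "borrower_edge": 25,
    "property_edge": 10,
    "adversarial_compliance": 5,
}


def allocate(total: int, mix_pct: Dict[str, int]) -> Dict[str, int]:
    """Split `total` loans across categories using largest-remainder rounding."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    counts = {k: total * pct // 100 for k, pct in mix_pct.items()}
    leftover = total - sum(counts.values())
    # Hand out any leftover loans to the categories with the largest remainders.
    by_remainder = sorted(mix_pct, key=lambda k: (total * mix_pct[k]) % 100, reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


if __name__ == "__main__":
    print(allocate(2000, CORPUS_MIX_PCT))
```

For a 2,000-loan corpus this gives 1,200 nominal, 500 borrower edge, 200 property edge, and 100 adversarial / compliance cases.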
Key takeaways
- Mortgage origination has 6 stages (intake, doc collection, property evaluation, underwriting, disclosure and closing, funding and post-closing), each with its own engine. A complete stress test exercises all 6.
- Borrower edge cases include self-employed two-year-averaging, multiple income streams, ITIN filers, asset-only qualification, foreign nationals, recently-bankrupt, and first-time CRA buyers. Each is 1-10% of real intake.
- Property edge cases include manufactured homes, mixed-use, non-warrantable condos, deed-restricted, flood-zone, non-arm's-length, USDA-eligible, investment, and multi-unit. Each requires distinct underwriting paths.
- The compliance overlay is real: TRID, ECOA, TILA QM, RESPA, HMDA, state licensing. Engines that underwrite correctly but fail compliance checks ship disclosures that produce CFPB findings.
- Test corpus has to be document-grade (1003 + pay stubs + W-2s + tax returns + bank statements + appraisal) and include 5% specifically adversarial / compliance test cases to exercise validation gates.