AML transaction monitoring engine design — a build, test, and validation guide

WealthSchema Staff | AML / BSA modeling | May 25, 2026 | 7 min read

Most AML transaction monitoring engines fail in the same way. Not loudly — quietly, in production, generating either too many alerts (and exhausting the analysts who clear them) or too few (and missing exactly the patterns FinCEN advisories were warning about). The miss isn't usually in the rule logic. It's in the test corpus the rule logic was tuned against.

This guide is for the engineering and compliance teams designing, tuning, or validating a transaction-monitoring engine. It walks through the FinCEN-recognized typologies that drive the regulatory expectation, the data-model decisions that determine whether a typology can be detected at all, and the specific testing methodology that produces an engine the BSA Officer can sign off on and the firm can defend in front of an examiner.

If you've pinned our AML / BSA Transaction Monitoring Data Field Checklist to the wall above your monitor, this is the long-form companion that explains why each field on the checklist matters.

The four AML failures regulators care about

Before any engine design, it's worth being precise about what "failing" means. FinCEN, OCC, and state regulators have collectively brought a generation of consent orders against firms with AML deficiencies. The orders cluster into four failure modes:

Failure 1

Scope too narrow

Engine looks at wires but not ACH. Or retail customers but not small-business deposit accounts. Scope failures are the single most common consent-order finding.

Failure 2

Detection rules don't match documented typologies

The classic example: the engine detects single-day cash deposits over $10,000 but misses the structured pattern of $9,500 deposits spread across multiple branches over a week.

Failure 3

Alert clearance unsupervised / undocumented

Engine fires; analysts clear; nobody can reproduce the analyst's reasoning two years later. A documentation problem masquerading as a detection problem.

Failure 4

Escalation and SAR-filing not auditable

The path from 'alert' to 'SAR filed' involves judgment that, if not captured, gives an examiner nothing to evaluate except outcomes.

A defensible engine has to engineer against all four. This guide focuses primarily on (1) and (2), which are the engineering team's domain. Failures (3) and (4) are mostly process problems and live in the BSA Officer's playbook.

The canonical FinCEN typologies your engine has to detect

FinCEN publishes typologies in advisories and the SAR Stats reports. The list isn't fixed — typologies evolve as criminal patterns evolve — but the structural categories have been stable for over a decade:

Structuring (31 CFR 1010.314). Deliberately breaking transactions into amounts below the $10,000 CTR threshold. The simple version is one customer making multiple sub-threshold cash deposits. The harder versions involve multiple accounts, multiple branches, multiple days, and intermediary parties. An engine that only detects the simple version will fail an exam the first time the examiner asks for the multi-account pattern.
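
The multi-branch version of the pattern can be sketched as a rolling-window aggregation per customer. This is a minimal illustration rather than a production rule: the `CashDeposit` record, the `structuring_candidates` function, and the 7-day window are assumptions made for the example, and grouping by `customer_id` alone would not catch the intermediary-party variants described above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from decimal import Decimal

CTR_THRESHOLD = Decimal("10000")   # cash reporting threshold cited in the text
WINDOW = timedelta(days=7)         # assumed look-back; tune per program


@dataclass(frozen=True)
class CashDeposit:                 # hypothetical record shape
    customer_id: str
    account_id: str
    branch_id: str
    timestamp: datetime
    amount: Decimal


def structuring_candidates(deposits):
    """Return customer_ids whose sub-threshold cash deposits sum past the
    threshold inside any rolling WINDOW, across all accounts and branches."""
    by_customer = defaultdict(list)
    for d in deposits:
        if d.amount < CTR_THRESHOLD:          # only sub-threshold deposits
            by_customer[d.customer_id].append(d)

    flagged = set()
    for customer_id, items in by_customer.items():
        items.sort(key=lambda d: d.timestamp)
        start = 0
        running = Decimal("0")
        for end, dep in enumerate(items):
            running += dep.amount
            # Slide the left edge so the window spans at most WINDOW.
            while dep.timestamp - items[start].timestamp > WINDOW:
                running -= items[start].amount
                start += 1
            # Require at least two deposits inside the window.
            if end > start and running > CTR_THRESHOLD:
                flagged.add(customer_id)
                break
    return flagged
```

Because the window slides with each deposit rather than resetting on a calendar boundary, two deposits on either side of a week or month cutoff still aggregate together.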

Layering and integration. Money moves through a sequence of accounts to obscure its origin. Layering typologies require a transaction-graph view, not a per-transaction view. An engine with no concept of the graph between accounts cannot detect layering at all.
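
To make the graph requirement concrete, here is a minimal sketch of a depth-limited chain search over transfers. The `Transfer` record, the `layering_chains` function, and every parameter (hop count, time gap, retention ratio, depth cap) are illustrative assumptions, not thresholds drawn from any advisory.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Transfer:                    # hypothetical record shape
    src: str
    dst: str
    amount: float
    timestamp: datetime


def layering_chains(transfers, min_hops=3, max_gap=timedelta(days=2),
                    retention=0.9, max_depth=6):
    """Find chains where each hop follows the previous one within max_gap
    and forwards at least `retention` of the prior amount. Chains longer
    than min_hops also yield their qualifying suffixes."""
    outgoing = defaultdict(list)
    for t in transfers:
        outgoing[t.src].append(t)

    chains = []

    def extend(path):
        last = path[-1]
        visited = {path[0].src} | {t.dst for t in path}
        extended = False
        if len(path) < max_depth:              # finite-depth search
            for nxt in outgoing[last.dst]:
                gap = nxt.timestamp - last.timestamp
                if (timedelta(0) <= gap <= max_gap
                        and nxt.amount >= retention * last.amount
                        and nxt.dst not in visited):
                    extended = True
                    extend(path + [nxt])
        # Record the chain only when it cannot be extended further.
        if not extended and len(path) >= min_hops:
            chains.append([path[0].src] + [t.dst for t in path])

    for t in transfers:
        extend([t])
    return chains
```

The `max_depth` cap is what keeps the search tractable, and it is also the limit an adversary can exploit by routing through more accounts than the search will follow.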

Source-of-funds anomalies. Activity inconsistent with the customer's KYC profile. A retiree on Social Security suddenly receiving $40K monthly wires. A college student depositing $25K cash. The detection requires expected behavior baselines per customer segment — meaning your engine has to know what each archetype's normal activity range is, which means your data model has to capture customer archetype.
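
A baseline check of this kind might look like the sketch below. The archetype names, the dollar ceilings, and the `tolerance` multiplier are all hypothetical values chosen for the example; real baselines would come from the firm's own segment profiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchetypeBaseline:
    archetype: str
    expected_monthly_inflow_max: float   # upper end of the expected range


# Hypothetical baselines for illustration only.
BASELINES = {
    "retiree_fixed_income": ArchetypeBaseline("retiree_fixed_income", 6_000.0),
    "college_student": ArchetypeBaseline("college_student", 3_000.0),
}


def inflow_anomaly(archetype, monthly_inflow, tolerance=2.0):
    """Return the ratio of observed to expected inflow when it exceeds
    `tolerance`, else None. An unknown archetype raises rather than
    passing silently, because a missing archetype is a data-model gap."""
    baseline = BASELINES.get(archetype)
    if baseline is None:
        raise KeyError(f"no baseline for archetype {archetype!r}")
    ratio = monthly_inflow / baseline.expected_monthly_inflow_max
    return ratio if ratio > tolerance else None
```

Raising on an unknown archetype is a deliberate choice in this sketch: a customer with no archetype cannot be evaluated against any baseline, and that should surface as a data problem rather than as a quiet non-alert.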

PEP and PEP-adjacent activity. Transactions involving politically exposed persons (PEPs), their family members, or close associates. Detection depends on accurate sanctions and PEP screening at customer onboarding and on counterparty resolution for every transaction.

Funnel-account activity. Multiple deposits in different geographies funneling into a single account, often followed by a wire transfer out. Detection requires geographic dispersion analysis on the deposit side.
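
One way to express the dispersion check is sketched here, assuming each deposit carries a region label. The `Deposit` and `WireOut` records, the three-region minimum, and the 7-day look-back are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Deposit:                     # hypothetical record shape
    account_id: str
    region: str                    # e.g. state or branch market
    day: date
    amount: float


@dataclass(frozen=True)
class WireOut:                     # hypothetical record shape
    account_id: str
    day: date
    amount: float


def funnel_candidates(deposits, wires, min_regions=3,
                      lookback=timedelta(days=7)):
    """Flag accounts where an outbound wire follows deposits made in at
    least `min_regions` distinct regions during the preceding `lookback`."""
    by_account = defaultdict(list)
    for d in deposits:
        by_account[d.account_id].append(d)

    flagged = {}
    for w in wires:
        recent = [d for d in by_account.get(w.account_id, [])
                  if timedelta(0) <= w.day - d.day <= lookback]
        regions = {d.region for d in recent}
        if len(regions) >= min_regions:
            flagged[w.account_id] = sorted(regions)
    return flagged
```

The check only works if originating geography is captured on the deposit side, which is why that field appears in the per-transaction capture list.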

Trade-based money laundering (TBML). Over- or under-invoicing of goods, repeated correction of trade documents, payments inconsistent with the underlying trade.

Cyber-enabled fraud and account takeover. Distinct from laundering typologies but increasingly bundled into AML monitoring scope by examiners.

Human-trafficking and child-exploitation typologies. FinCEN has issued specific advisories with detection patterns — hospitality charges in unusual geographies, repeated peer-to-peer payments to many counterparties of similar small amounts, payments to specific merchant categories.

Elder-financial-exploitation typologies. FinCEN Advisory FIN-2022-A002 and several precursors. Sudden large withdrawals by elderly customers, new account-signing authority granted to a non-relative, payment patterns indicative of grandparent scams.

The structural test for an AML engine is whether it can fire on each of these typology categories independently and in combination. Most production engines are strong on structuring and PEP screening, weak or silent on layering, source-of-funds anomalies, and the human-trafficking and elder-exploitation typologies.

What your data model has to capture

Your engine's detection ceiling is determined by your data model's capture floor. Specifically:

Per-transaction. Timestamp at sub-second precision; amount; currency; channel (wire, ACH, card, internal transfer, cash); originator account; beneficiary account; originator bank; beneficiary bank; originating geography; beneficiary geography; payment-network reference IDs; correlated transactions.

Per-customer. Archetype classification (the SCF-aligned segment the customer falls into — see our archetype catalog); KYC profile data (occupation, source of wealth, expected transaction volume range); sanctions and PEP status; beneficial-ownership graph for entity customers; relationship graph for related parties.

Per-account. Account type; opening date; signers; signature-authority changes over time; cross-references to other accounts at the firm and at known related institutions.

Per-counterparty. Internal: full account context. External: highest-fidelity counterparty data the channel provides.

Temporal context. Rolling baselines per customer (30/90/365-day profiles); rolling baselines per archetype; recent-velocity features.

Outside-the-firm context. Device fingerprinting; IP geolocation; behavioral biometrics; known-bad lists; sanctions list synchronization timestamps.

Most production engines capture per-transaction and per-customer fields well, per-account and per-counterparty acceptably, and temporal and external-context fields poorly. The temporal-context gap is where the engine misses structuring and layering. The external-context gap is where the engine misses cyber-enabled fraud.
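
As a rough illustration of the per-transaction capture list, the sketch below models the fields as a typed record and reports which optional ones were left empty. The class name, field names, and the choice of which fields are optional are assumptions; a real schema would follow the firm's own data dictionary.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from decimal import Decimal
from enum import Enum
from typing import Optional


class Channel(Enum):
    WIRE = "wire"
    ACH = "ach"
    CARD = "card"
    INTERNAL = "internal_transfer"
    CASH = "cash"


@dataclass(frozen=True)
class TransactionRecord:           # hypothetical schema for illustration
    timestamp: datetime            # keep sub-second precision
    amount: Decimal
    currency: str
    channel: Channel
    originator_account: str
    beneficiary_account: str
    originator_bank: Optional[str] = None
    beneficiary_bank: Optional[str] = None
    originating_geography: Optional[str] = None
    beneficiary_geography: Optional[str] = None
    network_reference_id: Optional[str] = None
    correlated_tx_ids: tuple = ()


def capture_gaps(record):
    """Names of optional fields left empty on this record."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, ())]
```

Running `capture_gaps` across a sample of production records gives a quick per-field fill rate, which is one way to measure the capture floor before arguing about detection logic.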

Designing the detection rule layer

A defensible rule layer has three tiers:

Tier	What it does	Cadence	Infrastructure
Tier 1: Threshold rules	Fire on absolute thresholds (single-tx CTR, structuring-amount wire)	Synchronous, batched daily	Deterministic, easy to audit
Tier 2: Behavioral baselines	Departures from per-customer / per-archetype expected behavior	Rolling baselines maintained continuously, near-real-time fire	Requires temporal context and customer archetype
Tier 3: Pattern-graph rules	Multi-tx, multi-account patterns (funnel, layering, coordinated structuring)	Hourly or shift-based graph computation	Computationally and operationally expensive

A common design mistake is to treat all three tiers as a single rule engine. The right design typically separates them by cadence and infrastructure. Conflating the three produces an engine that's either too slow to fire on real-time threats or too noisy to operate at scale.
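
The separation can be kept explicit in code by registering each rule against a tier and letting each tier run on its own schedule. The sketch below is one possible shape, with `Tier`, `Rule`, and `RuleRegistry` as hypothetical names; scheduling itself is left to whatever infrastructure the firm already runs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Tier(Enum):
    THRESHOLD = 1        # deterministic threshold rules
    BEHAVIORAL = 2       # baseline-departure rules
    PATTERN_GRAPH = 3    # multi-account graph rules


@dataclass(frozen=True)
class Rule:
    name: str
    tier: Tier
    evaluate: Callable[[dict], bool]   # context dict -> fired?


class RuleRegistry:
    def __init__(self):
        self._rules = {tier: [] for tier in Tier}

    def register(self, rule):
        self._rules[rule.tier].append(rule)

    def run_tier(self, tier, context):
        """Evaluate only one tier's rules; each tier's runner is invoked
        on its own cadence by the surrounding infrastructure."""
        return [r.name for r in self._rules[tier] if r.evaluate(context)]
```

Keeping the tiers in separate lists means a slow graph rule can never delay a threshold rule, and each tier can be audited and load-tested on its own.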

How to test an AML engine

This is the part most teams under-invest in. The patterns AML engines need to detect are, by definition, rare in production data. Your real customers are mostly legitimate, and the suspicious cases that do exist are typically ones you've already filed SARs on, which makes them known positives and poor material for testing what the engine still misses.

The structural answer is a synthetic test corpus calibrated specifically to exercise each typology. The corpus needs:

Five categories your test corpus has to contain

Negative cases: legitimate-but-superficially-anomalous (variable-income small business, gig worker, freelancer paid in lump sums). If the engine alerts on these, you have a false-positive problem.
Pure-typology positive cases: each canonical typology in isolation (pure structuring, pure layering, pure source-of-funds anomaly, pure PEP-adjacent). Tests rule sensitivity without confounders.
Mixed-typology positive cases: multiple typologies in combination (structuring on one account while a related party shows funnel-account behavior). Tests cross-account signal correlation.
Edge cases that should NOT fire (elderly with documented POA activity, self-employed with documented variable-income profile, family-business cash patterns explained by KYC notes). Tests contextual exoneration.
Adversarial cases: deliberately designed to evade (sub-threshold structuring spread across rolling-window cutoffs, layering through enough accounts that finite-depth graph search misses it). Tests robustness against pattern evolution.
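
The five categories above can be scored separately so that a false-positive problem and a robustness problem do not hide inside one aggregate number. The harness below is a minimal sketch: the `Category` labels, the `CorpusCase` shape, and the `engine_fires` callable are assumptions standing in for whatever interface the real engine exposes.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    NEGATIVE = "negative"
    PURE_POSITIVE = "pure_typology_positive"
    MIXED_POSITIVE = "mixed_typology_positive"
    EDGE_NO_FIRE = "edge_should_not_fire"
    ADVERSARIAL = "adversarial"


# Categories in which the engine is expected to alert.
SHOULD_FIRE = {Category.PURE_POSITIVE, Category.MIXED_POSITIVE,
               Category.ADVERSARIAL}


@dataclass(frozen=True)
class CorpusCase:                  # hypothetical case shape
    case_id: str
    category: Category
    payload: dict


def score_corpus(cases, engine_fires):
    """Per-category share of cases where the engine behaved as intended.
    `engine_fires` is any callable taking a payload and returning bool."""
    total, correct = Counter(), Counter()
    for case in cases:
        expected = case.category in SHOULD_FIRE
        total[case.category] += 1
        if engine_fires(case.payload) == expected:
            correct[case.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

A low score in the negative or edge categories points at false positives, while a low score in the adversarial category points at evasion routes the rules do not yet cover.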

Our Cash Flow Stress Test Dataset and Behavioral Finance Coaching Pack supply the negative-case and edge-case material. Our CDFI / Underbanked Pack supplies the legitimate-but-superficially-anomalous cases that drive false positives in production. None of these alone is sufficient for typology-positive testing — that's a gap we'd happily produce a dedicated AML pack to fill, and we're collecting requirements from teams in this space; if you have a use case, tell us.

The SR 11-7 model-validation overlay

If your firm is bank-affiliated or operates under examination by a federal banking regulator, the AML engine sits within scope of SR 11-7 model validation. The three pillars — conceptual soundness, ongoing monitoring, and outcomes analysis — each have specific implications for transaction monitoring:

Conceptual soundness. The engine's rule logic must be documented, the typology mapping must be explicit, and the data dependencies must be auditable. "Our engine fires when this regex matches the wire memo field" is not a defensible documentation standard. "Our engine implements the structuring detection pattern in FinCEN Advisory X using the data fields documented in our schema, with rule sensitivity tuned to the published advisory thresholds and our internal false-positive tolerance" is.

Ongoing monitoring. The engine's performance must be re-tested at least annually and after any significant model change. This is where a stable synthetic test corpus pays for itself across years — the same corpus, run against the engine before and after a tuning change, produces directly comparable performance metrics.
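
A before-and-after comparison on a fixed corpus reduces to a small diff over named metrics. In the sketch below, the metric names and the `max_drop` tolerance are hypothetical; the point is only that identical inputs make the two runs directly comparable.

```python
def compare_runs(before, after, max_drop=0.02):
    """Return metrics whose value fell by more than `max_drop` between two
    runs on the same corpus, as {metric: (old, new)}. A metric missing
    from the later run is reported with new=None."""
    regressions = {}
    for metric, old in before.items():
        new = after.get(metric)
        if new is None:
            regressions[metric] = (old, None)
        elif old - new > max_drop:
            regressions[metric] = (old, new)
    return regressions
```

An empty result is a reasonable gate for promoting a tuning change, and a non-empty one gives the validation file a concrete record of what regressed.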

Outcomes analysis. SAR-filing rates, alert-clearance rates, alert-false-positive rates, and analyst-disposition consistency must be tracked over time.
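
These rates can be derived from a flat list of alert dispositions. The disposition labels and the exact denominators below are assumptions for illustration; firms define these rates differently, and the definitions chosen should be written into the validation documentation.

```python
from collections import Counter


def outcome_rates(dispositions):
    """Compute simple outcome rates from one disposition label per alert.
    Assumed labels: 'sar_filed', 'cleared_false_positive',
    'cleared_explained', 'open'. Rates other than clearance_rate use
    closed alerts as the denominator."""
    counts = Counter(dispositions)
    closed = sum(n for label, n in counts.items() if label != "open")
    if closed == 0:
        return {}
    return {
        "sar_filing_rate": counts["sar_filed"] / closed,
        "false_positive_rate": counts["cleared_false_positive"] / closed,
        "clearance_rate": closed / sum(counts.values()),
    }
```

Computing the same dictionary per month or per quarter produces the trend lines an examiner is likely to ask for.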

We've published a separate SR 11-7 model validation playbook that walks through this in operational detail.

A defensible build sequence

Step 1
Inventory typology coverage requirements
Pull FinCEN advisories applicable to your customer base, geography, and product mix. Map each typology to a detection requirement. The BSA Officer signs off on this as engine scope.
Step 2
Audit data-model capture against requirements
For each typology, walk through whether your data model contains the fields required to detect it. Gaps here are upstream engineering work.
Step 3
Build the rule layer in tiers
Tier 1 first (deterministic), Tier 2 next (behavioral baselines), Tier 3 last (graph patterns).
Step 4
Establish synthetic test corpus before tuning
Tuning thresholds against production data is contaminating — production is the very distribution the engine has to perform on. Tune against synthetic, validate against held-out production.
Step 5
Document the alert-clearance workflow
Every alert disposition must produce a record an examiner can read. The workflow is the engine in the regulatory sense.
Step 6
Run SR 11-7 docs in parallel with engineering
Written six months after launch, SR 11-7 documentation is a research project. Built into the engineering cadence, it's a few hours per week.
Step 7
Schedule first validation cycle BEFORE launch
Most firms launch then schedule validation. Reverse the order: model the validation cycle first, build the engine to satisfy it, run the cycle as the launch criterion.
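
Steps 1 and 2 lend themselves to a simple set comparison between what each typology needs and what the data model captures. The field lists in this sketch are hypothetical examples, not an authoritative mapping; the real mapping is the one the BSA Officer signs off on.

```python
# Hypothetical typology-to-field requirements for illustration only.
REQUIRED_FIELDS = {
    "structuring": {"amount", "channel", "timestamp", "branch_id",
                    "customer_id"},
    "layering": {"originator_account", "beneficiary_account", "timestamp",
                 "amount"},
    "source_of_funds": {"customer_archetype", "expected_volume_range",
                        "amount"},
    "funnel_account": {"originating_geography", "beneficiary_account",
                       "channel"},
}


def coverage_gaps(captured_fields, required=REQUIRED_FIELDS):
    """Return {typology: [missing fields]} for every typology the current
    data model cannot fully support. Typologies with no gap are omitted."""
    return {typology: sorted(needed - captured_fields)
            for typology, needed in required.items()
            if needed - captured_fields}
```

The output doubles as the upstream engineering backlog from Step 2 and as supporting evidence when justifying why a typology is currently out of scope.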

What examiners actually ask

AML program exams tend to converge on a recognizable set of questions. They cluster:

Six questions a strong program can answer cold

Show me the typologies your engine detects, mapped to the FinCEN advisories.
Show me a sample of cleared alerts. Walk me through why each was cleared.
Show me your false-positive rate and how it has trended.
Show me your SAR-filing rate, year over year, broken down by typology.
Show me what changed in the engine in the last twelve months and the validation that supported each change.
Show me a typology your engine doesn't detect, and your justification for the scope.

A team that can answer all six confidently is in good shape. A team that can answer the first five but not the sixth — i.e., a team that hasn't explicitly thought about scope decisions — is in worse shape than they realize, because the sixth question is the one the examiner often saves for last.

Closing

AML transaction monitoring is one of the highest-stakes systems in any fintech, and one of the most expensive to get wrong — both in regulatory cost and in the operational cost of running an engine that produces too many alerts to clear. The engineering work and the testing work are inseparable: an engine no one trusts produces false positives until analysts learn to ignore it; an engine no one tests produces false negatives until an examiner finds them.

The discipline that distinguishes a defensible AML engine from a fragile one is the discipline of testing against a corpus that exhibits the patterns the engine is supposed to detect before those patterns appear in production. That's a synthetic-data problem, and it's one of the cleanest cases for archetype-driven generation we know.

If you'd like to discuss the specific typology coverage your engine needs and how a synthetic test corpus could exercise it, reach out. And if you're starting fresh, the free sample on GitHub lets you inspect the schema before you commit to anything.

Key takeaways

The four most-cited AML deficiencies cluster around scope, typology coverage, alert-clearance documentation, and SAR-filing auditability — engineering owns the first two.
Detection ceiling = data-model floor: layering and source-of-funds anomalies require a transaction graph and per-archetype behavioral baselines that most engines don't carry.
Separate the rule layer into three tiers (threshold / behavioral baseline / pattern graph) by cadence and infrastructure — conflating them produces an engine that's either too slow or too noisy.
Tune against a synthetic corpus, validate against held-out production. Tuning rule thresholds against production data is contaminating by definition.
SR 11-7 documentation written six months after launch is a research project; written in parallel with engineering, it's a few hours a week and the difference between 'we have an engine' and 'we have a defensible program.'

This document is general guidance for engineering and compliance teams designing transaction-monitoring engines. It is not legal advice and is not a substitute for the firm's BSA Officer's program documentation. Firms with active examinations or specific regulatory questions should consult qualified counsel.