Recall (Sensitivity): Definition and Use in Compliance
Recall (sensitivity) is a classification model performance metric that measures the proportion of actual positive cases a model correctly identifies. In financial crime compliance, it quantifies how many real crimes a detection system catches out of all crimes that actually occurred.
What is Recall (Sensitivity)?
Recall is the fraction of actual positive cases that a classification model correctly identifies. The formula: True Positives divided by the sum of True Positives and False Negatives. If your transaction monitoring system reviews 1,000 real money laundering transactions and flags 850, recall is 85%. The 150 it missed are false negatives: real crimes the system didn't catch.
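As a minimal sketch, the formula above can be written as a small Python function. The function name `recall` and its argument names are illustrative and do not come from any particular library.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the share of actual positives that were caught."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return true_positives / actual_positives


# The example from the text: 1,000 real laundering transactions, 850 flagged.
flagged = 850
missed = 1_000 - flagged  # 150 false negatives
print(f"recall = {recall(flagged, missed):.0%}")  # prints "recall = 85%"
```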
In financial crime, false negatives have direct legal consequences. A transaction monitoring model that misses a reportable transaction means a Suspicious Activity Report never gets filed. That's a Bank Secrecy Act failure whether the model was technically sophisticated or not. Regulators don't grade on effort.
"Sensitivity" is the exact statistical equivalent of recall, borrowed from epidemiology. A diagnostic test with 80% sensitivity misses 20% of people who actually have the condition. A fraud model with 80% recall misses 20% of genuine fraud transactions. The math is identical. Financial crime compliance adopted both terms as automated detection systems became standard in the early 2000s, and you'll see them used interchangeably in model validation documentation.
Recall sits in direct tension with precision. Precision measures what fraction of your flagged transactions are genuinely suspicious. When you lower the model's decision threshold to catch more real crimes, you also flag more clean transactions. Recall goes up; precision typically goes down. There's no free lunch. Every compliance team accepts some level of false positives to achieve a given recall rate.
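The tradeoff can be illustrated with a small sketch. The scores and labels below are made up for illustration, and `precision_recall` is a hypothetical helper, not a library call.

```python
# Hypothetical (risk_score, is_suspicious) pairs; the values are invented.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
    (0.20, False), (0.10, False),
]


def precision_recall(scored, threshold):
    """Return (precision, recall) when every score >= threshold raises an alert."""
    tp = sum(1 for score, label in scored if score >= threshold and label)
    fp = sum(1 for score, label in scored if score >= threshold and not label)
    fn = sum(1 for score, label in scored if score < threshold and label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Compare a strict threshold with a looser one.
for threshold in (0.85, 0.50):
    p, r = precision_recall(scored, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.85 to 0.50 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.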
The confusion matrix puts recall in full context. It shows all four outcome types: True Positives (correctly flagged), True Negatives (correctly cleared), False Positives (wrongly flagged), and False Negatives (wrongly missed). Recall is one ratio in that matrix: TP / (TP + FN). Reading the full matrix reveals what single metrics hide. A model can achieve 99.9% overall accuracy on an imbalanced dataset by predicting "clean" for everything, and its recall for financial crime is exactly zero.
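A minimal sketch of how the four cells can be counted from labeled outcomes, assuming parallel lists of booleans where True means suspicious. The data and the `confusion_matrix` helper are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count the four outcome types from parallel lists of booleans."""
    counts = Counter()
    for is_crime, was_flagged in zip(actual, predicted):
        if is_crime and was_flagged:
            counts["TP"] += 1   # correctly flagged
        elif is_crime and not was_flagged:
            counts["FN"] += 1   # wrongly missed
        elif not is_crime and was_flagged:
            counts["FP"] += 1   # wrongly flagged
        else:
            counts["TN"] += 1   # correctly cleared
    return counts


# Hypothetical labeled sample: 10 transactions, 3 of them confirmed suspicious.
actual    = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False, False, False]

cm = confusion_matrix(actual, predicted)
recall = cm["TP"] / (cm["TP"] + cm["FN"])
accuracy = (cm["TP"] + cm["TN"]) / sum(cm.values())
print(dict(cm), f"recall={recall:.2f}", f"accuracy={accuracy:.2f}")
```

Here accuracy is 0.80 while recall is only about 0.67, which is the gap a single headline metric can hide.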
How is Recall (Sensitivity) used in practice?
The most common application is transaction monitoring model validation. At least annually, or after any significant change in the payment environment, model risk teams test the current model against labeled datasets of confirmed suspicious activity. The recall result goes into the validation report and is compared against the prior period. A drop from 84% to 71% over 12 months is a material finding that requires investigation: criminal patterns may have shifted, data quality may have degraded, or the model's training distribution no longer matches current behavior.
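A period-over-period comparison like this could be sketched as follows. The `recall_change` helper and its 5-point `tolerance` are assumptions for illustration; each institution would define its own materiality threshold.

```python
def recall_change(prior: float, current: float, tolerance: float = 0.05) -> dict:
    """Compare recall across validation periods and flag drops beyond a tolerance.

    `tolerance` is an assumed materiality threshold in absolute recall terms.
    """
    drop = prior - current
    return {
        "prior": prior,
        "current": current,
        "drop": round(drop, 4),
        "material": drop > tolerance,
    }


# The example from the text: recall falls from 84% to 71% over 12 months.
result = recall_change(prior=0.84, current=0.71)
print(result)
```

With these inputs the drop is 13 points, so the check marks it as material under the assumed tolerance.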
Here's a concrete example. A US community bank was filing 200 SARs per month in 2022. OCC examiners noted that peer institutions with similar transaction volumes were filing 400 to 500. The bank ran a retrospective analysis and found its monitoring model had 61% recall on structuring patterns. After threshold adjustment and adding a velocity rule, recall improved to 83% and monthly SAR filings rose to 380.
Alert queue sizing is another practical application. If recall is 80% and the institution processes 5 million transactions daily, the model misses roughly 250 per day that could be reportable (assuming a 0.025% base rate of suspicious activity, which means about 1,250 suspicious transactions per day, of which 20% go unflagged). Knowing that number helps the Money Laundering Reporting Officer quantify residual risk and decide whether the current detection rate is acceptable given the institution's risk appetite and investigator capacity.
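A minimal sketch of the residual-risk arithmetic, using the inputs stated above (5 million daily transactions, a 0.025% base rate, 80% recall). The function name `missed_per_day` is illustrative.

```python
def missed_per_day(daily_transactions: int, base_rate: float, recall: float) -> float:
    """Expected suspicious transactions per day that the model does not flag."""
    suspicious_per_day = daily_transactions * base_rate
    return suspicious_per_day * (1.0 - recall)


# Inputs from the text: 5 million transactions, 0.025% base rate, 80% recall.
missed = missed_per_day(5_000_000, 0.00025, 0.80)
print(f"expected misses per day: {missed:.0f}")
```

Because the result scales linearly with the base rate, which is itself an estimate, it is worth rerunning the calculation across a range of plausible base rates rather than relying on one figure.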
Sanctions screening operates under a different recall standard. OFAC penalties run per violation, so most compliance programs set a near-100% recall target and accept very low precision in return. A major global bank might generate 40,000 alerts daily to catch 12 genuine SDN hits. Analysts know they're reviewing mostly noise, but no compliance committee is willing to miss a sanctions match to reduce workload.
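To make "very low precision" concrete, here is a short calculation using the illustrative figures from the paragraph above. The variable names are hypothetical.

```python
daily_alerts = 40_000   # alerts generated per day (figure from the text)
genuine_hits = 12       # genuine SDN matches among them (figure from the text)

precision = genuine_hits / daily_alerts
alerts_per_hit = daily_alerts / genuine_hits
print(f"precision = {precision:.2%}, about {alerts_per_hit:,.0f} alerts per genuine hit")
```

That works out to a precision of 0.03%, or roughly 3,333 alerts reviewed for each genuine match.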
The MLRO owns recall performance across the full program. When examiners ask whether the institution is catching what it should be catching, the answer must come with evidence: backtesting results, typology coverage analysis, and recall metrics tracked over time. "We believe our systems are effective" is not sufficient. Numbers are.
Recall (Sensitivity) in regulatory context
No regulation uses the term "recall" directly. The obligation is functional: demonstrate that your detection systems catch financial crime at a rate appropriate to your risk profile.
The Federal Reserve's SR 11-7 guidance on model risk management, published in April 2011, set the standard that US banking supervision now applies broadly. It requires model performance to be validated against out-of-sample data and error rates to be documented. For a transaction monitoring model, error rates include recall. SR 11-7 also requires "ongoing monitoring to confirm the model continues to perform as intended." Unexplained recall decay is exactly the kind of performance change that monitoring is designed to surface before an examiner does.
FinCEN's SAR filing requirements under 31 U.S.C. § 5318(g) create an indirect floor on acceptable recall. An institution that misses reportable transactions because its model has inadequate detection coverage has a BSA compliance failure. The 2012 HSBC deferred prosecution agreement with the Department of Justice, part of a $1.92 billion settlement, cited systematic detection failures across multiple product lines. "Recall" doesn't appear in the agreement, but the underlying problem was that suspicious transactions flowed through undetected at scale.
The Financial Action Task Force's Recommendation 1 on the risk-based approach requires detection intensity to match risk level. In practice, higher-risk customer segments need higher recall thresholds. A politically exposed person account warrants tighter detection parameters than a standard retail account, because the consequence of a missed transaction is greater.
The EBA's 2021 Guidelines on internal governance (EBA/GL/2021/05) extended model risk management expectations to AML models. AI-based detection systems are now subject to the same documentation and validation requirements as credit scoring models, including tracking recall over time as part of ongoing model performance obligations.
UK-regulated firms face equivalent requirements under the FCA's Senior Management Arrangements, Systems and Controls sourcebook (SYSC 6.3). The FCA's 2021 prosecution of NatWest under the Money Laundering Regulations 2007, which resulted in a £264.8 million fine, centered on the bank's failure to adequately monitor £365 million in deposits from a single customer. Inadequate detection coverage is the heart of that case.
Common challenges and how to address them
The hardest part of managing recall is that you don't know what you missed. A false negative is invisible by definition. The transaction cleared. No alert fired. You only discover the gap through backtesting, typology exercises, peer benchmarking, or when an examiner points to activity that should have generated a Suspicious Activity Report.
Class imbalance is the root structural problem. In a large retail bank's portfolio, genuinely suspicious transactions might represent 0.01% of volume. A model trained on this data can achieve 99.99% overall accuracy by predicting "clean" for everything. Its recall for financial crime is zero. This is why model validation reports should never rely on overall accuracy alone. Recall on the positive class is the number that matters.
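The accuracy trap can be reproduced with a few lines of arithmetic. This sketch assumes a hypothetical portfolio of one million transactions at the 0.01% base rate mentioned above and a degenerate "model" that clears everything.

```python
total_transactions = 1_000_000
suspicious = total_transactions // 10_000  # 0.01% base rate -> 100 transactions

# A degenerate "model" that predicts "clean" for every transaction.
tp, fp = 0, 0
fn = suspicious
tn = total_transactions - suspicious

accuracy = (tp + tn) / total_transactions
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.2%}, recall = {recall:.0%}")  # accuracy = 99.99%, recall = 0%
```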
Threshold tuning is the primary lever for recall adjustment. Every classification model produces a risk score. The threshold determines what score generates an alert. Lower the threshold and recall rises. Raise it and recall falls. The F1 score combines recall and precision into a single summary number, useful for comparing candidate thresholds. The ROC curve gives you the full recall/false-positive-rate tradeoff across all thresholds, which lets you identify a suitable operating point before committing; ROC AUC summarizes that curve as a single number.
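A threshold sweep might look like the sketch below. The scored data, the candidate thresholds, and the `metrics_at` helper are all hypothetical; each (fpr, recall) pair is one point on the ROC curve.

```python
def metrics_at(scored, threshold):
    """Recall, precision, false positive rate and F1 at one alert threshold."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    tn = sum(1 for s, y in scored if s < threshold and not y)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"threshold": threshold, "recall": recall,
            "precision": precision, "fpr": fpr, "f1": f1}


# Hypothetical (risk_score, is_suspicious) pairs for illustration only.
scored = [
    (0.92, True), (0.85, True), (0.77, False), (0.64, True), (0.58, False),
    (0.51, False), (0.43, True), (0.35, False), (0.22, False), (0.10, False),
]

candidates = [0.9, 0.7, 0.5, 0.3]
rows = [metrics_at(scored, t) for t in candidates]
for row in rows:
    print({k: round(v, 2) for k, v in row.items()})

# Pick the candidate with the highest F1 as one possible selection rule.
best = max(rows, key=lambda row: row["f1"])
print("highest F1 at threshold", best["threshold"])
```

F1 weights precision and recall equally, so the highest-F1 threshold is only a starting point. A program that treats missed crimes as costlier than wasted analyst time may reasonably choose a lower threshold than F1 alone suggests.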
Concept drift is a slower-moving problem. Fraud and money laundering patterns change. A model trained on 2022 data may have high recall on 2022 patterns but miss synthetic identity fraud techniques that became common in 2024. Ongoing model monitoring against recent labeled data is the only reliable way to catch this degradation before an examiner does.
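One way to surface this kind of degradation is to compute recall separately for each period of confirmed cases. The sketch below assumes a hypothetical list of (period, was_flagged) pairs, one per confirmed suspicious case; the counts are invented.

```python
from collections import defaultdict


def recall_by_period(cases):
    """Group confirmed-suspicious cases by period and compute recall for each.

    `cases` is an iterable of (period, was_flagged) pairs, one per confirmed case.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for period, was_flagged in cases:
        total[period] += 1
        flagged[period] += int(was_flagged)
    return {period: flagged[period] / total[period] for period in sorted(total)}


# Hypothetical confirmed cases labeled with the year the activity occurred.
cases = (
    [("2022", True)] * 17 + [("2022", False)] * 3
    + [("2024", True)] * 12 + [("2024", False)] * 8
)
print(recall_by_period(cases))  # {'2022': 0.85, '2024': 0.6}
```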
Explainability connects to recall in a way that's often underappreciated. When analysts can't understand why a model flagged a transaction, they're more likely to close valid alerts incorrectly. That converts true positives into effective false negatives at the disposition stage. Surfacing the model's reasoning adds some latency to investigations, but the accuracy gain from transparent models is worth it: systems that surface clear reasoning retain more of their mathematical recall in practice, because analysts don't second-guess results they can verify.
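The gap between model recall and recall after analyst disposition can be sketched as follows. The `effective_recall` helper and the numbers are hypothetical.

```python
def effective_recall(true_positives: int, false_negatives: int, closed_in_error: int) -> float:
    """Recall after analyst disposition.

    `closed_in_error` counts valid alerts (true positives) that analysts closed
    without escalation; downstream they behave like false negatives.
    """
    if closed_in_error > true_positives:
        raise ValueError("cannot close more valid alerts than the model raised")
    retained = true_positives - closed_in_error
    return retained / (true_positives + false_negatives)


# Hypothetical numbers: the model flags 850 of 1,000 real cases,
# and analysts wrongly close 100 of those 850 valid alerts.
print(f"model recall     = {850 / 1000:.0%}")                        # 85%
print(f"effective recall = {effective_recall(850, 150, 100):.0%}")   # 75%
```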
Related terms and concepts
Recall is one vertex of a measurement triangle that includes precision and the F1 score. Precision measures what fraction of your flagged transactions are genuinely suspicious. The F1 score is the harmonic mean of precision and recall, useful when you need a single number to compare model versions or threshold settings. In compliance, teams track both precision and recall because the failure modes cost different things. Low precision wastes analyst time. Low recall misses crimes.
The false negative rate is the arithmetic complement of recall: if recall is 84%, the false negative rate is 16%. Both express the same information. Some compliance teams prefer the false negative rate framing because it makes the miss rate explicit. Telling the board "our model has 84% recall" is less visceral than saying "our model misses 16% of financial crimes."
Specificity (the true negative rate) measures the other side: what fraction of clean transactions does the model correctly clear? High specificity means low false positive rates. Banks that prioritize specificity tend to have manageable alert volumes but may carry more recall risk than they realize.
The confusion matrix shows all four outcomes together: True Positives, True Negatives, False Positives, and False Negatives. Recall uses only two of those cells. Reading the full matrix provides the context that any single metric hides.
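The related metrics in this section can all be derived from the same four counts. The sketch below uses hypothetical counts chosen so that recall matches the 84% figure used earlier; `related_metrics` is an illustrative helper and assumes every denominator is non-zero.

```python
def related_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive recall and its companion metrics from confusion-matrix counts.

    Assumes tp + fn, tp + fp and tn + fp are all non-zero.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "false_negative_rate": fn / (tp + fn),   # equals 1 - recall
        "precision": precision,
        "specificity": tn / (tn + fp),           # true negative rate
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
    }


# Hypothetical counts: 100 real cases (84 caught), 9,900 clean (400 wrongly flagged).
m = related_metrics(tp=84, tn=9_500, fp=400, fn=16)
for name, value in m.items():
    print(f"{name:>20}: {value:.4f}")
```

In this example specificity is high (about 96%) while precision is low (about 17%), because clean transactions vastly outnumber suspicious ones.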
ROC AUC aggregates recall and false positive rate across all possible thresholds into a single number: the area under the ROC curve. A model with an AUC of 0.95 ranks a randomly chosen suspicious transaction above a randomly chosen clean one 95% of the time, regardless of where the threshold is set. It's a standard summary metric in model validation exercises and gives examiners a threshold-independent view of model performance.
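One way to read ROC AUC is as the probability that a randomly chosen suspicious transaction receives a higher score than a randomly chosen clean one. The sketch below computes it that way on made-up data. The pairwise loop is a simple illustration, not how production libraries implement it.

```python
def roc_auc(scored):
    """ROC AUC as the fraction of (suspicious, clean) pairs ranked correctly.

    Ties count as half a win. This pairwise form is O(n^2), which is fine
    for a small sketch but not for production-scale data.
    """
    positives = [s for s, y in scored if y]
    negatives = [s for s, y in scored if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one suspicious and one clean example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical (risk_score, is_suspicious) pairs for illustration only.
scored = [
    (0.92, True), (0.85, True), (0.77, False), (0.64, True), (0.58, False),
    (0.51, False), (0.43, True), (0.35, False), (0.22, False), (0.10, False),
]
print(f"AUC = {roc_auc(scored):.3f}")  # 20 of 24 pairs ranked correctly -> 0.833
```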
AI governance frameworks now effectively require detection performance, recall included, to be tracked as part of ongoing model performance reporting, even though they don't name the metric. Under SR 11-7 and EBA/GL/2021/05, institutions that can't produce a recall trend line for their core AML model are operationally blind to whether that model still works. Regulatory examiners ask for exactly that documentation, and "we believe performance is adequate" without data to support it is an examination finding.
Where does the term come from?
The term "recall" comes from information retrieval research. Cyril Cleverdon's Cranfield experiments in the 1960s, which evaluated document retrieval systems at Cranfield College of Aeronautics, established recall and precision as the standard evaluation pair. "Sensitivity" arrived from clinical epidemiology, where it described what fraction of diseased patients a diagnostic test correctly identifies. Both terms entered financial crime compliance in the early 2000s as automated transaction monitoring became standard practice. The Federal Reserve's SR 11-7 guidance (April 2011) formalized the expectation that AML models be evaluated using statistical performance metrics, embedding recall into regulatory examination practice.
How FluxForce handles recall (sensitivity)
FluxForce AI agents monitor recall (sensitivity)-related patterns in real time, flag anomalies for analyst review, and generate evidence-backed decisions with full audit trails.