
What is the difference between recall and precision?

Quick answer

Precision measures how many of a model's flagged cases are actually correct. Recall measures how many genuinely suspicious cases the model catches. In AML and fraud AI, high recall means fewer missed crimes; high precision means fewer false alerts. Improving one typically reduces the other.

The full answer

Precision is the fraction of a model's positive predictions that are correct. Recall is the fraction of all actual positives the model catches. Both come from the same confusion matrix:

  • Precision = True Positives / (True Positives + False Positives)
  • Recall = True Positives / (True Positives + False Negatives)
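The two formulas can be checked in a few lines of Python. The counts below are invented purely for illustration:

```python
# Precision and recall from confusion-matrix counts.
# The counts are hypothetical, chosen only to illustrate the formulas.
tp = 80   # true positives: suspicious cases correctly flagged
fp = 320  # false positives: clean cases flagged in error
fn = 20   # false negatives: suspicious cases the model missed

precision = tp / (tp + fp)  # 80 / 400 = 0.20 — 1 in 5 alerts is real
recall = tp / (tp + fn)     # 80 / 100 = 0.80 — 4 in 5 real cases caught

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note how the same model can look strong on one metric and weak on the other: here it catches 80% of real cases while generating four false alerts for every true one.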

A model that flags every transaction as suspicious has perfect recall (it catches 100% of real crimes) but near-zero precision (almost all its flags are wrong). A model that only flags cases it's 99% confident about has high precision but misses a large share of real activity. Neither extreme is workable.

The tradeoff exists because both metrics share True Positives in the numerator but have different denominators. Lowering the classification threshold increases True Positives but also increases False Positives, which pulls precision down. Raising it reduces False Positives but also reduces True Positives, which pulls recall down.
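The threshold mechanics can be made concrete with a small sweep. The scores and labels below are invented for illustration, not drawn from any real system:

```python
# Sweep the classification threshold over hypothetical scored cases
# to show the precision/recall tradeoff described above.
cases = [  # (model score, actually suspicious?)
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
    (0.20, True), (0.10, False),
]

def precision_recall(threshold):
    """Flag every case scoring at or above the threshold, then count."""
    tp = sum(1 for s, y in cases if s >= threshold and y)
    fp = sum(1 for s, y in cases if s >= threshold and not y)
    fn = sum(1 for s, y in cases if s < threshold and y)
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

for t in (0.9, 0.5, 0.1):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold from 0.9 to 0.1 in this toy data drives recall from 0.40 up to 1.00 while precision falls from 1.00 to 0.50: exactly the tradeoff the shared numerator and differing denominators predict.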

The F1 score is the harmonic mean of both: F1 = 2 × (precision × recall) / (precision + recall). Use it when comparing models with a single number and when you can't tolerate extreme failure in either direction. F1 approaching 1.0 means both metrics are high.
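A quick sketch shows why the harmonic mean, unlike an arithmetic average, punishes imbalance between the two metrics:

```python
# F1 = 2 * (precision * recall) / (precision + recall).
# The harmonic mean collapses toward the smaller of the two inputs.
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))   # both high and balanced -> 0.9
print(f1(1.0, 0.01))  # extreme imbalance -> ~0.02, where an
                      # arithmetic mean would misleadingly report 0.505
```

The second case is the flag-almost-nothing model from above: perfect precision cannot rescue an F1 score when recall has collapsed.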

In practice, compliance teams don't pick thresholds based on F1 alone. They pick based on the acceptable false-negative rate for a given typology. For sanctions screening, the recall floor is essentially 100%: missing a sanctioned payment is a direct regulatory violation. For broader AML transaction monitoring, where 90% or more of alerts are already false positives, the operational cost of precision failures is already enormous. Most AI investment in this space is aimed at lifting precision while holding recall steady.

The Basel Committee on Banking Supervision's 2023 paper on machine learning in credit risk specifically addresses the need for institutions to track discriminatory power metrics over a model's full operational life, not just at deployment. Precision and recall, and how they shift as data distributions change, are part of that obligation.

Why this matters

Recall failures have regulatory consequences. A false negative in AML is a transaction that should have triggered a SAR but didn't. Banks have 30 days to file a SAR after detecting suspicious activity, or 60 days if no suspect has been identified. A model with poor recall at scale creates systematic filing gaps. Examiners find those gaps in look-back reviews.

Consistent model underperformance is one of the conditions that triggers a regulatory exam or escalates an ongoing one. The outcomes for banks that fail AML exams include consent orders, monitorship, and civil money penalties. Failures that can be traced to model recall deficiencies are harder to defend than operational lapses.

The Federal Reserve's SR 11-7 guidance (issued by the OCC as Bulletin 2011-12) requires model validation teams to document performance quantitatively. Examiners expect precision-recall curves, confusion matrices, and a written rationale for the operating threshold. A validation report that characterizes model performance only in qualitative terms fails that standard.

The EU AI Act (Regulation 2024/1689) raises the bar further. AI systems used for creditworthiness and similar financial risk assessments are high-risk under Annex III. Article 9 mandates a risk management system; Article 12 requires automatic logging over the system's operational lifetime; and Article 15 requires that accuracy metrics, which in practice include precision and recall, be declared and maintained. Obligations for EU-based providers and deployers of AI in financial services phase in from August 2025 onwards.

For use cases like detecting mule accounts, recall is the dominant concern. A mule account the model misses facilitates further fraud and onward transfers before anyone notices. For AI-based AML transaction monitoring broadly, the challenge is maintaining recall while improving precision from the historically poor levels produced by legacy rule-based systems.

The asymmetry is worth stating plainly. Low precision is an efficiency problem. Low recall is a compliance problem. Regulators don't fine banks for reviewing too many alerts. They fine banks for missing crimes.

When documenting model performance for internal audit or an examiner, the minimum expected output is:

  1. Precision and recall at the current operating threshold
  2. The full precision-recall curve showing performance across all possible thresholds
  3. The rationale for the chosen threshold, tied to the acceptable false-negative rate for each typology covered
  4. A review cadence, named ownership, and an escalation path for when performance degrades

This documentation should exist before an exam, not be assembled after the request arrives.
