
Precision: Definition and Use in Compliance


Precision is a classification model metric that measures the proportion of positive predictions that are correct, calculated as true positives divided by the sum of true positives and false positives.

What is Precision?

Precision answers one question: when your detection system says "flag this," how often is it right? It's calculated as true positives divided by the sum of true positives and false positives. A model that raised 2,000 alerts last month, 400 of which led to Suspicious Activity Reports, has 20% precision.
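The arithmetic behind that figure can be sketched in a few lines of Python. The function name `precision` and the variable names are illustrative only; the numbers are the ones from the example above.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of flagged items that turned out to be genuinely suspicious."""
    flagged = true_positives + false_positives
    if flagged == 0:
        # With no alerts at all, the ratio has a zero denominator.
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged


# Figures from the example: 2,000 alerts, 400 of which led to SARs.
alerts = 2_000
sar_alerts = 400
monthly_precision = precision(sar_alerts, alerts - sar_alerts)
print(f"{monthly_precision:.0%}")  # prints 20%
```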

That 20% figure is better than many programs manage. Across financial institutions using rule-based transaction monitoring, industry reviews and regulatory examination findings have consistently documented precision rates below 10%. At a mid-sized bank processing 50,000 alerts monthly at 8% precision, analysts close roughly 46,000 alerts per month on customers who don't warrant a filing. That's the operational reality most compliance programs are working against.

The confusion matrix makes precision concrete. Four cells: true positives (flagged and genuinely suspicious), false positives (flagged but clean), false negatives (missed despite being suspicious), and true negatives (not flagged and clean). Precision only concerns the two flagged cells, true positives and false positives. Of everything you flagged, how many were real?

What precision doesn't capture is equally important. A model with 95% precision could be missing 80% of actual financial crime. Precision says nothing about what you didn't catch. That's the domain of recall, and the two metrics trade off directly. Every compliance program has to decide where it sits on the precision-recall curve, and that decision should be explicit, documented, and defensible to a bank examiner.
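To make the gap between the two metrics concrete, here is a minimal sketch that computes both from the four confusion matrix cells. The class name `ConfusionCounts` and the counts are hypothetical, chosen only so that precision comes out at 95% while 80% of suspicious cases are missed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # flagged and genuinely suspicious
    fp: int  # flagged but clean
    fn: int  # missed despite being suspicious
    tn: int  # not flagged and clean

    def precision(self) -> float:
        # Only the flagged cells matter here.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Only the genuinely suspicious cells matter here.
        return self.tp / (self.tp + self.fn)


# Hypothetical counts: 200 alerts, 190 of them real, but 760 real cases never flagged.
counts = ConfusionCounts(tp=190, fp=10, fn=760, tn=99_040)
print(f"precision={counts.precision():.0%} recall={counts.recall():.0%}")
```

The same alert queue looks excellent on precision and poor on recall, which is why the two are reported together.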

A concrete example: a regional bank ran a rules-based AML system generating 5,000 alerts per month at 4% precision. Two hundred led to SARs. The remaining 4,800 were time spent on nothing. After recalibrating with properly tuned thresholds, the bank processed 1,200 alerts monthly at 17% precision. SAR output held steady. The analyst backlog dropped from over 8,000 queued items to under 900 in six weeks.

How is Precision used in practice?

Compliance analysts interact with precision indirectly every day: every alert they close without filing is a data point that feeds the false positive count. But precision as an explicit metric lives in the model governance and analytics layer, where teams measure it, report it, and use it to drive threshold decisions.

The operational workflow usually looks like this. After any model change, an analytics team runs the updated model against a labeled historical dataset. They produce a precision-recall curve at different detection thresholds and identify candidate operating points. The point going to production reflects how many analyst hours are available per month, what the institution's risk appetite says about acceptable missed-detection rates, and what the last examination found.
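A rough sketch of that workflow, under stated assumptions: the scores and labels are a tiny made-up labeled sample, and `max_alerts` and `min_recall` are hypothetical stand-ins for analyst capacity and the risk-appetite floor on recall. A real selection would weigh more factors than these two.

```python
from typing import NamedTuple, Optional


class OperatingPoint(NamedTuple):
    threshold: float
    alerts: int
    precision: Optional[float]  # None when no alerts fire
    recall: float


def sweep_thresholds(scores, labels, thresholds):
    """Evaluate each candidate threshold against a labeled historical sample."""
    total_positives = sum(labels)  # assumes the sample contains positives
    points = []
    for t in thresholds:
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        tp = sum(flagged)
        prec = tp / len(flagged) if flagged else None
        points.append(OperatingPoint(t, len(flagged), prec, tp / total_positives))
    return points


def pick_operating_point(points, max_alerts, min_recall):
    """Among points that fit capacity and the recall floor, prefer the highest precision."""
    feasible = [
        p for p in points
        if p.alerts <= max_alerts and p.recall >= min_recall and p.precision is not None
    ]
    return max(feasible, key=lambda p: p.precision, default=None)


# Made-up labeled sample: 1 = genuinely suspicious, 0 = clean.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

points = sweep_thresholds(scores, labels, thresholds=[0.25, 0.45, 0.65, 0.85])
chosen = pick_operating_point(points, max_alerts=5, min_recall=0.7)
print(chosen)
```

On this sample the highest threshold has perfect precision but falls below the recall floor, so the sketch settles on the next one down.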

Alert disposition coding is where precision gets measured in real time. When analysts close alerts, they select a disposition: "no suspicious activity," "escalated to case," "SAR filed," and so on. Institutions that log these codes consistently and link them back to model outputs can track precision on a weekly basis. Those that don't are operating blind between quarterly model validation cycles.
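As an illustration, weekly precision can be derived from a log of closed alerts. This sketch makes two assumptions that an institution would need to set for itself: only "SAR filed" counts as a true positive, and "escalated to case" is treated as unresolved and left out until it closes. The record layout is hypothetical.

```python
from collections import defaultdict
from datetime import date

TRUE_POSITIVE_CODES = {"SAR filed"}       # assumption: only filed SARs count as true positives
UNRESOLVED_CODES = {"escalated to case"}  # assumption: excluded until the case is resolved


def weekly_precision(closed_alerts):
    """closed_alerts: iterable of (close_date, disposition) -> {(iso_year, iso_week): precision}."""
    tallies = defaultdict(lambda: [0, 0])  # week -> [true positives, resolved alerts]
    for closed_on, disposition in closed_alerts:
        if disposition in UNRESOLVED_CODES:
            continue
        iso = closed_on.isocalendar()
        week = (iso[0], iso[1])
        tallies[week][1] += 1
        if disposition in TRUE_POSITIVE_CODES:
            tallies[week][0] += 1
    return {week: tp / total for week, (tp, total) in sorted(tallies.items())}


log = [
    (date(2024, 1, 2), "no suspicious activity"),
    (date(2024, 1, 3), "SAR filed"),
    (date(2024, 1, 4), "no suspicious activity"),
    (date(2024, 1, 5), "no suspicious activity"),
    (date(2024, 1, 9), "SAR filed"),
    (date(2024, 1, 10), "escalated to case"),
    (date(2024, 1, 11), "no suspicious activity"),
]
by_week = weekly_precision(log)
print(by_week)
```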

Threshold tuning is the primary lever for improving precision without full model retraining. Raising the score threshold at which an alert fires reduces false positives and improves precision, at the cost of some recall. This is a tradeoff, and it should be made intentionally. We've seen banks raise thresholds to cut analyst overtime without checking the effect on their false negative rate. That's how compliance programs generate enforcement findings.
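One way to make the tradeoff intentional is to compute the recall impact before a threshold change ships. The sketch below is illustrative: the function names, the sample data, and the `max_recall_drop` tolerance are all assumptions, and it presumes the labeled sample contains positives that fire at the current threshold.

```python
def metrics_at(scores, labels, threshold):
    """Alert volume, precision, recall, and missed positives at one threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return {
        "alerts": tp + fp,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "missed": fn,
    }


def review_threshold_change(scores, labels, current, proposed, max_recall_drop):
    """Compare two thresholds and check the recall loss against a documented tolerance."""
    before = metrics_at(scores, labels, current)
    after = metrics_at(scores, labels, proposed)
    recall_drop = before["recall"] - after["recall"]
    return {
        "before": before,
        "after": after,
        "recall_drop": recall_drop,
        "within_tolerance": recall_drop <= max_recall_drop,
    }


# Made-up labeled sample: 1 = genuinely suspicious, 0 = clean.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

review = review_threshold_change(scores, labels, current=0.45, proposed=0.85, max_recall_drop=0.1)
print(review)
```

Here the proposed threshold doubles precision and cuts alerts by two thirds, but it also doubles the number of missed cases, so the check fails.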

The F1 score is increasingly the number that goes into board-level compliance reports: it summarizes the precision-recall balance in a single figure. But when an examiner sits down with your model validation team, they want precision and recall disaggregated, with documented rationale for why your deployed threshold is appropriate for your specific customer population.

Precision in regulatory context

Regulators don't publish minimum acceptable precision thresholds in statute. What they do is examine model governance documentation and ask whether institutions can demonstrate their detection systems are effective, calibrated, and appropriate for their risk profile. Low precision that can't be explained is a model risk management finding.

In the US, the Federal Reserve's SR 11-7 (2011) and the OCC's Bulletin 2011-12 jointly define model risk management expectations for US banks. Both require ongoing performance monitoring for models used in risk decisions, which includes transaction monitoring and fraud detection. Performance monitoring means tracking metrics like precision and recall, documenting when they change, and explaining why deployed thresholds remain appropriate.

The EU AI governance framework goes further. The EU AI Act (2024), available at eur-lex.europa.eu, requires providers of AI systems it classifies as high-risk to document and disclose accuracy metrics. Institutions should confirm against the Act's annexes whether a given AML transaction monitoring system falls into that category. Where it does, reporting precision and recall across population subgroups is the practical way to detect bias that could cause disparate outcomes for protected groups.

The Basel Committee on Banking Supervision, through its model risk and supervision guidance series at bis.org, requires banks to monitor the performance of classification models used in risk decisions. This covers positive predictive value, which is the statistical synonym for precision, alongside sensitivity (recall). European Banking Authority guidelines under the EU's AML Directive package reference the same requirements, asking institutions to demonstrate that automated detection systems are both effective and operationally sound.

Enhanced Due Diligence programs have their own precision dynamic. When a monitoring system triggers an EDD review, each false positive consumes significant analyst time alongside relationship manager involvement and potential customer friction. A bank receiving 500 EDD triggers per month at 10% precision is spending resources on 450 customers who don't need enhanced scrutiny.

Common challenges and how to address them

The most common precision problem is a miscalibrated threshold that hasn't been updated since model deployment. Customer behavior changes. Product mixes shift. FedNow, RTP, and other real-time payment rails create transaction patterns the original training data never saw. A threshold calibrated in 2020 against a pre-pandemic customer base will typically produce 30-50% more false positives by 2024, with no change to the underlying model.

Class imbalance makes this worse. In fraud and AML datasets, genuine positive cases might represent 0.1-0.5% of transactions. A naive model trained without imbalance adjustments learns to predict "clean" for everything and achieves 99.5% accuracy or better while catching nothing: recall is zero, and precision is undefined because no alerts fire at all. Techniques like cost-sensitive learning and calibrated probability outputs address this, but they require deliberate configuration choices that many off-the-shelf detection systems don't apply by default.
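A small sketch shows why accuracy misleads here. The data is synthetic (5 suspicious cases in 1,000, a 0.5% base rate) and the function names are illustrative. With no alerts raised, the precision ratio has a zero denominator, so the sketch returns `None` for it rather than a number. The weighting function shows one common cost-sensitive heuristic, inverse class frequency, and is not the only option.

```python
def evaluate(predictions, labels):
    """Accuracy, precision, and recall for binary predictions (1 = flagged)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else None  # no alerts, no ratio
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes = sorted(set(labels))
    return {c: len(labels) / (len(classes) * labels.count(c)) for c in classes}


# Synthetic sample: 5 suspicious cases out of 1,000.
labels = [1] * 5 + [0] * 995
always_clean = [0] * len(labels)

accuracy, precision, recall = evaluate(always_clean, labels)
weights = inverse_frequency_weights(labels)
print(accuracy, precision, recall, weights)
```

Under these weights a missed suspicious case costs the model roughly 200 times as much as a misclassified clean one, which removes the incentive to predict "clean" for everything.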

Model monitoring frequency is the operational lever. Teams tracking precision weekly against a rolling labeled sample catch drift at week three or four. Those monitoring quarterly find out during a regulatory examination. The fix is the same either way, but the conversation with an examiner is considerably harder when the bank can't explain why a two-month precision decline went undetected.
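A minimal sketch of a weekly drift check, assuming a series of weekly precision values is already available. The baseline window, the 20% relative drop, and the two-consecutive-weeks rule are illustrative parameters, not recommended settings.

```python
def detect_precision_drift(weekly_precision, baseline_weeks=4, relative_drop=0.2, consecutive=2):
    """Return the index of the week that completes a sustained drop below the floor, or None.

    The floor is the mean of the first `baseline_weeks` values reduced by `relative_drop`.
    """
    baseline = sum(weekly_precision[:baseline_weeks]) / baseline_weeks
    floor = baseline * (1 - relative_drop)
    run = 0
    for i in range(baseline_weeks, len(weekly_precision)):
        run = run + 1 if weekly_precision[i] < floor else 0
        if run >= consecutive:
            return i
    return None


# Made-up series: stable around 17%, then a steady decline.
series = [0.17, 0.18, 0.16, 0.17, 0.16, 0.13, 0.12, 0.11]
alert_week = detect_precision_drift(series)
print(alert_week)
```

Requiring consecutive weeks below the floor trades a short delay for fewer alarms caused by a single noisy week.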

Data quality is an underrated precision driver. Incomplete customer records from onboarding, missing transaction metadata from correspondent relationships, or stale business profile information from Know Your Business checks create information gaps that push models toward higher-uncertainty predictions. Those predictions skew toward false positives. A precision dashboard segmented by customer data completeness tier quickly identifies where the problem is concentrated.
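The segmentation behind such a dashboard is a grouped precision calculation. In this sketch the tier names ("complete", "partial", "sparse") and the alert counts are hypothetical, and the input is assumed to contain flagged alerts only.

```python
from collections import defaultdict


def precision_by_segment(alerts):
    """alerts: iterable of (segment, is_true_positive) for flagged items only."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [true positives, alerts]
    for segment, is_true_positive in alerts:
        counts[segment][1] += 1
        counts[segment][0] += int(is_true_positive)
    return {segment: tp / total for segment, (tp, total) in counts.items()}


# Hypothetical closed alerts, tagged with the customer's data completeness tier.
alerts = (
    [("complete", True)] * 3 + [("complete", False)] * 7
    + [("partial", True)] * 1 + [("partial", False)] * 9
    + [("sparse", True)] * 1 + [("sparse", False)] * 19
)
by_tier = precision_by_segment(alerts)
print(by_tier)
```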

The most defensible approach: model validation requirements that mandate precision and recall reporting at every quarterly review, with documented rationale for why the deployed threshold remains appropriate. This creates an evidence trail examiners respect and a feedback loop that catches problems before they become findings.

Related terms and concepts

Precision is part of a cluster of classification performance metrics that compliance officers should understand together, not in isolation.

Recall is precision's direct counterpart. Recall measures the fraction of actual positive cases that the model correctly identified: TP / (TP + FN). Where precision asks "how good are our alerts?", recall asks "how much are we actually catching?" The tradeoff between the two is the central design decision in every detection system.

The F1 score is the harmonic mean of precision and recall. A model with 95% precision and 5% recall gets an F1 of 9.5%, not 50%. That penalizing property makes the F1 score a genuinely useful single-number summary: it doesn't allow a model to sacrifice one metric entirely to maximize the other.
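The penalizing property is easy to verify. A short sketch comparing the harmonic mean with the arithmetic mean for the 95% / 5% case (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 0.95, 0.05
f1 = f1_score(p, r)
arithmetic_mean = (p + r) / 2
print(f"F1={f1:.3f} arithmetic mean={arithmetic_mean:.3f}")
```

The harmonic mean is pulled toward the weaker of the two values, which is what stops a lopsided model from scoring well.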

False positive and false negative are the building blocks. Precision is a function of how many false positives a model generates relative to true positives. Cutting false positives in half roughly doubles precision when false positives far outnumber true positives, as they do in most monitoring programs, assuming true positives hold. Every analyst disposition code feeds directly into this ratio.
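A quick check of how halving false positives moves precision, using two sets of counts from earlier examples (200 true positives against 4,800 false positives, and 400 against 1,600). The helper name is illustrative. The gain approaches a doubling when false positives dominate and shrinks as precision rises.

```python
def precision_gain_from_halving_fp(tp: int, fp: int):
    """Precision before and after halving false positives, with true positives held constant."""
    before = tp / (tp + fp)
    after = tp / (tp + fp / 2)
    return before, after, after / before


low = precision_gain_from_halving_fp(tp=200, fp=4_800)   # starts at 4% precision
high = precision_gain_from_halving_fp(tp=400, fp=1_600)  # starts at 20% precision

for before, after, ratio in (low, high):
    print(f"{before:.1%} -> {after:.1%} (x{ratio:.2f})")
```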

ROC AUC answers a different question from precision. ROC AUC measures a model's discrimination ability across all possible thresholds: can it rank positive cases higher than negative cases? It's a property of the model itself, independent of the deployed threshold. A model with high ROC AUC can still have poor operational precision if the threshold is misconfigured. The two metrics complement each other rather than duplicate.
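The distinction can be shown with a toy model that ranks perfectly but is deployed at a poorly chosen threshold. This sketch computes ROC AUC directly from its pairwise-ranking definition, which is fine for a handful of records but too slow for production volumes. The scores and labels are made up.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def precision_at(scores, labels, threshold):
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else None


# Made-up sample: both suspicious cases score above every clean one.
scores = [0.9, 0.8, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.2]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
loose = precision_at(scores, labels, threshold=0.25)
tight = precision_at(scores, labels, threshold=0.7)
print(auc, loose, tight)
```

The ranking quality is identical in both cases; only the threshold differs, and with it the share of alerts that are real.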

Explainability connects to precision through the feedback loop. When a false positive fires, knowing why the model flagged the transaction is necessary for meaningful threshold tuning. Without an explanation, teams can adjust thresholds but can't identify the root cause. With one, they can distinguish between a threshold problem, a feature engineering gap, and a data quality issue. That distinction determines whether the fix takes a day or a quarter.


Where does the term come from?

Precision as a formal evaluation metric originates in Cyril Cleverdon's Cranfield experiments (1960-1966), which established precision and recall as the standard evaluation pair for information retrieval systems. Cleverdon defined precision as the proportion of retrieved documents that are relevant, the same formulation machine learning adopted directly.

The term entered financial services through the adoption of ML in transaction monitoring after 2008. The Federal Reserve's SR 11-7 (2011) and the OCC's Bulletin 2011-12 formalized model performance monitoring requirements in US banking, implicitly requiring precision tracking as part of ongoing model oversight. The EU AI Act (2024) made the disclosure of accuracy metrics, of which precision is one, an explicit requirement for AI systems it classifies as high-risk.


How FluxForce handles precision

FluxForce AI agents monitor precision-related patterns in real time, flag anomalies for analyst review, and generate evidence-backed decisions with full audit trails.
