Record Linkage: Definition and Use in Compliance
Record linkage is a data processing technique that identifies when records across separate databases represent the same real-world entity, such as a customer or corporate counterparty, to support identity verification and financial crime screening.
What is Record Linkage?
Record linkage (also called entity resolution) is the process of identifying records in two or more separate data sources that describe the same real-world entity: a person, business, or financial account. In a typical bank, customer data lives across multiple systems: a core banking platform, an onboarding system, a CRM, a Customer Due Diligence (CDD) repository, and a Transaction Monitoring engine. Without record linkage, "J. Smith," "John Smith," and "Jonathan M. Smith" in those systems are three separate entities. The compliance team has no reliable way to know which records belong together, and that ambiguity is where risk hides.
The technique operates in two modes. Deterministic linkage uses an exact shared identifier: a passport number, a tax identification number, a Legal Entity Identifier (LEI). When two records share that key, they link. This is fast and clean, but fails when the key is missing, inconsistently formatted, or deliberately withheld. Probabilistic linkage handles the rest. The system computes similarity scores across multiple fields (name, date of birth, address, phone number, nationality) and flags record pairs that exceed a defined confidence threshold as likely matches. A name with 92% Jaro-Winkler similarity plus a matching date of birth scores higher than a name match alone. Production systems typically run deterministic matching first, then fall through to probabilistic scoring for the remainder.
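The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production matcher: the field names (tax_id, name, dob), the weights, and the 0.85 threshold are hypothetical, and the standard library's difflib ratio stands in for Jaro-Winkler, which Python does not ship.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and threshold, chosen only for illustration.
WEIGHTS = {"name": 0.6, "dob": 0.4}
THRESHOLD = 0.85


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio (difflib used as a stand-in for Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(rec_a: dict, rec_b: dict) -> tuple[str, float]:
    """Classify a record pair as deterministic, probabilistic, or no_match."""
    # Stage 1: deterministic. A shared, non-empty identifier links the pair.
    id_a, id_b = rec_a.get("tax_id"), rec_b.get("tax_id")
    if id_a and id_b and id_a == id_b:
        return "deterministic", 1.0

    # Stage 2: probabilistic. Weighted sum of per-field similarity scores.
    score = WEIGHTS["name"] * name_similarity(rec_a["name"], rec_b["name"])
    if rec_a.get("dob") and rec_a.get("dob") == rec_b.get("dob"):
        score += WEIGHTS["dob"]
    label = "probabilistic" if score >= THRESHOLD else "no_match"
    return label, round(score, 3)
```

With these assumed weights, a close name match alone cannot reach the threshold; it needs the corroborating date of birth, which mirrors the point that a name plus a matching date of birth scores higher than a name alone.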
The output is a Golden Record: a single, consolidated profile built from all linked records. That profile is what analysts review when deciding whether to file a Suspicious Activity Report (SAR) or escalate a case for further investigation.
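As a rough sketch of how linked records might be consolidated, the function below merges them field by field. The survivorship rule used here (the most recently updated non-empty value wins) and the field names source and updated are assumptions for illustration; institutions define their own rules for which source is trusted for which field.

```python
def build_golden_record(linked: list[dict]) -> dict:
    """Merge linked records, preferring the most recently updated non-empty value."""
    golden: dict = {}
    # Process oldest first, so newer values overwrite older ones.
    for rec in sorted(linked, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if field in ("updated", "source"):
                continue
            if value not in (None, ""):
                golden[field] = value
    # Keep provenance so an analyst can trace the profile back to its sources.
    golden["sources"] = sorted({r["source"] for r in linked})
    return golden


records = [
    {"source": "crm", "updated": "2023-01-10", "name": "J. Smith", "phone": "555-0100"},
    {"source": "core", "updated": "2024-03-02", "name": "John Smith", "phone": ""},
]
profile = build_golden_record(records)
```

Here the newer core banking record supplies the name, while the phone number survives from the older CRM record because the newer value is empty.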
Howard Newcombe introduced probabilistic record linkage in a 1959 paper in Science. Fellegi and Sunter formalized the decision framework mathematically in a widely cited 1969 paper in the Journal of the American Statistical Association. Financial compliance adopted the technique in earnest through the 1990s and 2000s, driven by AML regulations that required consolidated customer views across accounts and products.
How is Record Linkage used in practice?
In a working compliance program, record linkage runs continuously, not just at onboarding.
At onboarding, the first check is whether the incoming customer already exists in the database under a different identifier or product line. A customer who holds a business account under a corporate name and then applies for a personal account shouldn't generate a second, unconnected Know Your Customer (KYC) file. Catching that link at the start keeps the risk profile consolidated and prevents two separate KYC reviews from reaching conflicting conclusions about the same person.
During ongoing monitoring, record linkage connects payment activity to the correct customer profile. International wire transfers carry names transliterated from Arabic, Cyrillic, or Chinese scripts, and competing transliteration conventions produce multiple valid spellings of the same name. "Ahmed Al-Rashidi" and "Ahmad Alrashidy" might be the same customer. Phonetic encoding and fuzzy scoring make that determination. Without that matching, the transaction history attached to each spelling is incomplete.
Sanctions Screening is where record linkage has the highest operational stakes. Screening against the OFAC SDN list or the UN Consolidated Sanctions List relies on Fuzzy Matching to catch near-matches. But that matching is only as good as the underlying entity profile. If a sanctioned party controls three companies in your database, and those companies aren't resolved to a single controlling entity, each company is screened in isolation and the common controller behind all three goes unnoticed.
Record linkage also runs at the case level. When an investigator maps the network of connections around a suspicious actor, they need accurate entity resolution as a foundation. A graph where "J. Smith" and "John Smith" appear as separate nodes produces a fragmented picture that misses connections a human analyst would immediately recognize. We've seen institutions run batch resolution projects before regulatory examinations and discover that thousands of "distinct" customers were actually duplicates, with parallel alerts running on the same underlying person.
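Batch resolution of this kind usually ends with a clustering step: pairwise links are grouped so that every record belonging to one entity collapses into a single node. One common way to do that is a union-find (connected components) pass, sketched below. The record identifiers are hypothetical, and the approach assumes the pairwise links are trustworthy, because one bad link will chain two unrelated clusters together.

```python
def resolve_entities(record_ids: list[str], links: list[tuple[str, str]]) -> list[set[str]]:
    """Group records into entities: linked pairs end up in the same cluster."""
    parent = {rid: rid for rid in record_ids}

    def find(x: str) -> str:
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: dict[str, set[str]] = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())


ids = ["crm:1", "core:7", "kyc:3", "crm:9"]
pair_links = [("crm:1", "core:7"), ("core:7", "kyc:3")]
entities = resolve_entities(ids, pair_links)
```

In this example the first three records resolve to one entity even though crm:1 and kyc:3 were never compared directly, and crm:9 remains its own entity.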
Record Linkage in regulatory context
No regulation uses the phrase "record linkage" directly. The obligation to maintain accurate, consolidated customer records runs through almost every major AML framework, and record linkage is the operational mechanism that makes compliance possible at scale.
FATF Recommendation 10 requires financial institutions to conduct Customer Due Diligence (CDD) and verify customer identity, and Recommendation 11 requires them to keep records sufficient to reconstruct individual transactions. The Financial Action Task Force (FATF) guidance on beneficial ownership, updated in 2023, goes further: institutions must identify the Ultimate Beneficial Owner (UBO) of legal entities, which in practice means maintaining records that link each UBO to the accounts and transactions they control. Without automated record linkage, that requirement is unworkable at any meaningful volume.
In the United States, the FinCEN CDD Rule (31 CFR 1010.230, with compliance required from May 2018) requires covered institutions to identify and verify the identity of beneficial owners of legal entity customers. A bank with 500 legal entity customers, each with up to four beneficial owners under the 25 percent ownership prong plus one individual under the control prong, needs to resolve as many as 2,500 individuals against its existing customer database. Without automated linkage, that's a manual process that takes weeks and produces inconsistent results across analysts.
The European Banking Authority's guidelines on internal governance (EBA/GL/2021/05) require institutions to maintain accurate, auditable data on their customers, with clear data lineage from source to decision. Record linkage is the mechanism that creates that lineage when data is distributed across siloed systems with different schemas and update cycles.
For Politically Exposed Person (PEP) screening and Adverse Media monitoring, high false positive rates are often a record linkage failure rather than a screening algorithm failure. When customer profiles are fragmented, screening engines can't confidently match or clear hits. Analysts end up reviewing the same underlying entity multiple times under different identities, burning review hours on what is functionally the same person.
Common challenges and how to address them
The hardest part of record linkage isn't the algorithm. It's the data going into it.
Name variation is the first obstacle. "Mohammed Al-Qahtani" enters a system through a dozen transliteration paths depending on the originating country's conventions. "María García" loses her diacritics somewhere in a data pipeline and becomes "Maria Garcia." These variants produce separate records that deterministic matching treats as distinct persons. The fix is name normalization before matching: strip diacritics, expand abbreviations, apply phonetic encoding (Soundex or Metaphone), then run probabilistic scoring on the normalized forms. This adds processing time. The accuracy gain is worth it.
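The normalization steps can be sketched with the Python standard library alone. This is a minimal illustration: abbreviation expansion is omitted, the Soundex shown is the common American variant written out by hand, and the function names are hypothetical. Soundex was designed for English surnames, so treat it as a starting point for transliterated names rather than a complete answer.

```python
import re
import unicodedata

# Soundex digit for each consonant; vowels, H, W, and Y carry no code.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def normalize_name(name: str) -> str:
    """Strip diacritics and punctuation, collapse whitespace, uppercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r"[^A-Za-z ]+", " ", no_marks)
    return " ".join(letters_only.upper().split())


def soundex(word: str) -> str:
    """Return the four-character American Soundex code for one word."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0]
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = code
    return (encoded + "000")[:4]


def phonetic_key(name: str) -> tuple[str, ...]:
    """Soundex code per token of the normalized name."""
    return tuple(soundex(token) for token in normalize_name(name).split())
```

Under this sketch, "María García" and "Maria Garcia" normalize to the same string, and "Mohammed" and "Muhammad" share the Soundex code M530, so both pairs reach the probabilistic scoring step as candidates instead of being treated as distinct persons.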
Address data is equally inconsistent. "123 Main Street," "123 Main St," and "123 Main St., Apt 4B" are the same address, but a naive string comparison treats them as three distinct ones. Address parsing against a reference dataset (USPS, Royal Mail Postcode Address File, or a commercial geocoder) standardizes these before they reach the matching engine.
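A toy version of that standardization step is shown below. It is only a sketch: the abbreviation table and unit pattern are hypothetical and far smaller than what a postal reference dataset or commercial geocoder provides. It splits the unit from the street line so that the street portion can be compared on its own.

```python
import re

# Hypothetical abbreviation table; a real deployment would rely on a postal
# reference dataset or geocoder rather than a hand-written mapping.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "apartment": "apt", "suite": "ste"}
UNIT_PATTERN = re.compile(r"\b(apt|ste|unit)\s+\w+$")


def normalize_address(raw: str) -> tuple[str, str]:
    """Return (street_line, unit) in a lowercase, punctuation-free form."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    line = " ".join(tokens)
    match = UNIT_PATTERN.search(line)
    if match:
        return line[: match.start()].strip(), match.group(0)
    return line, ""
```

All three example addresses produce the street line "123 main st"; the third additionally yields the unit "apt 4b", which a matching rule can weigh separately.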
Threshold calibration is where institutions struggle most. Set the match confidence threshold too high and you get false negatives: two records for the same entity stay separate. Set it too low and you get false positives: distinct entities merge. A threshold optimized for 95% precision might deliver only 65% recall on your actual data. The right threshold is specific to your data distribution, not a value borrowed from a benchmark in a different industry. Treat it like a model parameter: calibrate it against labeled examples from your own records, test it against holdout data, and review it annually.
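Calibration against labeled examples amounts to sweeping candidate thresholds and measuring precision and recall at each one. The sketch below assumes a list of (score, is_true_match) pairs labeled by analysts; the scores, labels, and candidate thresholds are invented for illustration.

```python
def precision_recall(scored_pairs: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Precision and recall when every pair scoring >= threshold is linked."""
    tp = sum(1 for score, is_match in scored_pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in scored_pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in scored_pairs if score < threshold and is_match)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def lowest_threshold_meeting(scored_pairs, target_precision, candidates):
    """Lowest candidate threshold whose precision meets the target, or None."""
    for threshold in sorted(candidates):
        precision, recall = precision_recall(scored_pairs, threshold)
        if precision >= target_precision:
            return threshold, precision, recall
    return None


# Invented labeled sample: (similarity score, analyst-confirmed match?)
labeled = [
    (0.98, True), (0.95, True), (0.91, True), (0.90, False),
    (0.86, True), (0.80, False), (0.75, True), (0.60, False),
]
choice = lowest_threshold_meeting(labeled, 0.95, [0.85, 0.91, 0.95])
```

On this toy sample the 0.85 threshold links one false pair (precision 0.8, recall 0.8), while 0.91 reaches full precision at the cost of recall dropping to 0.6. The same sweep should be rerun on holdout data before a threshold is adopted.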
The False Positive and False Negative trade-off doesn't disappear. It gets managed through calibration, ongoing monitoring, and human review of borderline cases.
Data governance is the systemic fix. Record linkage quality degrades without it. If source systems allow inconsistent data entry at the point of collection, the matching engine is fighting constant new fragmentation. Treating record linkage as a one-time remediation project rather than an ongoing control means data drifts back toward fragmentation within 12-18 months. The control needs to run continuously.
Related terms and concepts
Record linkage sits at the center of a cluster of terms that compliance teams, data engineers, and vendors use in overlapping ways. They're related but not interchangeable.
Entity Resolution is often used as a synonym for record linkage, as it is earlier in this entry. In its stricter sense it is the broader concept: determining that two references in data point to the same real-world entity, including disambiguation of records that seem similar but actually represent different people. On that reading, record linkage is one technique within entity resolution, focused specifically on matching across separate data sources with different schemas and formats.
Deduplication is a narrower operation: removing exact or near-exact copies of the same record within a single database. Banks run deduplication as a data quality step, but it doesn't address the harder problem of matching records across systems with different data models, different collection standards, and different update frequencies.
Fuzzy Matching is the underlying calculation mechanism. Algorithms like Jaro-Winkler distance, Levenshtein distance, and n-gram overlap measure how similar two strings are and produce a score. Record linkage is the decision framework that acts on that score: "above 0.92 on this field combination, link the records."
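To make the scoring step concrete, here is a small sketch of two of the measures named above, Levenshtein distance and n-gram overlap (using bigrams and Jaccard overlap), written from scratch so it runs without third-party packages. The function names and the conversion of edit distance into a 0..1 similarity are illustrative choices; Jaro-Winkler is left out because it is not in the standard library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale edit distance into a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def bigram_jaccard(a: str, b: str) -> float:
    """Share of distinct two-character substrings the strings have in common."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        # Strings too short to form a bigram: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A linkage rule then acts on these scores, for example by linking a pair only when the similarity clears the configured threshold on the relevant field combination.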
The Golden Record is the output. Once records are linked, the consolidated entity profile feeds into Know Your Business (KYB) processes for corporate customers, Enhanced Due Diligence (EDD) workflows for high-risk counterparties, and Case Management systems where investigators need a complete view of a subject.
Network Analysis depends on record linkage accuracy. A relationship graph where "J. Smith" and "John Smith" appear as separate nodes produces a fragmented picture. The connections visible to the analyst are a subset of the actual connections. Record linkage quality is, in this sense, a prerequisite for any graph-based financial crime detection. You can't map relationships between entities if you don't first know which records represent the same entity.
Where does the term come from?
The phrase "record linkage" was coined by Halbert L. Dunn in a 1946 article in the American Journal of Public Health, where he described assembling each person's vital records into a single "book of life." Howard B. Newcombe and colleagues turned the idea into a computational method in a 1959 paper in Science on the automatic linkage of vital records. Ivan Fellegi and Alan Sunter gave the field a formal mathematical foundation in 1969 with a probabilistic decision framework published in the Journal of the American Statistical Association. Financial compliance adopted the concept in the 1990s as regulators began requiring consolidated customer records under early AML frameworks. The alias "entity resolution" emerged from the computer science community in the 2000s and is now used interchangeably in compliance technology, particularly in discussions of graph databases and network analysis.
How FluxForce handles record linkage
FluxForce AI agents monitor record linkage-related patterns in real time, flag anomalies for analyst review, and generate evidence-backed decisions with full audit trails.