The Problem of Duplicate Patient Records
Duplicate records are confusing, waste time, and can be dangerous for patient care.
These duplicates need to be consolidated into single sources of truth so that patients can be uniquely identified.
Resolving duplicate records through patient matching
Patient matching identifies and links a patient's data within and across health systems to create a comprehensive health record.
Patient identification is the process of matching a person to their care and treatment and communicating that information across the continuum of care. From the first time a patient is seen to the end of their treatment, patients are identified using physical identification and technologies that improve accuracy.
Why is it important?
Patient ID matching issues contribute to EHR data integrity challenges.
- These difficulties hinder health information exchange and care coordination, which can lead to medical errors and patient deaths.
- Despite best practices for patient access and medical record management, duplicate records still occur.
Why is patient matching difficult?
Patient matching is difficult because each time a patient interacts with the health infrastructure, they must be individually identified. Each record should be mapped to the person it applies to so all preceding records can be combined.
The following can compound these challenges:
- Non-integrated systems: If different parts of the health system are built on different environments, there may not be a single identifier that works in all of them. So, any matching between systems will need keys that aren't tied to the platform itself.
- Lack of universal identifiers: When a country or health system has a unique universal identifier, like a National ID or a National Health ID, it gives all systems a key. Many countries don't have a single form of ID, so they need extra proof of identity.
- Inadequate data capture skills and incentives: To keep a reliable patient record system, a health site administrator must capture important information every time a patient enters the system. Health care administrators should:
  - Check whether the person is already in the system
  - Update records if their details have changed
  - Manually fix duplicate records
- Privacy concerns: Health records are high-risk information subject to privacy rules. The health system must:
  - Restrict access to key information
  - Limit unnecessary system integrations that create security risks
  - Record only necessary information
  - Reduce record-keeping time
- Changing data fields: Identification information can change. Names, contact information, geolocations, and jobs all change. The longer it has been since records were gathered and compared, the more likely it is that the underlying information has changed, reducing its relevance.
When this happens, there may no longer be enough relevant data for systems to make probabilistic matches.
Generic matching process diagram
Steps:
- Database A: This database includes patient demographic data from different source systems.
- Pre-processing: Data is pre-processed to improve the data quality and make sure that records captured in different systems can be compared to each other. This includes standardising date formats and coded value sets, transforming fields, removing special characters, etc.
- Blocking: Blocking is the process of identifying pairs that are plausible matches and discarding the others. This is necessary to reduce the total number of possible matches, because processing power is limited. Matching algorithms are used to do the comparisons. The blocking criteria and algorithms must be checked and adjusted continually to minimise the false negative rate while maximising efficiency. It is important to block carefully, using only identifiers that are likely to be stable across all sources.
- Comparisons: After blocking, all the remaining record pairs are compared. This process can be very resource intensive. Algorithms compare each identifier in the two records, and the outcomes of these comparisons are aggregated to generate a matching score for the record pair. The way the matching score is calculated depends on the patient matching approach used.
- Classification: Based on a pre-defined threshold that is obtained in the Evaluation step, the record pairs are classified as matches (very high probability that they are the same individual), non-matches (very unlikely to be the same individual), and potential matches (probability falls in a band of uncertainty).
- Clerical review: Record pairs classified as potential matches may be reviewed manually and reclassified as a match or non-match, or they may remain potential matches.
- Evaluation: The degree to which the classification algorithm correctly identifies matches is measured and optimised for the expected data. In particular, the system must minimise the rates of the errors that may occur. The critical measures are recall (the fraction of record pairs that truly refer to the same patient that the matching algorithm correctly classifies as matches) and precision (the fraction of the pairs the matching algorithm classifies as matches that truly refer to the same patient). These accuracy measurements are described in detail below.
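The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the sample records, the field names (`name`, `dob`), the choice of date of birth as the blocking key, the use of `difflib.SequenceMatcher` as the comparison algorithm, and the two classification thresholds are all assumptions made for the example.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from different source systems (illustrative only).
RECORDS = [
    {"id": 1, "name": "Anna  Smith", "dob": "1985-03-12"},
    {"id": 2, "name": "anna smyth", "dob": "12/03/1985"},
    {"id": 3, "name": "Peter Jones", "dob": "1985-03-12"},
    {"id": 4, "name": "Maria Lopez", "dob": "1990-07-01"},
]


def preprocess(record):
    """Pre-processing: standardise case, whitespace, characters and dates."""
    name = re.sub(r"[^a-z ]", "", record["name"].lower())
    name = " ".join(name.split())
    dob = record["dob"]
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", dob)
    if m:  # assumption: slash-separated dates are DD/MM/YYYY
        dob = f"{m.group(3)}-{m.group(2)}-{m.group(1)}"
    return {"id": record["id"], "name": name, "dob": dob}


def block(records, key="dob"):
    """Blocking: yield only the pairs that share the blocking key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    for group in groups.values():
        yield from combinations(group, 2)


def compare(a, b):
    """Comparison: score a pair between 0 and 1 using name similarity."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio()


def classify(score, upper=0.85, lower=0.60):
    """Classification: the two thresholds here are assumed values."""
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "potential match"  # would go to clerical review


def link(records):
    cleaned = [preprocess(r) for r in records]
    return {(a["id"], b["id"]): classify(compare(a, b)) for a, b in block(cleaned)}


if __name__ == "__main__":
    for pair, label in link(RECORDS).items():
        print(pair, label)
```

Because record 4 does not share a date of birth with any other record, it is never compared at all, which is the saving that blocking provides. The same property is also the risk: a wrong date of birth would hide a true match, which is why stable identifiers matter for blocking.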
The nature of the problem is such that, for a given algorithm, optimising recall tends to reduce precision, and vice versa. The balance between the two must be set according to the project requirements.
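The trade-off can be seen by computing precision and recall at several classification thresholds. The sketch below uses a small, made-up set of scored pairs with known true status. The scores, labels and threshold values are assumptions for illustration, not results from any real system.

```python
# Hypothetical scored pairs: (matching score, truly the same patient?).
SCORED_PAIRS = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, False), (0.55, True), (0.30, False), (0.10, False),
]


def precision_recall(scored_pairs, threshold):
    """Treat every pair scoring at or above the threshold as a declared match."""
    tp = sum(1 for s, truth in scored_pairs if s >= threshold and truth)
    fp = sum(1 for s, truth in scored_pairs if s >= threshold and not truth)
    fn = sum(1 for s, truth in scored_pairs if s < threshold and truth)
    # precision = TP / (TP + FP); recall = TP / (TP + FN)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


if __name__ == "__main__":
    for threshold in (0.85, 0.70, 0.50):
        p, r = precision_recall(SCORED_PAIRS, threshold)
        print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this sample data, lowering the threshold from 0.85 to 0.50 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.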
Patient matching approaches
The Comparison step can be implemented in many different ways. These are the main approaches:
- Deterministic matching: Rules-based or deterministic matching uses a one-to-one comparison approach for record linking. Deterministic matching compares exact character-by-character values in fields to determine whether two records should be linked. This strict algorithm does not account for spelling variations or differences in the use of abbreviations, such as St. versus Street. Many consumer-facing applications use deterministic rather than probabilistic matching, because they must return only the one record that belongs to that specific consumer. Provider-facing applications, on the other hand, can return numerous records, from which the provider then chooses the appropriate patient record. Even the first academic paper on record linkage recognized that strict deterministic matching is inadequate.
- Probabilistic matching: Also referred to as fuzzy matching, probabilistic matching uses statistical analysis to determine the overall likelihood that two records match. It is generally accepted to be a more sophisticated approach to record linking than deterministic matching. Probabilistic matching can take phonetic sounds and nicknames into account, as well as "edit distances" and "string similarities" when there is variation in fields. For example, probabilistic matching will recognize that records with the first names Tim and Timothy may belong to the same individual, and will assign a higher probability to similar-sounding names such as White and Wight. Finally, a probabilistic algorithm will often give a higher probability to a pair such as 123 Main Street and 163 Main Street, recognizing that the edit distance between the two is a single character.
- Machine learning: There are two broad approaches to patient matching with machine learning. The first is known as "supervised" machine learning. Supervision refers to the role of the human operator: in supervised machine learning, the operator classifies a modest number of pairs as matches or non-matches, using their own judgement. This is known as "training" the algorithm.
The second, unsupervised machine learning, does not require correctly classified data to train the system. The algorithm infers the optimal way to match solely from the distribution of field scores across possible pairs. An example is the expectation-maximization (EM) algorithm used to optimise the identifier weights of the Fellegi-Sunter model.
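To make the difference between the approaches concrete, the sketch below compares the 123 Main Street and 163 Main Street example under a deterministic rule and under a simple probabilistic score built from Fellegi-Sunter style agreement weights. The m- and u-probabilities are invented for illustration (in practice they would be estimated, for example with EM), and treating "within one edit" as field agreement is a simplifying assumption.

```python
import math


def deterministic_match(a, b, fields):
    """Link only if every field is identical, character by character."""
    return all(a[f] == b[f] for f in fields)


def edit_distance(s, t):
    """Levenshtein distance: fewest single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Assumed probabilities per field (hypothetical values):
#   m = P(field agrees | records are the same patient)
#   u = P(field agrees | records are different patients)
M_U = {
    "first_name": (0.90, 0.05),
    "last_name": (0.95, 0.01),
    "address": (0.80, 0.02),
}


def probabilistic_score(a, b, max_edits=1):
    """Sum log2 agreement or disagreement weights over the fields."""
    score = 0.0
    for field, (m, u) in M_U.items():
        agrees = edit_distance(a[field], b[field]) <= max_edits
        score += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return score


if __name__ == "__main__":
    a = {"first_name": "tim", "last_name": "white", "address": "123 main street"}
    b = {"first_name": "tim", "last_name": "white", "address": "163 main street"}
    print(deterministic_match(a, b, M_U))      # False: the addresses differ
    print(edit_distance(a["address"], b["address"]))  # 1
    print(probabilistic_score(a, b) > 0)       # True: evidence favours a match
```

A positive total weight is evidence for a match and a negative one is evidence against. Where the cut-offs between match, potential match and non-match sit is a separate decision made during evaluation.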