Understanding the Core Problem

Overview

This section describes what patient matching is, why it is important, and why it is difficult to implement.

What is patient matching?

Patient matching is the process of identifying and linking a patient's data within and across different health systems to obtain a comprehensive view of that patient's healthcare record.

Patient identification has been defined as “the process of correctly matching a patient to intended care and treatment and then also communicating that information about the patient’s identity accurately and reliably throughout the continuum of care.” Patient identification occurs from the first patient encounter throughout the care continuum and it includes the physical identification of a patient as well as technologies to enhance the accuracy of patient identification.

Why is it important?

Patient identification matching problems are a major contributor to data integrity issues within electronic health records. These issues impede the improvement of healthcare quality through health information exchange and care coordination and contribute to deaths resulting from medical errors. Despite best practices in the area of patient access and medical record management to avoid duplicating patient records, duplicate records continue to be a significant problem in healthcare.

Why is patient matching difficult to implement successfully?

Patient matching is challenging in general because considerable effort is required to uniquely identify an individual every time they interact with the health infrastructure. Ideally, each record obtained should be mapped to the specific individual to whom it applies so that all prior records can be integrated. These challenges can be complicated by the situations described below:

Nonintegrated systems

If different parts of the health system are built on different platforms, there may be no unique identifier that applies across all of them. As such, all matching between systems requires unique keys that are not specific to any one platform.

Lack of universal identifiers

Where the health system or the country has a universal unique identifier (e.g., National ID, National Health ID), this provides a useful key for all systems. However, many countries do not have a single universal ID, and as such other identifiers are needed.

Lack of adequate skill and incentive when capturing data

Maintaining a reliable patient record system requires that individual administrators at the health sites make the effort to consistently record the critical information each time the patient enters the health system. Ideally, the health care administrator would:

confirm whether the individual already exists on the system
update and correct the central records should they have changed, and
identify any problems with duplicate records which can ultimately be fixed manually.

This requires both intensive effort and skill and may not be a high priority in a busy and fast-moving health environment.

Privacy concerns

Where privacy regulations are in place, health records are usually considered extremely high-risk information. This creates a burden on the health system to:

limit access to critical information to as few users as possible;
limit the unnecessary integration of systems where this may create a security risk;
minimize the information recorded to only what is absolutely necessary, and
reduce the time for which records are stored.

This may reduce the ability of systems to make use of probabilistic matches as the relevant information may not be available for matching.

Changing data fields

The information that assists in identifying a patient is subject to change. Names, contact information, geolocations, and employment change over time. The greater the distance in time between when records are captured and when they are used for comparison, the more likely it is that the underlying information has changed, which makes these records less useful for matching.

How to improve patient identification?

There are a variety of methods and approaches to addressing patient identification:

Operational processes

Implementation of standard operational processes and procedures in all facilities across the country. Some examples are:

Upon admission, use at least 2 identifiers to verify a patient’s identity. Consider including photographs to be taken at registration and incorporating these into patient medical records so that they are visible to all clinicians across the country, etc.
Implement standard processes for how staff record certain demographic data attributes.
Implement guidelines for identifying patients without identification or with the same name.
Adopt a protocol for dealing with patients who cannot immediately identify themselves, e.g., emergency room cases, patients who are comatose, etc.; consider the use of non-verbal approaches such as biometrics, etc.

Technological approaches

Use of unique person identifiers (minimum of 2) for matching. This increases the likelihood of accurately matching patients to their health data.
Use of automated identification procedures at the source, e.g., voice and other biometric systems. This may include app-based and online booking platforms.
Use of algorithmic approaches that make use of demographic characteristics to match patients to their health data. This involves a unique identifier coupled with a limited number of demographic identifiers. It is important that demographic identifiers are standardized so that the accuracy of the matching algorithm can be optimized.

The generic patient matching process

The diagram below shows the generic process followed to do patient matching:

Steps:

Database A: This database includes patient demographic data from different source systems.
Pre-processing: Data is pre-processed to improve the data quality and make sure that records captured in different systems can be compared to each other. This includes standardising date formats and coded value sets, transforming fields, removing special characters, etc.
Blocking: Blocking is the process of identifying pairs that are plausible matches and discarding the others. This is necessary to reduce the total number of possible matches due to processing power limitations. Matching algorithms are used to do the comparisons. Blocking is an iterative process: the blocking criteria and algorithms are repeatedly checked and adjusted to minimise the false negative rate whilst maximising efficiency. It is important to block carefully, only using identifiers that are likely to be quite stable across all sources.
Comparisons: After blocking, all the remaining records are compared against each other. This process can be very resource intensive. Algorithms are used to compare each record identifier and the outcome of each comparison is aggregated to generate a record pair matching score. The way the matching score is calculated will depend on the patient matching approach used.
Classification: Based on pre-defined thresholds that are obtained in the Evaluation step, the record pairs are classified as matches (very high probability that they refer to the same individual), non-matches (very unlikely to refer to the same individual), and potential matches (probability falls in a band of uncertainty).
Clerical review: Record pairs classified as potential matches may be reviewed manually and reclassified as a match or a non-match, or they may remain potential matches.
Evaluation: The degree to which the classification algorithm correctly identifies matches is measured and optimised for the expected data. In particular, the system must minimise the rates of the potential errors that may occur. The critical measures considered are recall (the capability of the matching algorithm to correctly classify two records that refer to the same patient as a match) and precision (the fraction of pairs classified as matches by the algorithm that are true matches). These accuracy measurements are described in detail below.
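
The pre-processing, blocking, comparison, and classification steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production pipeline: the records and field names are hypothetical, date of birth is an arbitrary choice of blocking key, Python's difflib.SequenceMatcher stands in for a real comparison algorithm, and the 0.6 and 0.9 thresholds are invented for the example.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; the field names are assumptions for illustration.
RECORDS = [
    {"id": "A1", "first": "Timothy", "last": "O'Neil", "dob": "1980-02-01"},
    {"id": "B7", "first": "timothy ", "last": "ONeil", "dob": "1980-02-01"},
    {"id": "C3", "first": "Tim", "last": "Oneil", "dob": "1980-02-01"},
    {"id": "D4", "first": "Grace", "last": "Mbeki", "dob": "1980-02-01"},
    {"id": "E5", "first": "Timothy", "last": "O'Neil", "dob": "1980-02-10"},
]

def preprocess(rec):
    """Lower-case the name fields and drop everything except a-z.

    Simplification: this would also strip accented letters, which a real
    system would need to handle more carefully.
    """
    clean = dict(rec)
    for field in ("first", "last"):
        clean[field] = re.sub(r"[^a-z]", "", rec[field].lower())
    return clean

def block(records):
    """Yield candidate pairs that share the blocking key (date of birth)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["dob"]].append(rec)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)

def compare(a, b):
    """Aggregate per-field similarities (0..1) into one pair score."""
    sims = [SequenceMatcher(None, a[f], b[f]).ratio() for f in ("first", "last")]
    return sum(sims) / len(sims)

def classify(score, lower=0.6, upper=0.9):
    """Map a score to one of the three classes using two thresholds."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "potential match"
    return "non-match"

cleaned = [preprocess(r) for r in RECORDS]
results = {(a["id"], b["id"]): classify(compare(a, b)) for a, b in block(cleaned)}

for ids, label in results.items():
    print(ids, label)
```

In this sketch blocking reduces ten possible pairs to six. Record E5 is never compared with A1 because its date of birth differs, which shows how an unstable blocking key can introduce false negatives.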

The nature of the problem is such that, for a given algorithm, improving recall generally reduces precision, and vice versa. The balance between the two must be chosen according to the project requirements.
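
The trade-off can be seen by moving a classification threshold over a set of scored pairs. The sketch below uses invented scores and labels and, as a simplification, a single threshold rather than a band of uncertainty. Treating precision as 1.0 when nothing is classified as a match is a convention chosen for this example.

```python
# Hypothetical scored pairs: (matching score, is the pair a true match?).
SCORED = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, False), (0.55, True), (0.30, False), (0.10, False),
]

def precision_recall(threshold):
    """Classify pairs with score >= threshold as matches and score the result."""
    tp = sum(1 for s, truth in SCORED if s >= threshold and truth)
    fp = sum(1 for s, truth in SCORED if s >= threshold and not truth)
    fn = sum(1 for s, truth in SCORED if s < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.5, 0.7, 0.85):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

With this data, raising the threshold from 0.5 to 0.85 lifts precision from about 0.67 to 1.0 while recall falls from 1.0 to 0.5.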

Patient matching approaches

The Comparison step can be implemented in many different ways. These are the main approaches:

Deterministic matching: Rules-based or deterministic matching uses a one-to-one comparison approach for record linking. Deterministic matching compares exact character-by-character values in fields to determine whether two records should be linked. This strict algorithm does not account for spelling variations or differences in the use of abbreviations, such as St. versus Street. Many consumer-facing applications use deterministic rather than probabilistic matching, because they must return exactly one record for the specific consumer. Provider-facing applications, on the other hand, can return several records and let the provider choose the appropriate patient record. It should be noted that even the first academic paper on record linkage recognized that strict deterministic matching is inadequate.
Probabilistic matching: Also referred to as fuzzy matching, probabilistic matching uses statistical analysis to determine the overall likelihood or probability that two records match. It is generally accepted that probabilistic matching is a more sophisticated approach to record linking than deterministic matching. Probabilistic matching can take phonetic sounds and nicknames into account as well as “edit distances” and “string similarities” when there is variation in fields. For example, probabilistic matching will recognize that records with a first name of Tim and Timothy may belong to the same individual, and will assign a higher probability to similar-sounding names like White and Wight. Finally, the probabilistic algorithm will often give a higher probability to 123 Main Street vs. 163 Main Street, recognizing that the edit distance between the two is a single character.
Machine learning: There are two broad approaches to patient matching with machine learning. The first is known as “supervised” machine learning. Supervision refers to the role of the human operator: in supervised machine learning, the operator classifies a modest number of pairs as matches or not, using his or her judgement. This is known as “training” the algorithm.

Unsupervised machine learning does not require correctly classified data to train the system. The algorithm is able to infer the optimal way to match solely from the distribution of field scores across possible pairs. An example of this is the expectation-maximization (EM) algorithm used to optimise the identifier weights of the Fellegi-Sunter model.
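
To make the difference between the first two approaches concrete, the sketch below compares an exact deterministic rule with a Fellegi-Sunter style score, in which each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees. The m- and u-probabilities, the field names, and the 0.85 similarity cut-off are hypothetical values chosen for illustration. In practice the weights would be estimated from data, for example with EM. Only string similarity is used here, so phonetic and nickname handling are not covered.

```python
import math
from difflib import SequenceMatcher

FIELDS = ("first", "last", "street")

# Hypothetical probabilities per field (assumed, not estimated from data):
# m = P(field agrees | same patient), u = P(field agrees | different patients).
M_U = {"first": (0.90, 0.05), "last": (0.95, 0.02), "street": (0.80, 0.01)}

def deterministic_match(a, b):
    """Link only if every field is identical character by character."""
    return all(a[f] == b[f] for f in FIELDS)

def agrees(x, y, min_similarity=0.85):
    """Treat two values as agreeing if they are similar enough as strings."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio() >= min_similarity

def probabilistic_score(a, b):
    """Sum the agreement or disagreement weight of each field."""
    score = 0.0
    for f in FIELDS:
        m, u = M_U[f]
        if agrees(a[f], b[f]):
            score += math.log2(m / u)              # agreement weight (positive)
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return score

rec_a = {"first": "Timothy", "last": "White", "street": "123 Main Street"}
rec_b = {"first": "Timothy", "last": "White", "street": "163 Main Street"}
rec_c = {"first": "Grace", "last": "Mbeki", "street": "9 Hill Road"}

print(deterministic_match(rec_a, rec_b))        # exact rule rejects the pair
print(probabilistic_score(rec_a, rec_b) > 0)    # weighted score favours a match
print(probabilistic_score(rec_a, rec_c) < 0)    # unrelated records score low
```

The one-character difference in the street number is enough to defeat the deterministic rule, while the weighted score still favours a match.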

Measuring patient matching accuracy

Key measuring concepts

The following table helps in visualizing the measures that are described below. The rows describe the classification of a record pair in reality as being either a match or a nonmatch, whereas the columns indicate the classification decision of the matching algorithm.

True Positive: Refers to the correct classification by the matching algorithm of two patient records as a match when both records refer to the same person.
False Positive: Also referred to as a Type I error, refers to a classification error by the matching algorithm where a record pair is marked as a match but in reality, the two records refer to two distinct patients.
False Negative: Also referred to as a Type II error, refers to a classification error by the matching algorithm where the two patient records are marked as referring to two distinct patients but in reality, the two records refer to the same person.
True Negative: Refers to the correct classification by the matching algorithm of two patient records as a non-match when the two records refer to two different patients.
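
As a rough sketch, the four outcomes can be counted by checking every possible record pair against both the known truth and the algorithm's decision. The record identifiers and the two sets of pairs below are made up for the example.

```python
from itertools import combinations

def pair(a, b):
    """Order-independent key for a record pair."""
    return frozenset((a, b))

# Hypothetical data: four records, the true matches, and the algorithm's output.
record_ids = ["r1", "r2", "r3", "r4"]
truth = {pair("r1", "r2"), pair("r3", "r4")}
predicted = {pair("r1", "r2"), pair("r1", "r3")}

def confusion_counts(record_ids, truth, predicted):
    """Count TP, FP, FN and TN over all possible record pairs."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, b in combinations(record_ids, 2):
        is_match = pair(a, b) in truth        # classification in reality
        said_match = pair(a, b) in predicted  # classification by the algorithm
        if is_match and said_match:
            counts["TP"] += 1
        elif said_match:
            counts["FP"] += 1
        elif is_match:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

counts = confusion_counts(record_ids, truth, predicted)
print(counts)
```

Because most record pairs in a real dataset are non-matches, the true negative count usually dwarfs the other three.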

The following metrics build on top of the basic metrics described previously. They are commonly used to evaluate the performance of matching algorithms and configuration changes to those algorithms.

Recall: Also referred to as sensitivity, this is a measure of the capability of the matching algorithm to correctly classify two records that refer to the same patient as true matches. It is calculated as the number of true positives divided by the sum of true positives and false negatives.

Recall = \frac{TP}{TP + FN}

Precision: Also called the positive predictive value, this is the fraction of pairs classified as matches by the matching algorithm that are true matches. It can be used along with recall to jointly evaluate the performance of a matching algorithm. It is calculated as the number of true positives divided by the sum of true positives and false positives.

Precision = \frac{TP}{TP + FP}

It is important to use both metrics in the analysis, because they evaluate the performance of the matching algorithm from different and conflicting viewpoints. Together they give a more balanced assessment than either metric alone.

F-score: A measure used to represent both precision and recall in a single value. It is calculated as the harmonic mean of precision and recall.

F-score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
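
The following sketch shows why the harmonic mean is used: unlike a simple average, it stays low when either precision or recall is low. The input values are invented, and returning 0.0 when both inputs are zero is a convention assumed for this example.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f_score(0.8, 0.8)    # both metrics reasonably high
lopsided = f_score(1.0, 0.1)    # perfect precision, very poor recall
simple_average = (1.0 + 0.1) / 2

print(f"balanced F-score: {balanced:.2f}")
print(f"lopsided F-score: {lopsided:.2f}")
print(f"simple average of lopsided case: {simple_average:.2f}")
```

An algorithm that finds only a tenth of the true matches receives an F-score of about 0.18 here, even though a simple average of the two metrics would be 0.55.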

Applying these concepts

To be able to calculate these values, the actual matches and non-matches need to be known. This information is also called "ground truth". There are two ways of obtaining these values:

by using a dataset generator tool that generates original records and duplicates with the ground truth known
by manually reviewing real-life data record-pairs one by one and assigning a match or non-match status
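
The first option can be illustrated with a toy generator that copies some original records, introduces a typing error, and remembers which pairs are true matches. This is only a sketch of the idea, not a description of any particular generator tool: the function names, the identifier scheme, and the single kind of error (swapping two adjacent characters) are assumptions.

```python
import random

def make_typo(value, rng):
    """Swap two adjacent characters to imitate a typing error.

    Note: if the two characters are identical the value is unchanged.
    """
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def generate(originals, duplicate_rate, seed=0):
    """Return (records, truth) where truth holds the known duplicate pairs."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    records, truth = [], set()
    for n, person in enumerate(originals):
        orig_id = f"org-{n}"
        records.append({"id": orig_id, **person})
        if rng.random() < duplicate_rate:
            dup_id = f"dup-{n}"
            field = rng.choice(sorted(person))  # field that receives the error
            duplicate = {"id": dup_id, **person}
            duplicate[field] = make_typo(person[field], rng)
            records.append(duplicate)
            truth.add((orig_id, dup_id))
    return records, truth

# Hypothetical original records.
originals = [
    {"first": "Timothy", "last": "White"},
    {"first": "Grace", "last": "Mbeki"},
]
records, truth = generate(originals, duplicate_rate=1.0)
print(len(records), sorted(truth))
```

The returned set of pairs is the ground truth against which the output of a matching algorithm can be scored.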
