Tool Configuration Process
Once a tool has been chosen for the patient matching project, it needs to be configured to identify matches optimally for the data characteristics of the project.
In this section, we present a structured process to test and evaluate the tool until the optimal configuration is found. This process can be followed when the project is first set up, as well as periodically to ensure the ongoing quality of the matches.
1. Data quality analysis
For this first step, the captured data needs to be analysed to identify:
- which identifiers to use from all the captured data
- the characteristics and quality of those identifiers, and the types of errors introduced when the data is captured
- what pre-processing is required before records from different data sources can be compared
For more information view the Data Quality Analysis section in the Patient Matching Algorithm Alternatives for varying Identifier Consistency DISI artefact.
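As an illustration, a data quality analysis often begins with simple per-identifier profiling. The sketch below is a minimal example, not the DISI artefact's method; the field names and sample records are invented for illustration. It computes the missing-value rate for each identifier, which feeds directly into the error probabilities used later for the synthetic dataset:

```python
# Minimal identifier-profiling sketch. Field names and records are
# illustrative assumptions, not taken from any real dataset.
records = [
    {"given_name": "Amina", "surname": "Okoro", "city": "Lusaka"},
    {"given_name": "amina", "surname": "Okoro", "city": None},
    {"given_name": "Joseph", "surname": None, "city": "Ndola"},
    {"given_name": "Josef", "surname": "Banda", "city": "Ndola"},
]

def missing_rate(records, field):
    """Fraction of records where the identifier is absent or empty."""
    missing = sum(1 for r in records if not r.get(field))
    return missing / len(records)

for field in ("given_name", "surname", "city"):
    print(f"{field}: {missing_rate(records, field):.0%} missing")
```

In practice the same pass would also collect value frequency distributions and estimates of typing-error rates, since those are the other inputs the synthetic dataset generation step needs.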
2. Generate a test dataset
This information is then used to generate a test dataset. The test dataset can be based on real data or synthetic data and will include information about which record pairs are actual matches, and which are not. This is also called "ground truth".
Real data test dataset
If real data is used, all the possible record pairs need to be reviewed to manually assess whether they are matches or not. To assess a large enough sample, a team is often needed. A set of rules should be established for the team members to follow, and any ambiguous case should be resolved by a group or executive decision.
Synthetic test dataset
If a synthetic dataset is used, the ground truth is part of the generation process. The data characteristics and errors found during the data quality analysis need to be mirrored in the synthetic dataset. The following information can be obtained from the analysis to generate a synthetic dataset representative of the context:
- Country-specific frequency distributions for names, surnames, cities, etc.
- Maximum number of modifications per record and per identifier
- Maximum number of duplicates per record
- Probability of a specific error per identifier e.g. percentage of missing values in city, percentage of typing errors in given name, etc.
For more information on how to generate a synthetic dataset, view the Dataset Generation section in the Patient Matching Algorithm Alternatives for varying Identifier Consistency DISI artefact. To generate your own dataset, view the Dataset Generator Notebook created by Jembi.
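To make the generation idea concrete, the sketch below shows one way a duplicate record with injected errors can be derived from an original, so that the pair's match status is known by construction. It is a simplified assumption-laden example, not the logic of the Jembi Dataset Generator Notebook; the field names and probabilities are invented:

```python
# Sketch: derive a "duplicate" record by injecting errors at rates taken
# from the data quality analysis. All names and rates are illustrative.
import random

def corrupt(value, typo_prob, missing_prob, rng):
    """Apply at most one modification: drop the value, swap two adjacent
    characters, or leave it untouched."""
    r = rng.random()
    if r < missing_prob:
        return None  # simulate a missing value
    if r < missing_prob + typo_prob and len(value) > 1:
        i = rng.randrange(len(value) - 1)
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    return value

rng = random.Random(42)  # fixed seed for a reproducible example
original = {"given_name": "Amina", "city": "Lusaka"}
duplicate = {
    "given_name": corrupt(original["given_name"], typo_prob=0.2, missing_prob=0.05, rng=rng),
    "city": corrupt(original["city"], typo_prob=0.1, missing_prob=0.3, rng=rng),
}
# Ground truth is recorded as part of generation: (original, duplicate)
# is known to be a match.
```

A real generator would additionally enforce the maximum number of modifications per record and duplicates per record identified during the analysis.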
3. Test and evaluate using the Evaluation tool
Once a test dataset is generated, it is used as the input to the patient matching tool to generate matching scores for each record pair.
Depending on the size of the dataset, blocking might be required to reduce the number of record pairs to compare, so that the tool has sufficient processing resources to calculate all the matching scores.
For more information and ways to test different blocking strategies, view the Blocking Notebook created by Jembi.
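The idea behind blocking can be sketched in a few lines: records are grouped by a blocking key, and matching scores are only computed for pairs within the same block. This is a minimal illustration with invented records, not the strategy from the Jembi Blocking Notebook:

```python
# Minimal blocking sketch: only record pairs sharing a blocking key are
# compared. Records and field names are illustrative assumptions.
from collections import defaultdict
from itertools import combinations

records = [
    (0, {"surname": "Okoro", "city": "Lusaka"}),
    (1, {"surname": "Okoro", "city": "Lusaka"}),
    (2, {"surname": "Banda", "city": "Ndola"}),
    (3, {"surname": "Banda", "city": "Ndola"}),
    (4, {"surname": "Phiri", "city": "Kitwe"}),
]

def block(records, key_fields):
    """Group record ids by the value of the blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records:
        key = tuple(rec[f] for f in key_fields)
        blocks[key].append(rid)
    return blocks

blocks = block(records, ("surname",))
candidate_pairs = [p for ids in blocks.values() for p in combinations(ids, 2)]
# A full comparison of 5 records would need 10 pairs; blocking on surname
# leaves only 2 candidate pairs.
print(candidate_pairs)
```

The trade-off is that a poorly chosen blocking key can place true matches in different blocks, so they are never compared; this is one reason different blocking strategies need to be tested.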
The tool then classifies the pairs as matches and non-matches. This classification is compared against the “ground truth”, and accuracy metrics such as precision, recall and F-score are calculated.
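These metrics are standard: precision is the fraction of predicted matches that are true matches, recall is the fraction of true matches that were found, and the F-score is their harmonic mean. A minimal sketch of the calculation, with invented pair sets:

```python
# Compute precision, recall and F-score for a set of predicted match
# pairs against the ground truth. The pair sets below are illustrative.
def evaluate(predicted, truth):
    """Return (precision, recall, f_score) for sets of record-id pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

truth = {(1, 2), (3, 4), (5, 6)}       # known matches from the test dataset
predicted = {(1, 2), (3, 4), (7, 8)}   # pairs the tool classified as matches
p, r, f = evaluate(predicted, truth)
```

Here the tool found 2 of the 3 true matches and made 1 false positive, so precision, recall and F-score are all 2/3.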
Different sets of configurations are tested and evaluated until the optimal configuration is obtained. Each tool will have its own set of parameters to be configured.
To test this step, view the Evaluation tool Fastlink R Notebook created by Jembi.
4. Test and evaluate using the Production tool
Once an optimal configuration is obtained based on the test dataset, the production tool is configured and run with live data to do the final evaluation. If live data is not available, a different test dataset can be used.
To do this, a manual review is needed. A random sample is drawn from the list of all possible matches identified during the blocking phase. One or more reviewers then determine whether each possible match is a match or a non-match. Finally, the decisions documented during the manual review process are used as the “ground truth” against which the decisions made by the tool are compared, allowing for the calculation of recall and precision rates.
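Drawing the review sample can be as simple as the sketch below; the candidate pairs, sample size and seed are invented for illustration and would come from the actual blocking output and the review capacity of the team:

```python
# Sketch: draw a random sample of candidate pairs for manual review.
# Candidate pairs, sample size and seed are illustrative assumptions.
import random

# Hypothetical candidate pairs produced by the blocking phase.
candidate_pairs = [(i, i + 100) for i in range(500)]

rng = random.Random(7)  # fixed seed so the sample is reproducible
sample = rng.sample(candidate_pairs, k=50)

# Reviewers label each sampled pair; the labels then serve as the
# ground truth for computing precision and recall on live data.
reviewed = {pair: None for pair in sample}  # None until a reviewer decides
```

Fixing the random seed makes the sample reproducible, which helps when the same sample must be split across several reviewers or re-examined after an ambiguous case is escalated.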
If the evaluation is successful and the run has good recall and precision rates, this configuration is selected as optimal. If not, Step 1 is run again, analysing the types of errors present in the fields where false negative and false positive errors occur. Step 2 is then run again, updating the test dataset by adding examples of the data characteristics that the tool failed to identify as matches. This way, the newly found errors can be taken into account.
This process continues until the optimal configuration is found.