This is the first of a two-part series on tackling the ONC Patient Matching Challenge. The first part reviews background and high-level topics, while the second part covers the matching engine that was built (and how to use it). As of this blog post, Mindful Machines is in fifth place in the challenge using this approach. As part of this blog post series, all the code and data used to achieve our results have been open sourced.
The goal of the Patient Matching Challenge is to determine if two medical patients are the same person or if they are different people based on their demographic data. For example:
Are these patients two individuals or the same individual? As the example illustrates, even for someone we'd think of as the same person, there are many ways the raw data fields can differ due to things such as nicknames, data entry bugs, missing data, etc.
The fields we have access to are Last Name, First Name, Middle Name, Suffix, DOB, Gender, SSN, Address, Mother's Maiden Name, Medical Record Number, Phone, Email, Aliases.
The data set in question has 1 million patient records and was generated artificially, but based on a real data set. There are no labels in the data and no labeled training data set is provided. Part of the challenge, as a result, is to generate your own training data set.
The final submission for the contest is a set of patient pairs that we believe match, which will be evaluated against a golden data set using F1 score (precision and recall will also be provided back for submissions).
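Since the whole contest is scored on F1, it's worth spelling out: F1 is the harmonic mean of precision and recall, so it punishes an imbalance between the two. A minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a submission with precision 0.99 but recall 0.96 scores only about 0.975, so missed matches drag the score down just as hard as false positives.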
There are a number of challenges that come with working with this data.
The various fields are often missing outright.
| Field | Percentage of Values Missing |
| --- | --- |
There are a non-trivial number of cases of values such as 111-11-1111 for SSN or 123-456-7890 for phone number.
Given the above, one might think that after filtering out obviously invalid data, a field such as SSN could be used for a first pass. Unfortunately, there are far more invalid matches on SSN than valid ones. In other words, "fake" SSNs, and by extension false positive matches, are very common in the data.
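One way to screen out the obvious placeholder values before matching is a simple deny-list. The specific entries below are illustrative; in practice the list would be built by inspecting value frequencies in the data:

```python
# Illustrative deny-lists of placeholder values; the real sets would be
# derived from the most frequent suspicious values in the data.
INVALID_SSNS = {"111-11-1111", "123-45-6789", "000-00-0000"}
INVALID_PHONES = {"123-456-7890", "000-000-0000"}

def clean_ssn(ssn):
    """Return the SSN only if it looks usable for matching, else None."""
    if not ssn or ssn in INVALID_SSNS:
        return None
    return ssn
```

Nulling out such values (rather than dropping the record) keeps the rest of the record available for matching on other fields.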
Data Entry Errors
There are a number of data entry errors, such as digits being transposed or changed. For example, the SSN 145-23-4567 may come in as 145-23-4566.
In addition to the above, in many cases fields that look the same to a human have different raw values in them. For example, Jim and James. Addresses are especially plagued by this since there are so many ways to write the same actual address (East, E, Street, St, etc.).
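Differences like these can be reduced with a normalization pass before comparison. A minimal sketch, where the lookup tables are illustrative stand-ins for a fuller nickname dictionary and USPS-style street abbreviations:

```python
# Illustrative lookup tables; a production version would use much
# larger nickname and address-abbreviation dictionaries.
NICKNAMES = {"jim": "james", "bob": "robert", "bill": "william"}
STREET_ABBREVS = {"st": "street", "e": "east", "ave": "avenue"}

def normalize_name(name):
    """Lowercase a name and expand known nicknames."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def normalize_address(address):
    """Lowercase an address and expand common abbreviations."""
    tokens = address.strip().lower().replace(".", "").split()
    return " ".join(STREET_ABBREVS.get(t, t) for t in tokens)
```

After normalization, "Jim" and "James" compare as identical, and "123 E Main St" and "123 East Main Street" collapse to the same string.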
Changes Over Time
This data set is not from a single point in time but from records that would come in over a long period of time. As a result people's phone numbers, addresses, genders and even names can change over time.
Medical Record Number
The Medical Record Number (MRN) field is never identical across two patients. However, very often the values are close (+/- 5000) for the same individual.
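That proximity can be checked directly. A minimal sketch, with the window treated as a tunable parameter rather than a hard rule:

```python
MRN_WINDOW = 5000  # tolerance observed in the data; treat as tunable

def mrn_close(mrn_a, mrn_b, window=MRN_WINDOW):
    """True when two medical record numbers fall within the window."""
    return abs(int(mrn_a) - int(mrn_b)) <= window
```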
The basic approach is broken down into a few linked components. First we generate training data by manually labeling potential matches. Then we use this training data to generate a supervised machine learning model. Finally we use this model on all the data (labeled and not) to generate potential matches.
The naive approach would be to first generate all possible combinations of patient records and then somehow score how well each pair matches. However, given that we have 1 million patient records, the total number of combinations would be quite large (roughly 5 × 10^11 pairs). As a result, we instead first block the data into smaller pieces and generate all possible combinations within each piece. For example, we might block by SSN and then, for all records that share the same SSN, generate all pairwise combinations for scoring.
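The blocking step itself is simple: group records by a key and emit within-group pairs. A minimal sketch (the dict-based record shape and field names are assumptions for illustration):

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Group records by a blocking key and yield within-block pairs.

    `records` is an iterable of dicts; `key` extracts the blocking
    value. Records with a missing blocking value are skipped.
    """
    blocks = defaultdict(list)
    for record in records:
        value = key(record)
        if value:
            blocks[value].append(record)
    for block in blocks.values():
        for a, b in combinations(block, 2):
            yield a, b

# e.g. SSN blocking: block_pairs(records, key=lambda r: r.get("ssn"))
```

Each blocker (SSN, phone, name prefix, etc.) is just a different `key` function, and their outputs are unioned before scoring.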
Some additional considerations when blocking, especially given the next part, are to filter out invalid data and potentially even to filter out pairs which are already covered by a different blocker.
Now that we have used blocking to generate potential matches, the next step is to label those matches to generate a training set of data. This is a laborious manual process, which is why you want the blockers to filter out as many unnecessary or plainly invalid pairs as possible.
Now that we have a training set of data the next step is to build a model using it. We could have used a heuristic or manual approach however a model is more flexible and allows the approach to be extended to arbitrary data more easily.
The feature set for the model consists mainly of Levenshtein distances between corresponding fields of the two records under comparison. In other cases, Hamming distance was used (for fixed-width fields such as SSN), set comparison (phone numbers), or numeric differences (MRN). Given the strong interaction effects between fields and the relative lack of data, the best model turned out to be a relatively small Random Forest.
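To make the feature set concrete, here is a stdlib-only sketch of the three distance types feeding a feature dict. The field names are illustrative; the real feature set covers every demographic field in the challenge data:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def hamming(a, b):
    """Positional mismatches for fixed-width fields such as SSN."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def features(rec_a, rec_b):
    """Pairwise comparison features for one candidate match."""
    return {
        "first_name_dist": levenshtein(rec_a["first"], rec_b["first"]),
        "ssn_dist": hamming(rec_a["ssn"], rec_b["ssn"]),
        "mrn_diff": abs(rec_a["mrn"] - rec_b["mrn"]),
        "phone_overlap": len(set(rec_a["phones"]) & set(rec_b["phones"])),
    }
```

These feature vectors are what the Random Forest consumes; the strong interactions between fields (e.g. a large name distance matters less when SSN matches exactly) are exactly what tree ensembles capture well.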
The labeling process, being manual, is fraught with potential errors, and sometimes the model is better than we are. As a result, we want to review any mismatches the model produces to ensure the issue lies with the model rather than with the labels we put on the data.
Finally we generate all potential matches from our blockers, score them using our model and submit the results.
The best F1 score on a submission for Mindful Machines is 0.978, with a precision of 0.991. The offline model performance is an F1 of 0.995 and a precision of 0.994. That implies there are two causes for the lower submission score:
- The training labels are overly pessimistic and mark actual matches as not matches
- The blockers we have are not sufficient and we're missing some subset of potential matches
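Both explanations point at recall, which we can back out from the reported numbers by inverting the F1 formula (F1 = 2PR/(P + R), so R = F1·P/(2P − F1)):

```python
def implied_recall(f1, precision):
    """Solve F1 = 2PR/(P+R) for recall given F1 and precision."""
    return f1 * precision / (2 * precision - f1)

# Submission: F1 0.978 at precision 0.991 implies recall of about 0.965,
# noticeably below precision, consistent with missed matches rather
# than false positives.
```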
Thanks to our very own Zoë Frances Weil for her invaluable insight and contributions.