ONC Patient Matching Challenge: Part 2
This is the second of a two-part series on tackling the ONC Patient Matching Challenge. The first part covered the background and high-level approach; this part covers the matching engine we built and how to use it. As of this post, Mindful Machines is in fifth place in the challenge using this approach. As part of this series, all the code and data used to achieve our results have been open sourced.
Technology
The technologies used to build the matching engine are:
- Scala
- Play
- Spark
- PostgreSQL
- Docker
This is partially a proof of concept of using Scala for end-to-end data processing and machine learning.
Installing
Follow the directions on the project's GitHub page. Once you go to localhost:9000 you should see this:
Running Blockers
The code comes with a decent set of initial blockers and the first step is to run them. This caches the blockers to disk and populates sample rows for labeling in the database. You can do this by clicking the blockers link. This will give you a page like this:
Select all the blockers and click run. Generating the blockers takes around an hour, so feel free to get some lunch.
Labeling
The code comes with a set of labeled data (called Initial), so this step is optional.
The next step is to label potential matches as either a match or not a match. To do this, go to the main page at http://localhost:9000/, select a blocker from the drop-down list, and click comparison.
This will give you a page like this. Enter a name for this batch of labels and keep going. Every time you click Match or No Match, another random potential match is shown for you to label.
Building a Model
The next step is to build a model. From the home page, go to models and then click Create at the bottom. This will give you a page like this.
For this test run, fill out the fields in this manner; these settings give the best performance we've seen so far:
Re-labeling
This step is optional. However, if you want to check whether any labels are mistakes, you can use the model's predictions as a benchmark. Go to models from the home page and click update for the model you wish to edit. This will open up a comparison page like this:
Click Match or No Match until you run out of mismatched results, that is, pairs where the existing label and the model's prediction disagree.
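To illustrate the idea behind re-labeling, here is a minimal sketch of how mismatched results could be selected: keep the labeled pairs whose label disagrees with the thresholded model score, and show the most confident disagreements first. The LabeledPair type, the 0.5 threshold, and the ordering are assumptions made for this example; the post does not describe how the application actually picks or orders these pairs.

```scala
object RelabelSketch {
  // Hypothetical record: a human label plus the model's match probability.
  final case class LabeledPair(id: Long, label: Boolean, score: Double)

  // Assumed selection rule: a pair is "mismatched" when the thresholded
  // model prediction differs from the human label. Pairs where the model
  // is most confident come first, since those are the likeliest mistakes.
  def disagreements(pairs: Seq[LabeledPair], threshold: Double = 0.5): Seq[LabeledPair] =
    pairs
      .filter(p => (p.score >= threshold) != p.label)
      .sortBy(p => -math.abs(p.score - threshold))
}
```

Calling RelabelSketch.disagreements on a batch of labels returns only the pairs worth a second look, so the review queue shrinks as labels are corrected or the model improves.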
Submission
Now that you have a trained model, you can create a submission for the contest. Click submissions from the home page and then Create. Now enter the information as below to mimic the best model we've built so far:
Now click run. The output will be in data/submission, and you'll see a page where you can enter the F1, precision, and recall reported by the challenge website.
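For reference, the three numbers are related by the standard definitions below. This is only a sketch for sanity-checking scores against your own labeled data; the MatchMetrics object and its names are made up for this example, and the official numbers are whatever the challenge website reports.

```scala
object MatchMetrics {
  final case class Scores(precision: Double, recall: Double, f1: Double)

  // tp: predicted matches that are true matches
  // fp: predicted matches that are not true matches
  // fn: true matches the model failed to predict
  def score(tp: Int, fp: Int, fn: Int): Scores = {
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall    = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    // F1 is the harmonic mean of precision and recall.
    val f1 =
      if (precision + recall == 0.0) 0.0
      else 2 * precision * recall / (precision + recall)
    Scores(precision, recall, f1)
  }
}
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a blocker that misses true matches (low recall) cannot be rescued by a very precise model.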
Creating Blockers
Creating a new blocker requires getting into the code base. Go to library/src/main/scala/oncpmc/blocking/ and create a new class. For example, let's say we want to block on SSN, filter out obviously invalid SSNs, and require that the patients' first names do not match. You'd write a class such as this:
    class SSNFilteredBlocker extends BlockerBase {
      override val name: String = "SSNFiltered"

      // Keep only candidate pairs whose first names differ.
      override def filterPair(p1: Patient, p2: Patient): Boolean = {
        p1.first != p2.first
      }

      // Drop records with a missing or obviously invalid SSN.
      override def filter(r: Patient): Boolean = {
        r.ssn.nonEmpty &&
          r.ssn.get.replaceAllLiterally("-", "").toCharArray.distinct.length > 1 &&
          r.ssn.get.replaceAllLiterally("-", "") != "123456789"
      }

      // Records with the same SSN fall into the same block.
      override def group(r: Patient): String = {
        r.ssn.getOrElse("")
      }
    }
Now go to library/src/main/scala/oncpmc/helpers/Conf.scala and add your class to the list.
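To make the three overrides concrete, here is a minimal, self-contained sketch of how a blocker's filter, group, and filterPair could be combined to produce candidate pairs. Everything in it is a stand-in invented for illustration: the Patient case class, the trimmed-down BlockerBase trait, SsnSketchBlocker, and candidatePairs. The real classes in the repository run on Spark and will differ in fields and signatures.

```scala
object BlockingSketch {
  // Hypothetical stand-ins for the project's Patient and BlockerBase.
  final case class Patient(id: Int, first: String, ssn: Option[String])

  trait BlockerBase {
    def name: String
    def filter(r: Patient): Boolean                   // keep this record at all?
    def group(r: Patient): String                     // blocking key
    def filterPair(p1: Patient, p2: Patient): Boolean // keep this candidate pair?
  }

  // Cut-down SSN blocker, repeated here so the sketch compiles on its own.
  object SsnSketchBlocker extends BlockerBase {
    val name = "SsnSketch"
    def filter(r: Patient): Boolean =
      r.ssn.map(_.replace("-", ""))
        .exists(d => d.distinct.length > 1 && d != "123456789")
    def group(r: Patient): String = r.ssn.getOrElse("")
    def filterPair(p1: Patient, p2: Patient): Boolean = p1.first != p2.first
  }

  // Assumed driver: filter records, bucket them by key, then emit every
  // unordered pair within a bucket that passes filterPair.
  def candidatePairs(blocker: BlockerBase, patients: Seq[Patient]): Seq[(Patient, Patient)] =
    patients
      .filter(blocker.filter)
      .groupBy(blocker.group)
      .values
      .flatMap { bucket =>
        for {
          (a, i) <- bucket.zipWithIndex
          b <- bucket.drop(i + 1)
          if blocker.filterPair(a, b)
        } yield (a, b)
      }
      .toSeq
}
```

Only records that share a blocking key are ever compared, which is what keeps the number of candidate pairs far below the all-pairs count. Note that grouping on the raw SSN string, as the example blocker does, puts "111-22-3333" and "111223333" in different blocks.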
Creating Features
As with blockers, creating features requires going into the code. Go to library/src/main/scala/oncpmc/model/ and create a new class that extends FeatureBuilder. Then go to library/src/main/scala/oncpmc/helpers/Conf.scala and add your class to the list.
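As a rough idea of what a feature might look like, here is a self-contained sketch of a last-name similarity feature based on Levenshtein edit distance. The FeatureBuilder trait shown here is a guess at a plausible shape (a name plus a function from a pair of patients to a number), not the project's actual interface, and the Patient fields are likewise assumed; check the existing classes in library/src/main/scala/oncpmc/model/ for the real signatures.

```scala
object FeatureSketch {
  // Hypothetical stand-ins; the real Patient and FeatureBuilder may differ.
  final case class Patient(first: String, last: String)

  trait FeatureBuilder {
    def name: String
    def build(p1: Patient, p2: Patient): Double
  }

  // Classic dynamic-programming edit distance, keeping one row at a time.
  def levenshtein(a: String, b: String): Int = {
    // prev(j) = distance between the processed prefix of a and b.take(j)
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Similarity in [0, 1]: 1.0 for identical names, lower as edits increase.
  object LastNameSimilarity extends FeatureBuilder {
    val name = "LastNameSimilarity"
    def build(p1: Patient, p2: Patient): Double = {
      val a = p1.last.trim.toLowerCase
      val b = p2.last.trim.toLowerCase
      val longest = math.max(a.length, b.length)
      // Two empty names carry no evidence of a match, so score them 0.
      if (longest == 0) 0.0 else 1.0 - levenshtein(a, b).toDouble / longest
    }
  }
}
```

A continuous similarity like this gives the model more to work with than an exact-match flag, since typos and transcription errors in names still produce a high score.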
Acknowledgments
Thanks to our very own Zoë Frances Weil for her invaluable insight and contributions.