ONC Patient Matching Challenge: Part 2

This is the second of a two-part series on tackling the ONC Patient Matching Challenge. The first part covered the background and high-level approach; this part covers the matching engine that was built and how to use it. As of this blog post, Mindful Machines is in fifth place in the challenge using this approach. As part of this series, all the code and data used to achieve our results have been open sourced.


The technologies used to build the matching engine are:

  • Scala
  • Play
  • Spark
  • PostgreSQL
  • Docker

This is partially a proof of concept of using Scala for end-to-end data processing and machine learning.


Follow the directions on the project GitHub page. Once the application is running, go to localhost:9000 and you should see this:

Running Blockers

The code comes with a decent set of initial blockers and the first step is to run them. This caches the blockers to disk and populates sample rows for labeling in the database. You can do this by clicking the blockers link. This will give you a page like this:

Simply select all the blockers and click run. It will take around an hour to generate the blockers so feel free to get some lunch.


The code comes with a set of labeled data (called Initial) so this step is optional.

The next step is to label potential matches as either a match or not a match. You can do this by going to the main page at http://localhost:9000/, selecting a blocker from the drop-down list, and then clicking comparison.

This will give you a page like this. Simply enter a name for this batch of labels and keep going. Every time you click Match or No Match, another random potential match will be shown for you to label.

Building a Model

The next step is to build a model. From the home page, go to models and then click Create at the bottom. This will give you a page like this.

For this test run, fill out the fields in this manner, which gives the best performance we've seen so far:


This step is optional. However, if you want to check whether any labels were potential mistakes, you can use the model predictions as a benchmark. Simply go to models from the home page and click update for the model you wish to edit. This will open up a comparison page like this:

Simply click Match or No Match until you run out of mismatched results.


Now that you have a trained model you can create a submission for the contest. Simply click submissions from the home page and then Create. Now enter information as below to mimic the best model we've built so far:

Now click run. The submission will be written to data/submission, and you'll see a page where you can enter the F1, Precision and Recall reported by the challenge website.
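As a quick sanity check on the numbers you enter, F1 is the harmonic mean of precision and recall, so the three values should agree with each other. A minimal sketch follows; the function name `f1` and the example values are illustrative only and are not part of the project code or our actual results.

```scala
// F1 is the harmonic mean of precision and recall.
// Returns 0.0 when both inputs are 0 to avoid dividing by zero.
def f1(precision: Double, recall: Double): Double =
  if (precision + recall == 0.0) 0.0
  else 2.0 * precision * recall / (precision + recall)

// Illustrative numbers only, not challenge results.
val example: Double = f1(0.9, 0.8)
```

If the F1 reported by the website differs noticeably from the value computed this way, it is worth double-checking the precision and recall you typed in.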

Creating Blockers

Creating a new Blocker requires getting into the code base. Go to library/src/main/scala/oncpmc/blocking/ and create a new class. For example, let's say we want to block on SSN, filter out obviously invalid SSNs, and only keep pairs of patients whose first names do not match. You'd write a class such as this:

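class SSNFilteredBlocker extends BlockerBase {
  override val name: String = "SSNFiltered"

  // Keep only candidate pairs whose first names differ.
  override def filterPair(p1: Patient, p2: Patient): Boolean = {
    p1.first != p2.first
  }

  // Drop records with a missing or obviously invalid SSN
  // (a single repeated digit, or the placeholder 123456789).
  override def filter(r: Patient): Boolean = {
    r.ssn.nonEmpty &&
      r.ssn.get.replaceAllLiterally("-","").toCharArray.distinct.length > 1 &&
      r.ssn.get.replaceAllLiterally("-","") != "123456789"
  }

  // Blocking key. The body was missing from the original listing; grouping on the
  // dash-stripped SSN is an assumption based on the description above.
  override def group(r: Patient): String = {
    r.ssn.get.replaceAllLiterally("-","")
  }
}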
Now go to library/src/main/scala/oncpmc/helpers/Conf.scala and add your class to the list.
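To make the roles of filter, group and filterPair more concrete, here is a small self-contained sketch of how the three could fit together to produce candidate pairs. The `Patient` case class, the `SketchBlocker` trait and its `candidates` method are simplified, hypothetical stand-ins written for this example; the project's BlockerBase runs on Spark and may well be organized differently.

```scala
// Hypothetical, simplified stand-ins for the project's Patient and BlockerBase.
final case class Patient(id: Int, first: String, ssn: Option[String])

trait SketchBlocker {
  def filter(r: Patient): Boolean                   // keep or drop a single record
  def group(r: Patient): String                     // blocking key for a record
  def filterPair(p1: Patient, p2: Patient): Boolean // keep or drop a candidate pair

  // Filter records, bucket them by key, pair records within each bucket,
  // then filter the resulting pairs.
  def candidates(rs: Seq[Patient]): Set[(Int, Int)] =
    rs.filter(filter)
      .groupBy(group)
      .values
      .flatMap { bucket =>
        for {
          a <- bucket
          b <- bucket
          if a.id < b.id && filterPair(a, b)
        } yield (a.id, b.id)
      }
      .toSet
}

object SSNSketch extends SketchBlocker {
  private def digits(r: Patient): String = r.ssn.getOrElse("").replace("-", "")

  def filter(r: Patient): Boolean = {
    val d = digits(r)
    d.nonEmpty && d.distinct.length > 1 && d != "123456789"
  }
  def group(r: Patient): String = digits(r)
  def filterPair(p1: Patient, p2: Patient): Boolean = p1.first != p2.first
}

// Made-up records for illustration only.
val patients = Seq(
  Patient(1, "Jon", Some("111-22-3333")),
  Patient(2, "John", Some("111223333")),
  Patient(3, "Jon", Some("111-22-3333")),
  Patient(4, "Ann", Some("000-00-0000")),
  Patient(5, "Anne", Some("000000000")),
  Patient(6, "Bob", None)
)

val pairs: Set[(Int, Int)] = SSNSketch.candidates(patients)
```

In this toy data, patients 4, 5 and 6 are dropped by filter (repeated-digit or missing SSN), and the pair of patients 1 and 3 is dropped by filterPair because their first names are identical, so only the pairs (1, 2) and (2, 3) remain as candidates for labeling.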

Creating Features

Similar to Blockers, creating features requires going into the code. Go to library/src/main/scala/oncpmc/model/ and create a new class that extends FeatureBuilder. Now go to library/src/main/scala/oncpmc/helpers/Conf.scala and add your class to the list.
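As an illustration of the kind of logic a feature might contain, the sketch below computes the edit distance between the last names of two patients. The `SketchFeatureBuilder` trait, its `build` method and the two-field `Patient` class are hypothetical and exist only to keep the example self-contained; check the FeatureBuilder class in the code base for the actual methods you need to implement.

```scala
// Hypothetical stand-ins; the project's Patient and FeatureBuilder may look different.
final case class Patient(first: String, last: String)

trait SketchFeatureBuilder {
  def name: String
  def build(p1: Patient, p2: Patient): Double
}

object LastNameEditDistance extends SketchFeatureBuilder {
  val name: String = "LastNameEditDistance"

  // Dynamic-programming Levenshtein distance, keeping only the previous row.
  private def levenshtein(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Case-insensitive edit distance between the two last names.
  def build(p1: Patient, p2: Patient): Double =
    levenshtein(p1.last.toLowerCase, p2.last.toLowerCase).toDouble
}

// Made-up names for illustration only.
val distance: Double =
  LastNameEditDistance.build(Patient("Jon", "Smith"), Patient("John", "Smyth"))
```

A distance like this lets the model treat near-identical spellings such as Smith and Smyth as weak evidence of a match rather than a hard mismatch.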


Thanks to our very own Zoë Frances Weil for her invaluable insight and contributions.