The Flavors of Data Science and Engineering

Data Science means something different to everyone and is more of a marketing term than a job description nowadays. That said, certain definitions for it and the related disciplines are starting to emerge from what I've seen, so I wanted to write down my perceptions.

Read More

ONC Patient Matching Challenge: Part 2

This is the second of a two-part series on tackling the ONC Patient Matching Challenge. In the first part we went over the background and high-level approach, while in this part we cover the matching engine that was built (and how to use it). As a note, as of this blog post Mindful Machines is in fifth place in the challenge using this approach. As part of this blog post series, all the code and data used to achieve our results have been open sourced.


The technologies used to build the matching engine are:

  • Scala
  • Play
  • Spark
  • PostgreSQL
  • Docker

This is partially a proof of concept of using Scala for end-to-end data processing and machine learning.


Follow the directions on the project GitHub. Once you go to localhost:9000 you should see this:

Running Blockers

The code comes with a decent set of initial blockers and the first step is to run them. This caches the blockers to disk and populates sample rows for labeling in the database. You can do this by clicking the blockers link. This will give you a page like this:

Simply select all the blockers and click run. It will take around an hour to generate the blockers so feel free to get some lunch.


The code comes with a set of labeled data (called Initial) so this step is optional.

The next step would be to label potential matches as either a match or not a match. You can do this by going to the main page http://localhost:9000/ and selecting a blocker from the drop down list then clicking comparison.

This will give you a page like this; simply enter a name for this batch of labels and keep going. Every time you click Match or No Match, another random potential match will be shown for you to label.

Building a Model

The next step is to build a model, from the home page go to models and then click Create at the bottom. This will give you a page like this.

Fill out the fields in this manner for this test run which gives the best performance we've seen so far:


This step is optional; however, if you want to check whether any labels were potential mistakes, you can use the model predictions as a benchmark. Simply go to models from the home page and click update for the model you wish to edit. This will open a comparison page like this:

Simply click Match or No Match until you run out of mis-matched results.


Now that you have a trained model you can create a submission for the contest. Simply click submissions from the home page and then Create. Now enter information as below to mimic the best model we've built so far:

Now click run. The model will be in data/submission and you'll see a page where you can enter the F1, Precision and Recall from the website.

Creating Blockers

Creating a new Blocker requires getting into the code base. Go to library/src/main/scala/oncpmc/blocking/ and create a new class. For example, let's say we want to block on SSN, filter out obviously invalid SSNs, and keep only pairs where the patients' first names do not match. You'd write a class such as this:

class SSNFilteredBlocker extends BlockerBase {
  override val name: String = "SSNFiltered"

  // Keep only pairs whose first names do not match
  override def filterPair(p1: Patient, p2: Patient): Boolean = {
    p1.first != p2.first
  }

  // Filter out records with missing or obviously invalid SSNs
  override def filter(r: Patient): Boolean = {
    r.ssn.nonEmpty &&
      r.ssn.get.replaceAllLiterally("-", "").toCharArray.distinct.length > 1 &&
      r.ssn.get.replaceAllLiterally("-", "") != "123456789"
  }

  // Block on the normalized SSN
  override def group(r: Patient): String = {
    r.ssn.get.replaceAllLiterally("-", "")
  }
}

Now go to library/src/main/scala/oncpmc/helpers/Conf.scala and add your class to the list.

Creating Features

Similar to Blockers, creating features requires going into the code. Go to library/src/main/scala/oncpmc/model/ and create a new class that extends FeatureBuilder. Now go to library/src/main/scala/oncpmc/helpers/Conf.scala and add your class to the list.
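Since the actual FeatureBuilder interface is defined in the code base, the sketch below is purely hypothetical: the trait definition, method signature, and the simplified Patient fields are assumptions for illustration, not the project's real API.

```scala
// Hypothetical sketch only: the real FeatureBuilder trait lives in
// library/src/main/scala/oncpmc/model/ and may have a different signature.
case class Patient(first: String, last: String, mrn: Long)

trait FeatureBuilder {
  def name: String
  def build(p1: Patient, p2: Patient): Double
}

// Example feature: absolute difference between Medical Record Numbers
class MrnDistanceFeature extends FeatureBuilder {
  override val name: String = "mrnDistance"
  override def build(p1: Patient, p2: Patient): Double =
    math.abs(p1.mrn - p2.mrn).toDouble
}
```

The real trait may also expose things like feature vectors or missing-value handling; check the existing implementations in that directory for the exact contract before writing your own.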


Thanks to our very own Zoë Frances Weil for her invaluable insight and contributions.

ONC Patient Matching Challenge: Part 1

This is the first of a two-part series on tackling the ONC Patient Matching Challenge. The first part reviews background and high-level topics, while the second part covers the matching engine that was built (and how to use it). As a note, as of this blog post Mindful Machines is in fifth place in the challenge using this approach. As part of this blog post series, all the code and data used to achieve our results have been open sourced.


The goal of the Patient Matching Challenge is to determine if two medical patients are the same person or if they are different people based on their demographic data. For example:

Field        Patient 1      Patient 2
First Name   James          Jim
Last Name    Smith          Smith
SSN          055-32-1345    055-32-1344
DOB          01/02/1985     01/01/1111

Are these patients two individuals or the same individual? As the example illustrates, even for someone we would consider the same person, there are many ways the raw data fields can differ due to things such as nicknames, data entry bugs, missing data, etc.

The fields we have access to are Last Name, First Name, Middle Name, Suffix, DOB, Gender, SSN, Address, Mother's Maiden Name, Medical Record Number, Phone, Email, Aliases.

The data set in question has 1 million patient records and was generated artificially, but based on a real data set. There are no labels in the data and no labeled training data set is provided. As a result, part of the challenge is to generate your own training data set.

The final submission for the contest is a set of pairs of patients that we believe match, which will be evaluated against a golden data set based on F1 score (precision and recall will also be provided back for submissions).


There are a number of challenges that come with working with this data.

Missing Data

The various fields are often missing outright.

Field               Percentage of Values Missing
last 7.80%
first 7.86%
middle 63.54%
suffix 99.14%
dob 6.20%
gender 5.59%
ssn 27.96%
address1 11.18%
address2 34.15%
zip 11.46%
mothersMaidenName 95.31%
mrn 55.84%
city 11.57%
state 11.66%
phone 6.72%
phone2 86.04%
email 77.51%
alias 94.84%

Invalid Data

There are a non-trivial number of cases of values such as 111-11-1111 for SSN or 123-456-7890 for phone number.

False Matches

Given the above, one might think that after filtering out obviously invalid data, a field such as SSN could be used for a first pass. Unfortunately, there are far more invalid matches on SSN than there are valid matches. In other words, "fake" SSNs, and by extension false positive matches, are very common in the data.

Data Entry Errors

There are a number of data entry errors, such as digits being transposed or changed. For example, the SSN 45-23-4567 may come in as 45-23-4566.

False Mis-Matches

In addition to the above, in many cases fields that a human would consider the same have different raw values in them. For example, Jim and James. Addresses are particularly plagued by this since there are so many ways to write the same actual address (East, E, Street, St, etc.).
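As a toy illustration of how normalization collapses these variants (the abbreviation list below is a made-up sample, not the one used in the project):

```scala
// Map common address tokens to a single canonical abbreviation
val abbreviations = Map(
  "east" -> "e", "west" -> "w",
  "street" -> "st", "avenue" -> "ave"
)

// Lowercase the address, split into tokens, and canonicalize each token
def normalizeAddress(address: String): String =
  address.toLowerCase
    .split("\\s+")
    .map(token => abbreviations.getOrElse(token, token))
    .mkString(" ")

normalizeAddress("123 East Main Street") // "123 e main st"
normalizeAddress("123 E Main St")        // "123 e main st", the same canonical form
```

Both spellings now reduce to the same string, so a simple equality or distance check can recognize them as the same address.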

Changes Over Time

This data set is not from a single point in time but from records that would come in over a long period of time. As a result people's phone numbers, addresses, genders and even names can change over time.

Interesting Observations

Medical Record Number

The Medical Record Number (MRN) field is never identical across two patients. However, very often the values are close (+/- 5000) for the same individual.


The basic approach is broken down into a few linked components. First we generate training data by manually labeling potential matches. Then we use this training data to generate a supervised machine learning model. Finally we use this model on all the data (labeled and not) to generate potential matches.


The naive approach would be to first generate all possible combinations of patient records and then somehow score how well each pair matches. However, given that we have 1 million patient records, the total number of combinations (n(n-1)/2) would be roughly 500 billion pairs. As a result, we instead first block the data into smaller pieces and generate all possible combinations within each piece. For example, we might block by SSN and then, for all records that have the same SSN, generate all pair-wise combinations for scoring.
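The blocking step can be sketched in plain Scala (the class and field names here are simplified illustrations, not the project's actual API):

```scala
// Simplified patient record for illustration
case class Patient(id: Int, ssn: String)

// Group records by SSN, then generate all pair-wise combinations within each group
def blockBySsn(patients: Seq[Patient]): Seq[(Patient, Patient)] =
  patients
    .groupBy(_.ssn)
    .values
    .flatMap(group => group.combinations(2).map { case Seq(a, b) => (a, b) })
    .toSeq

val patients = Seq(
  Patient(1, "055-32-1345"),
  Patient(2, "055-32-1345"),
  Patient(3, "123-45-6789")
)
// Only patients 1 and 2 share an SSN, so a single candidate pair is produced
val candidates = blockBySsn(patients)
```

Instead of 3 candidate pairs from the full cross product, blocking yields just 1, and the savings grow quadratically with the data size.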

Some additional considerations when blocking, especially given the next part, are to filter out invalid data and potentially even to filter out pairs which are already covered by a different blocker.


Now that we have used blocking to generate potential matches, the next step is to manually label those matches to generate a training data set. This is a laborious process, which is why you want the blockers to filter out as many unnecessary or plainly invalid matches as possible.


Now that we have a training data set, the next step is to build a model using it. We could have used a heuristic or manual approach; however, a model is more flexible and allows the approach to be extended to arbitrary data more easily.

The feature set for the model consists mainly of Levenshtein distances between fields for the two records under comparison. In other cases, Hamming distances were used (for fixed-width fields such as SSN), set comparisons (phone numbers) or numeric differences (MRN). Given the strong interaction effects between fields and the relative lack of data, the best model turned out to be a relatively small Random Forest model.
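To make the two main distance features concrete, here are minimal self-contained implementations (illustrative only; the project's own feature code may differ):

```scala
// Levenshtein edit distance via dynamic programming:
// dp(i)(j) = cost of transforming the first i chars of a into the first j chars of b
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(
      math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1), // deletion / insertion
      dp(i - 1)(j - 1) + cost                       // substitution
    )
  }
  dp(a.length)(b.length)
}

// Hamming distance: number of differing positions, for equal-length fields like SSN
def hamming(a: String, b: String): Int =
  a.zip(b).count { case (x, y) => x != y }

levenshtein("James", "Jim")        // 3 edits
hamming("055321345", "055321344")  // 1 differing digit
```

Levenshtein handles variable-length fields like names, while Hamming is cheaper and more natural for fixed-width fields where positions align one-to-one.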


The labeling process, being manual, is fraught with potential errors, and sometimes the model is better than we are. As a result, we want to review any mismatches the model has to ensure each is an issue with the model rather than with the labels we put on the data.


Finally we generate all potential matches from our blockers, score them using our model and submit the results.

Next Steps

The best F1 score on a submission for Mindful Machines is 0.978 with a precision of 0.991. The offline model performance is an F1 of 0.995 and a precision of 0.994. That implies there are two causes for the lower submission score:

  • The training labels are overly pessimistic and mark actual matches as not matches
  • The blockers we have are not sufficient and we're missing some subset of potential matches


Thanks to our very own Zoë Frances Weil for her invaluable insight and contributions.

Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goal is to provide a framework that is simple to use, automatically saves/loads the output of tasks, and provides support for versioning.

Read More

Confessions Of a Scala Enthusiast

I feel terrible dragging LadyDi into another controversy. But this is not so much a poor reflection on her as it is on me. You see, I have spoken endlessly about the merits of Scala and Functional Programming. For a time, I was even the most viewed writer on Scala on Quora because I wrote so much on the topic!

Unfortunately I have come to realize I have been a lot of talk and very little action. Whoops.

Recently, I have come to take a good hard look at my code as well as the libraries I have released into the open source community, and you know what? Most of it is not functional. In most places, I just use (abuse, really) the functional properties somewhat enforced in Scala to make my OOP more stable. But that's really just about it. In reality I have written very little functional code, and no strictly functional program. As evident from my passionate remarks on the topic, I am obviously really excited about FP, and even understand how to go about it; but somehow, when push comes to shove, I remain stuck in my un-functional ways.

Upon speaking to other fellow engineers at work and elsewhere, I have come to see that there are many strong, experienced, creative, innovative engineers who have trouble picking up Scala as a functional programming language. But I had trouble understanding why that is, considering that many concepts in Functional Programming have very long, deep roots in the history of computer science as well as analytical philosophy.

Let's take a look at one of the most critical concepts in Functional Programming: referential transparency. Its origins date back to Bertrand Russell's Principia Mathematica. Sure, at first glance you may dismiss this as just another old part of "mathematica", but William Van Orman Quine adopted this concept in his Word and Object published in 1960 where he coined the term "referential transparency".

RT was then thrust into the spotlight of computer programming just 7 years later in Christopher Strachey's monumental piece Fundamental Concepts in Programming Languages (1967).
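A tiny Scala illustration of what referential transparency means in practice (the function names here are my own, chosen for the example):

```scala
// Referentially transparent: any call can be replaced by its value
// without changing the behavior of the program
def add(a: Int, b: Int): Int = a + b

// NOT referentially transparent: the result of calling it depends on,
// and mutates, hidden state outside the function
var counter = 0
def addAndCount(a: Int, b: Int): Int = { counter += 1; a + b }

val x = add(2, 3) // always 5; substituting the literal 5 here is always safe
// Substituting 5 for addAndCount(2, 3) would skip the increment of `counter`,
// silently changing the program's behavior
```

This substitutability is what makes referentially transparent code easy to reason about, test, and parallelize.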

And that's just RT. I have not and will not even begin to talk of lambda calculus  - doing so would derail this post completely. 

So let me reel this post back in. If FP has such footing in both Mathematics and Computer Science, why do most engineers have such a hard time adopting it in how they think about code?

Well, to start, all of the above are mainly theoretical concepts. Software Engineering is by nature an applied field, so most software engineers are employed in industry rather than academia. And so most folks usually don't go beyond a Bachelor's or Master's degree (which is a lot, given that many of the most dominant and influential figures in the Tech Industry are college dropouts). A core part of almost any undergraduate CS curriculum is a subject typically titled "Data Structures and Algorithms". It is quite the fixture, actually. Many of the most prestigious employers in technology (Google, FB, Amazon, ... all the way to Slack, etc.) outright inform candidates, regardless of seniority, to expect some questions directly from "Data Structures and Algorithms". In fact, Google is notorious for turning away renowned experts in fields it was hiring for because they failed the DSA part of its rigid interview process.

So what does a typical DSA course consist of and how is this pertinent to our topic here? 

It's actually mostly made up of search and sorting algorithms, problem-solving strategies, and space and time complexity problems. Here is the root of our headache: it is not really the content of the course, but how fundamental the course has become to landing a job in industry and having a career.

Let's take just two minor topics that feature prominently in any DSA course:

  • hash tables
  • the QuickSort algorithm (or MergeSort or any D&C algorithm really)

From a functional outlook, any change to a hash table would require creating a copy of the original hash table. This is obviously very inefficient and fundamentally goes against the underlying premise of a course like DSA: solve problems, and efficiently so, even if you are going to trade off reusability and safety. As far as sorting is concerned, again we are faced with the matter of efficiency, as in-place modifications are axiomatically impossible from a functional perspective.

So from a very impressionable point in our time as software engineers we are repeatedly told to think about programs almost strictly in terms of efficiency gains and trade-offs; so much so that "good code" in some places has become synonymous with code that provides "efficient" solutions. We are not given the chance to think about, let alone meditate on, other avenues that contribute to writing "good" code. Functional programming is almost never held in the same regard as imperative programming in most CS programs.

So yeah, it's hard to code functionally, and even more importantly, to think functionally, when the dominant culture is one that forbids a multi-faceted and rich discussion on what is important to solving problems.

Data Pipeline and Task Management: The Unsolvable Problem?

There are probably more well-known data pipeline dependency management and scheduling frameworks than you can name in one breath. Is there a reason for that beyond mere not-invented-here syndrome?

Read More

We're Back! Introducing LadyDi!

LadyDi: Code less, build more

Read More

Spark, Turn Off the Light on the CLI

We know that Hadoop can be 235 times slower than the command line for smaller data sets, does the same apply for Spark?

Read More

Scala: A Literary Review

Originally, I wanted to write a post about reviewing Apache Spark's Hyper-Parameter Optimization, the many ways in which it could be improved, and unveil a library I have been putting together which integrates seamlessly with Spark's existing ML library, etc. 

But over the holiday break, I was approached by many people around the web with questions like "Why is Scala so Popular?" or "Haskell vs Scala, which would you choose?" or "Should I use Scala or Python when using Spark?". So I decided to first write a post about this language I love so much. 

My appreciation goes beyond the technical norms in which this subject is usually covered - it transcends code and appeals to the overarching topic of human communication; modalities of thought and expression at large. Pondering on this, the image that appears in my head is one that has been etched onto my soul; that of Simone Weil.

Simone Weil could pack into a short sentence what many writers would need an entire volume to express, always supposing they had insights of such depth to express in the first place.

I don't think there is a medium that better encourages engineers to do just that in their domain than Scala. Writing short and expressive code (Ruby, Python?) to produce type-safe and high-performance apps (Java, C++?) is a duality that only merges harmoniously when using Scala. It offers a level of expressive syntax rarely seen in other compiled languages; code is strongly typed and supports both multiple inheritance and mixin features.

So one ends up with short, expressive syntax, cutting unnecessary punctuation, and condensing map, filter, reduce and other higher-order operations to simple one-liners.
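For instance, a chain of higher-order operations reads as a single expressive line:

```scala
// Sum the squares of the even numbers from 1 to 10 in one expression
val result = (1 to 10).filter(_ % 2 == 0).map(n => n * n).sum
// 4 + 16 + 36 + 64 + 100 == 220
```

The same computation in imperative Java or C++ would typically need a loop, an accumulator variable, and a conditional, with the intent spread across several statements.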

From the outset, Scala really encourages switching from mutable data structures to immutable ones and from regular methods to pure functions. This has some immediate impacts, starting with a reduction in issues rooted in unintentional side effects, which are common in large code bases. So your code will be safer, more stable, and easier to understand. Coding like so will imprint on the writer new ways to view the concepts of data mutability, higher-order functions, sophisticated type systems, etc.

And quite frankly, concepts like abstract algebra and type theory, which would otherwise decompose along with the academic papers they were written on, are brought to life by Scala's sophisticated type system and its support for custom type declarations.

There are some downsides, however, as there is no singularly adhered-to guide for best practices in Scala. Python, for example, aims to implement a single best practice and even has a standard: PEP8. Scala has several competing style guides, about which people continue to debate endlessly. This is partly due to the fact that while Scala encourages functional practices, it does not enforce them as religiously as a language like, say, Haskell or Erlang would. But I tend to view such a "deficiency" as an upside: to decipher Scala code is to first decipher the writer's style, ideas, thought patterns, and persona.

Wikipedia Data in Spark and Scala

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

Setting Up Spark Notebooks on EC2 (No VPN)

Getting up and running with Spark Notebooks (Part 1)

Read More

Spark EC2 Setup and Workflow

So how do we run and deploy code using Scala and Spark in EC2?

Read More

The Limitations of Apache Spark

... all that's Spark doesn't shine.

Read More

Setting up a personal VPN to access AWS instances

The goal of this post is to lay the groundwork for running a Spark driver against a Spark cluster in a AWS VPC. Specifically by setting up a VPN to access VPC instances.

Read More

In the Spirit of Thanksgiving

We don't take ourselves seriously but are two curious folks passionate about applying Machine Learning and Deep Learning to industry and sharing that knowledge with the broader engineering community.

Read More