[This is the first part of a two-part introduction: Feature Generation and Feature Selection]
If you follow our blog, you will have noticed that we disappeared for a little over three months. Our full-time jobs and contributions to open-source projects take precedence over writing. Not enough hours in the day. Sad, but true.
This re-emergence comes with the release of our latest open-source library: LadyDi. As some of you may know, I name all projects I lead after powerful women in history. This one, as the name suggests, is named after Diana, Princess of Wales. She was more than a pretty face; she was an icon who had the entire British Monarchy wrapped around her finger (one could argue that she planted the seeds for their continued relevancy today).
Like its namesake, LadyDi aims to think outside the box and reinvent parts of a system that is very hard to keep up with -- in this case, feature encoding, transformation, and selection in Apache Spark.
Apache Spark has been moving too fast without giving itself time to mature. Each version introduces new bugs -- the fixing of which creates new bugs. One cynical read is that all of this is intentional: the harder it is to use and maintain, the more incentive there is to hire a third-party provider like Databricks (founded by the creators of the Apache Spark project). LadyDi will not solve all your problems, but it will help with the pain that is feature encoding and selection using Apache Spark's transformers. It leaves your code cleaner and eliminates the boilerplate you inevitably accrue when chaining a variety of Spark transformers.
The fundamental issue arises from an inconsistency in the input and output requirements of Apache Spark's various transformers. This leads to boilerplate and headaches:
```scala
val tokenizerA = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("textTokenRaw")

val removerA = new StopWordsRemover()
  .setInputCol(tokenizerA.getOutputCol)
  .setOutputCol("textToken")

val hashingTFA = new HashingTF()
  .setNumFeatures(100)
  .setInputCol(removerA.getOutputCol)
  .setOutputCol("featuresRaw")

val standardScaler = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(tokenizerA, removerA, hashingTFA, standardScaler))

val pipelineModel = pipeline.fit(featureData)
val hashedData = pipelineModel.transform(featureData)
hashedData.select("x", "y", "features").as[EncodedFeatures]
```
It's pretty gross. And if, right after that, you wanted to use another encoder -- say, VectorAssembler -- you'd have to start from scratch:
```scala
val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b", "c"))
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(assembler))

val pipelineModel = pipeline.fit(data)
data.take(10).foreach(println(_))

val assembledData = pipelineModel.transform(data)
assembledData.select("label", "features")
```
LadyDi gets rid of this nonsense, and you can chain transformers as you please, like so:
```scala
def stringFeature() =
  new Tokenizer() ::
  new StopWordsRemover() ::
  new HashingTF().setNumFeatures(10) ::
  new Normalizer() ::
  Nil
```
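LadyDi's internals aren't shown here, but the core idea behind chaining -- walking a list of stages and wiring each one's output column into the next one's input, so you never write `setInputCol`/`setOutputCol` by hand -- can be sketched without Spark. Everything below (`Stage`, `wire`, the column names) is a hypothetical stand-in for illustration, not LadyDi's actual API:

```scala
// Hypothetical stand-in for a Spark transformer: each stage reads one
// column and writes another, and neighbouring stages must line up.
case class Stage(name: String, var inputCol: String = "", var outputCol: String = "")

// Wire a chain so each stage consumes the previous stage's output.
// This is the boilerplate that setInputCol/setOutputCol spell out by
// hand in plain Spark, done once, generically, over any stage list.
def wire(stages: List[Stage], firstInput: String): List[Stage] = {
  stages.zipWithIndex.foreach { case (stage, i) =>
    stage.inputCol  = if (i == 0) firstInput else stages(i - 1).outputCol
    stage.outputCol = if (i == stages.length - 1) "features" else s"${stage.name}_out"
  }
  stages
}

val chain = wire(Stage("tokenizer") :: Stage("stopwords") :: Stage("hashingTF") :: Nil, "text")
chain.foreach(s => println(s"${s.name}: ${s.inputCol} -> ${s.outputCol}"))
```

Because the stage list is an ordinary `List`, building a different encoding is just building a different list -- no rewriting of column plumbing required.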
LadyDi also offers what Apache Spark does not: automated feature selection. More on that in Part II! But for now, feel free to check it all out on GitHub! A README is soon to follow as well!