Data Pipeline and Task Management: The Unsolvable Problem?

I often like to consider the why behind things, especially when those things don’t make sense at first glance. For example, it’s usually good practice not to reinvent the wheel and instead to reuse existing frameworks. You probably won’t do a better job than the framework’s authors, and you’ll simply hit, head first and painfully, all the same edge cases they’ve already solved. Yet one area where I and plenty of others seem guilty of this sin is managing data pipelines and big data jobs.

There are probably more well-known data pipeline dependency management and scheduling frameworks than you can name in one breath. It seems like every company at some point decides to build its own: Yahoo’s Oozie, Airbnb’s Chronos, Airbnb’s Airflow, LinkedIn’s Azkaban, Spotify’s Luigi, and Intent Media’s Mario. Those are just the ones that have been open sourced and that I can name off the top of my head. Google and Amazon provide their own cloud solutions as well, and that’s before getting into the numerous enterprise and commercial offerings.

The sheer number of frameworks makes you scratch your head. You just don’t see this happen as much in other areas, or at least not with such enthusiasm despite all the options already available.

So is it a case of massive Not Invented Here syndrome, or is there an actual reason for creating your own? As someone who has both built and used a number of data pipeline managers, I see it as a mix of both.

In the era of Big Data, the data a company processes and uses in many ways defines the company. How much data is there? What format is it in? How often is it produced and refreshed? What technologies consume it, how, by whom, and how often? Is data pulled or pushed? On top of that you need to consider scalability, deployment, and a dozen other areas. The sheer number of factors is mind boggling, and a solution that covers even most use cases will end up inefficient and bloated because of that diversity.

Granted, that hasn’t exactly been a barrier for many other frameworks that have become popular and widespread. The key difference may be that writing a data pipeline manager for a single, specific use case is not actually that difficult. One good engineer could, in about a month, write a basic production-ready data manager in a single language that doesn’t need to scale beyond a single machine.
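
To make that claim a little more concrete, here is a minimal sketch of what such a single-machine manager might look like: tasks declare their dependencies and a topological sort decides the run order. The Task and Pipeline names are purely illustrative, not taken from any of the frameworks above, and real versions would still need scheduling, retries, and state persistence.

```python
# Minimal sketch of a single-machine pipeline manager (illustrative only):
# tasks declare dependencies, and Kahn's topological sort decides run order.
from collections import deque


class Task:
    def __init__(self, name, action, depends_on=()):
        self.name = name                    # unique task identifier
        self.action = action                # zero-argument callable doing the work
        self.depends_on = list(depends_on)  # names of upstream tasks


class Pipeline:
    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}

    def run(self):
        # Count unmet dependencies for each task.
        pending = {name: len(t.depends_on) for name, t in self.tasks.items()}
        ready = deque(name for name, n in pending.items() if n == 0)
        completed = set()

        while ready:
            name = ready.popleft()
            self.tasks[name].action()
            completed.add(name)
            # Unblock downstream tasks whose dependencies are now satisfied.
            for other in self.tasks.values():
                if name in other.depends_on:
                    pending[other.name] -= 1
                    if pending[other.name] == 0:
                        ready.append(other.name)

        if len(completed) != len(self.tasks):
            raise RuntimeError("cycle detected or missing dependency")


if __name__ == "__main__":
    pipeline = Pipeline([
        Task("extract", lambda: print("pulling raw data")),
        Task("transform", lambda: print("cleaning data"), depends_on=["extract"]),
        Task("load", lambda: print("writing to warehouse"), depends_on=["transform"]),
    ])
    pipeline.run()
```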

So rather than fighting an existing manager, it’s often more efficient to devote a few engineers to writing a specialized version that fits your needs, especially when, out of sheer frustration, your engineers have already written their own data pipeline manager on the side. As the company and its needs grow, more features are added. The data manager guides the company’s direction and the company guides the data manager, so the two stay in sync.

That said, this may all just be hot air and me merely justifying my own transgressions. I’ll let you decide.
