Mindful Machines Original Series, Big Data: Batch Processing

This is the second part of the Mindful Machines series on Big Data (aka the Big Data Cheat Sheet). In the previous post we covered Batch Storage; in upcoming posts we'll cover Stream Processing, NoSQL and Infrastructure.

Your historical data is piling up and you want to do something with it. What do you choose to process it: Presto? Spark? Redshift? MapReduce? In this post we go over the myriad options out there and how they stack up against each other. This isn't a complete list of available technologies but rather the highlight reel; among other things, it deliberately skips enterprise solutions, although it does cover PaaS offerings.


Programmatic Batch Processing

These systems provide a programmatic (Java, Scala, Python, etc.) interface for querying data stored in batch storage systems (HDFS, S3, Cassandra, HBase, etc.).

  • Overall
    • Provide a flexible interface for querying data
    • Schemas need to be managed manually or loaded from files
    • Modern systems provide high-level APIs that allow for whole-query optimizations
  • Apache Hadoop MapReduce: A cornerstone of the Big Data ecosystem which provides a way to efficiently process petabytes of data
    • Open source but PaaS and enterprise versions exist
    • Written in Java
    • Released in 2006
    • Implementation of Google’s MapReduce paper
    • Provides an efficient and, compared to traditional cluster computing approaches, easy-to-implement way to query large amounts of data across multiple machines
    • Writing raw Java MapReduce code is relatively complicated
    • Google has not used MapReduce as its primary big data processing model since 2014, and newer technologies are unseating it in the open source world (see the other entries)
    • Requires a Hadoop (YARN) cluster which introduces operational overhead
    • Commercial support provided by Hortonworks and Cloudera
  • Cascading/Scalding: Java and Scala frameworks, respectively, that abstract away the complexity of writing MapReduce code
    • Open source
    • Written in Java/Scala
    • Significantly lowers the overhead of writing MapReduce code
    • Can leverage Tez or Flink to significantly improve performance
  • Apache Spark: A highly popular cluster computing framework based on in-memory storage of intermediate data
    • Open source but PaaS and SaaS versions exist
    • Written in Scala
    • Started in 2009, described in a paper published in 2010
    • Shines in providing a mostly unified API across Python, Scala, Java, R and SQL that lets you mix native code with optimized built-in operations (see the PySpark sketch after this list)
    • Supports a streaming paradigm on top of its batch processing engine
    • Provides a built-in machine learning library (MLlib and ML)
    • Has a large number of configurable settings and requires tuning to get good performance
    • Spark's biggest code contributor and commercial backer (Databricks) markets how much faster its proprietary PaaS version is than the open source version, which creates skewed incentives for them
    • PaaS solutions provided by Amazon EMR, Azure HDInsight and Google Dataproc
    • SaaS solution provided by Databricks Unified Analytics Platform
    • Commercial support provided by Hortonworks and Cloudera
  • Apache Flink: A cluster computing framework that aims to improve on Spark
    • Open source but PaaS versions exist
    • Written in Java and Scala
    • Released in 2013
    • Provides an API across Python, Scala, Java and SQL
    • Supports a batch paradigm on top of its stream processing engine
    • Less configuration overhead than Spark
    • Provides a built-in machine learning library, FlinkML, but it's less comprehensive and less performant than Spark's
    • A newer project that shows a lot of promise, but Spark has added significant performance and feature improvements in recent versions that have likely more than closed the gap
    • PaaS solutions provided by Amazon EMR and Google Dataproc
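
To make the "mostly unified API" point above concrete, here is a minimal PySpark sketch that mixes the DataFrame API with SQL in a single job. It is only a sketch: the S3 path, event schema and column names are hypothetical.

```python
# Minimal PySpark sketch mixing the DataFrame API with SQL in one job.
# The input path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-example").getOrCreate()

# Load a (hypothetical) day of JSON event logs from batch storage.
events = spark.read.json("s3://my-bucket/events/2018-01-01/")

# Native DataFrame API: filter and aggregate using optimized built-ins.
clicks_per_user = (events
                   .filter(F.col("event_type") == "click")
                   .groupBy("user_id")
                   .count()
                   .withColumnRenamed("count", "clicks"))

# The same intermediate result is queryable with SQL via a temp view.
clicks_per_user.createOrReplaceTempView("clicks_per_user")
top_users = spark.sql(
    "SELECT user_id, clicks FROM clicks_per_user ORDER BY clicks DESC LIMIT 10")

top_users.show()
spark.stop()
```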

 

SQL Batch Processing

These frameworks provide a SQL interface for querying data stored in HDFS or other blob storage systems (S3, etc.) in a distributed fashion.

  • Overall
    • Provide a centralized schema repository
    • Allow for whole query optimizations but restrict you to only using SQL and potentially custom UDFs
    • Most require a traditional SQL server to host table metadata
  • Apache Hive: A SQL layer originally on top of Hadoop MapReduce and now on top of YARN
    • Open source but PaaS versions exist
    • Written in Java
    • Released by Facebook in 2009
    • Custom UDFs, in Java, can be difficult and time consuming to write
    • More optimized for complex analytical queries
    • The newest versions partially bypass MapReduce and run a daemon on individual nodes (LLAP) to further optimize performance.
      • As a result, newer versions stack up quite well performance-wise against Presto and Spark
    • PaaS solutions provided by Amazon EMR, Azure HDInsight and Google Dataproc
    • Commercial support provided by Hortonworks and Cloudera
  • Apache Spark SQL: A SQL computing layer that is built on top of Spark
    • Open source
    • Written in Scala
    • Started as Shark in 2010
    • More optimized for complex analytical queries
    • Custom UDFs are easy to write in Scala, Python, Java or R (see the PySpark sketch after this list)
    • Requires a Spark cluster, which can be difficult to tune
    • PaaS solutions provided by Amazon EMR, Azure HDInsight and Google Dataproc
    • Commercial support provided by Hortonworks and Cloudera
  • Apache Flink SQL: A SQL computing layer that is built on top of Flink
    • Open source
    • Written in Java
    • Released in 2016
    • Requires a Flink cluster
    • Custom UDFs are easy to write in Scala or Java
    • Hard to find performance numbers comparing it to Spark SQL
    • PaaS solutions provided by Amazon EMR and Google Dataproc
  • Presto: A SQL computing layer optimized for massive datasets
    • Open source
    • Written in Java
    • Released in 2013 by Facebook
    • More optimized for many smaller OLAP queries
    • Support for custom Java UDFs
    • Requires tuning to get good performance
    • Provides comparable performance to Redshift
    • Performance improvements compared to Spark, although results may differ against Databricks' SaaS Spark
    • Used by Facebook to query their 300PB data warehouse
    • PaaS version in Amazon Athena
  • Apache Impala: A SQL computing layer released by Cloudera based on Google’s Dremel
    • Open source
    • Released in 2012 by Cloudera
    • Written in C++
    • Support for custom UDFs in C++ and Java (but Java is slower)
    • Based on the Dremel paper by Google
    • More optimized for many smaller OLAP queries
    • Commercial support provided by Cloudera
  • Amazon Redshift Spectrum: A version of Redshift's query engine that runs directly against data in S3
    • Proprietary PaaS
    • Unlike Redshift, it can scale computing independently of storage and can access arbitrary file formats stored in S3
    • Limited support for custom UDFs in Python
    • Can leverage Redshift for table metadata
    • Requires a running Redshift cluster
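
One reason UDF ergonomics matter in this space: Hive UDFs mean writing and deploying Java, while Spark SQL accepts a plain Python function. Here is a minimal sketch; the users table and its email column are made up.

```python
# Minimal sketch of a Python UDF in Spark SQL.
# The "users" table and "email" column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

def domain_of(email):
    """Extract the domain part of an email address."""
    return email.split("@")[-1] if email else None

# A plain Python function becomes callable from SQL in one line.
spark.udf.register("domain_of", domain_of, StringType())

spark.sql("""
    SELECT domain_of(email) AS domain, COUNT(*) AS n
    FROM users
    GROUP BY domain_of(email)
    ORDER BY n DESC
""").show()
```

Note that Python UDFs pay a serialization cost crossing the JVM/Python boundary, so built-in SQL functions are still preferred when they exist.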

 

Data Warehouse

These are full featured Data Warehouses that tie together the data storage and data processing into a single entity.

  • Overall
    • Low latency and high throughput query performance but not necessarily faster than other modern batch processing solutions
    • Columnar data storage
    • Limits on flexibility (data types, UDFs, data processing approaches, etc.)
    • Lock-in if used as primary data store
    • Computing tied to storage system in terms of scaling
  • Druid: Columnar data store designed to provide low-latency analytical queries
    • Open source
    • Written in Java
    • Open sourced in 2012
    • Provides sub-second analytical/OLAP queries
    • Supports real-time ingestion of data rather than just batch ingestion
    • Supports only a limited subset of SQL (e.g. joins only between a large table and smaller tables)
    • Custom UDF support exists in Java but is complicated
    • Seamless scaling of the cluster up/down independently of storage
    • Leverages “deep” storage such as S3 or HDFS to avoid data loss if nodes go down
    • Complicated infrastructure setup involving multiple types of nodes and distributed storage (S3, HDFS, etc.)
      • A number of external dependencies (S3/HDFS, ZooKeeper, an RDBMS), which increases operational overhead
    • Well suited for time series data
    • Used by Airbnb, eBay, Netflix, Walmart and others
  • ClickHouse: Columnar data store designed to provide low-latency analytical queries and simplicity
    • Open Source
    • Written in C++
    • Open sourced in 2016 by Yandex
    • No support for custom UDFs
    • Significantly higher performance than Druid for some workloads
    • Less scalable than Druid or other approaches
    • Leverages ZooKeeper, though a single-node cluster can run without it
  • Amazon Redshift: A fully-managed data warehouse solution that lets you efficiently store and query data using a SQL syntax (see the Python sketch after this list).
    • Proprietary PaaS
    • General purpose analytical store that supports full SQL syntax
    • Limited support for custom UDFs in Python
    • Loading/unloading data takes time (hours potentially)
    • No real time ingestion, only batch, although micro-batches can simulate real-time
    • Need to explicitly scale the cluster up/down (with write downtime for the duration)
      • Storage and computing are tied together
    • Lacks complex data types such as arrays, structs, maps or native JSON
  • Google BigQuery: A fully-managed data warehouse solution that lets you efficiently store and query data using a SQL syntax.
    • Proprietary PaaS
    • General purpose analytical store that supports full SQL syntax
    • Real time ingestion support
    • Limited support for custom UDFs in Javascript
    • Faster queries than Redshift, but more expensive
    • Unlike Redshift it is serverless and you do not need to manage, scale or pay for a cluster yourself
    • Supports complex data types (arrays, structs) but not native JSON
  • Azure SQL Data Warehouse: A fully-managed data warehouse solution that lets you scale computing independently of storage
    • Proprietary PaaS
    • General purpose analytical store that supports full SQL syntax
    • No real time ingestion, only batch, although micro-batches can simulate real-time
    • No real support for custom UDFs (only ones written in SQL)
    • Performance may not match Redshift's
    • Computing nodes can be scaled independently of storage
    • Lacks complex data types such as arrays, structs, maps or native JSON
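
Because Redshift is wire-compatible with PostgreSQL, querying it from Python needs no warehouse-specific client. Here is a minimal sketch using the psycopg2 driver; the connection details and the page_views table are placeholders.

```python
# Minimal sketch: querying Redshift from Python. Redshift speaks the
# PostgreSQL wire protocol, so a standard Postgres driver works.
# All connection details and the table below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    dbname="analytics",
    user="analyst",
    password="...",
)

with conn.cursor() as cur:
    # A typical analytical query against a hypothetical fact table.
    cur.execute("""
        SELECT DATE_TRUNC('day', created_at) AS day, COUNT(*) AS views
        FROM page_views
        GROUP BY 1
        ORDER BY 1
    """)
    for day, views in cur.fetchall():
        print(day, views)

conn.close()
```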

 

RDBMS

The traditional SQL database may seem an odd choice; however, in addition to simply scaling vertically, with sharding and read-replicas it can scale across multiple nodes. In the following points I'm focusing on these databases as analytical data stores (relatively few large queries) rather than as traditional databases (massive numbers of relatively small queries).
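
To make the client-side part of sharding concrete, below is a minimal sketch of hash-based shard routing. The shard DSNs and the choice of sharding key are hypothetical, and a real deployment would also need to handle resharding, cross-shard queries and failover.

```python
# Minimal sketch of client-side hash sharding across several database
# instances. The DSNs are placeholders; real setups also need a story for
# resharding, cross-shard queries and failover.
import hashlib

SHARDS = [
    "postgresql://app@shard0.internal/app",
    "postgresql://app@shard1.internal/app",
    "postgresql://app@shard2.internal/app",
]

def shard_for(user_id: str) -> str:
    """Route all of a user's rows to one shard by hashing the shard key."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every read and write for this user targets the same shard.
print(shard_for("user-42"))
```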

  • Overall
    • Powerful ACID guarantees
    • Row level updates and inserts
    • Requires structured data, although some databases also support free-form JSON fields
    • Can scale to handle large data sizes
      • Vertically: Modern machines can be quite large so even a single machine can store significant data
      • Horizontally: Sharding is possible although it requires additional manual setup and potentially client logic changes
    • There are tradeoffs as you scale (e.g. queries across partitions or complex queries)
      • Computing tied to storage system in terms of scaling
      • Multi-master or automatic failover setups can be tricky to get right so often a single point of failure exists
    • Used by Uber and Facebook to handle massive amounts of data
    • There are better purpose-built technologies if you truly need to scale big
  • MySQL
    • Open source; PaaS and enterprise versions exist
    • Support for JSON data types
    • Recent support for window functions (as of MySQL 8.0; see the sketch after this list)
    • Commercial support by Oracle (who owns MySQL), PaaS support by AWS
  • MariaDB
    • Open source
    • Originally a fork of MySQL
    • Supports window functions
    • No JSON data type but native functions for working with JSON
    • Support for a columnar storage engine which significantly speeds up analytical workloads
    • Commercial support by MariaDB, PaaS support by AWS
  • PostgreSQL
    • Open source; PaaS and enterprise versions exist
    • Support for JSON data types
    • Commercial support by various companies
    • Better parallelization of single queries than MySQL
    • Third party support for columnar storage engine which significantly speeds up analytical workloads
    • Support for sharding via PL/Proxy
  • Amazon Aurora: Fully managed MySQL- and PostgreSQL-compatible databases on AWS
    • Proprietary PaaS
    • Automatically and seamlessly allocates storage
    • Data is replicated across and within availability zones
    • Claims improved performance compared to open source versions due to tight coupling with the SSD storage layer
      • PostgreSQL performance may be lower on Aurora
    • Lags behind open source in version support; Aurora's MySQL 5.7 support arrived over two years after MySQL 5.7 was released
    • Does not support clustering beyond read replicas
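
To show the shape of the window functions called out above without standing up a MySQL or MariaDB server, here is a runnable sketch using Python's bundled sqlite3 module as a stand-in (SQLite added window function support in 3.25); the orders table is made up, and the same SQL shape runs on MySQL 8.0 and recent MariaDB.

```python
# Runnable sketch of a window function, the feature called out above for
# MySQL 8.0 and MariaDB. sqlite3 is used purely as a stand-in (it needs
# SQLite >= 3.25); the SQL itself has the same shape on MySQL/MariaDB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 10), ('alice', 30), ('bob', 5), ('bob', 25), ('bob', 15);
""")

# Running total per customer: the kind of analytical query that required
# awkward self-joins before window function support landed.
rows = conn.execute("""
    SELECT customer, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running
    FROM orders
""").fetchall()

for customer, amount, running in rows:
    print(customer, amount, running)
```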

 

Be sure to check back for the next installments in this series, which will cover Stream Processing, NoSQL and Infrastructure. If you liked this post, reach out on Twitter, leave a comment or share it.
