Mindful Machines Original Series, Big Data: Batch Storage

This is the first part of the Mindful Machines series on Big Data, focused on Batch Storage (aka: Big Data Cheat Sheet: Data Storage). In follow-on posts we’ll cover Batch Processing, Stream Processing, NoSQL and Infrastructure.

You’ve got a lot of data coming in and you want to store it somewhere for future analysis. Where do you put it all? In this post we go over the myriad options out there and how they stack up against each other. This isn’t a complete list of available technologies but rather the highlight reel that, among other things, explicitly avoids enterprise solutions while still covering PaaS.

Some good background reading for understanding distributed storage includes the CAP theorem and some of its limitations.


RDBMS

The traditional SQL database may seem an odd choice; however, in addition to simply scaling vertically, it can scale across multiple nodes with sharding and read replicas. In the following points I’m focusing more on these databases as analytical data stores (relatively few large queries) rather than as traditional transactional databases (massive numbers of relatively small queries).
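
To make the horizontal path concrete, here is a minimal sketch of client-side shard routing, assuming a hypothetical pair of MySQL shards that each have a read replica; the hostnames and hash scheme are purely illustrative:

```python
import hashlib

# Hypothetical topology: two shards, each with a primary and a read replica.
SHARDS = [
    {"primary": "mysql-shard0-primary:3306", "replica": "mysql-shard0-replica:3306"},
    {"primary": "mysql-shard1-primary:3306", "replica": "mysql-shard1-replica:3306"},
]

def shard_for(user_id: str) -> dict:
    """Pick a shard with a stable hash of the sharding key."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def host_for(user_id: str, write: bool = False) -> str:
    """Route writes to the shard primary and reads to its replica."""
    shard = shard_for(user_id)
    return shard["primary"] if write else shard["replica"]

print(host_for("user-42"))              # read  -> replica of user-42's shard
print(host_for("user-42", write=True))  # write -> primary of user-42's shard
```

Cross-shard queries and rebalancing are exactly where the manual setup mentioned below comes in.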

  • Overall:
    • Powerful ACID guarantees
    • Row level updates and inserts
    • Requires structured data, though some databases also support free-form JSON fields
    • Can scale to handle large data sizes
      • Vertically: Modern machines can be quite large so even a single machine can store significant data
      • Horizontally: Sharding is possible although it requires additional manual setup and potentially client-side logic changes (as sketched above)
    • There are tradeoffs as you scale (e.g., queries across partitions or complex queries)
      • Computing tied to storage system in terms of scaling
      • Multi-master or automatic failover setups can be tricky to get right so often a single point of failure exists
    • Used by Uber and Facebook to handle massive amounts of data
    • There are better purpose-built technologies if you truly need to scale big
  • MySQL:
    • Open source; PaaS and enterprise versions exist
    • Support for JSON data types
    • Recent support for window functions
    • Commercial support by Oracle (who owns MySQL), PaaS support by AWS
  • MariaDB:
    • Open source
    • Originally a fork of MySQL
    • Supports window functions
    • No JSON data type but native functions for working with JSON
    • Support for a columnar storage engine which significantly speeds up analytical workloads
    • Commercial support by MariaDB, PaaS support by AWS
  • PostgreSQL
    • Open source; PaaS and enterprise versions exist
    • Support for JSON data types (see the query sketch after this list)
    • Commercial support by various companies
    • Better parallelization of single queries than MySQL
    • Third party support for a columnar storage engine which significantly speeds up analytical workloads
    • Support for sharding via PL/Proxy
  • Amazon Aurora: Fully managed MySQL- and PostgreSQL-compatible databases on AWS
    • Proprietary PaaS
    • Automatically and seamlessly allocates storage
    • Data is replicated across and within availability zones
    • Claims improved performance compared to open source versions due to tight coupling with the SSD storage layer
      • PostgreSQL performance may be lower on Aurora
    • Lags behind open source in version support; Aurora MySQL 5.7 support came out over two years after MySQL 5.7
    • Does not support clustering beyond read replicas
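
For a taste of the JSON and window-function support called out above, here is a hedged sketch of an analytical query against PostgreSQL using psycopg2; the events table, its payload JSONB column and the connection string are all hypothetical:

```python
import psycopg2  # any PostgreSQL driver works; pip install psycopg2-binary

conn = psycopg2.connect("dbname=analytics user=analyst")
with conn, conn.cursor() as cur:
    # Running total per user, computed over values pulled out of a JSONB column.
    cur.execute("""
        SELECT payload->>'user_id'                           AS user_id,
               (payload->>'amount')::numeric                 AS amount,
               SUM((payload->>'amount')::numeric) OVER (
                   PARTITION BY payload->>'user_id' ORDER BY created_at
               )                                             AS running_total
        FROM events
    """)
    for row in cur.fetchmany(10):
        print(row)
conn.close()
```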

 

Object/File Storage

These are distributed, scalable ways of storing large amounts of bulk data such as historical logs or image files. These data stores can efficiently read out batches of data for further processing (via Spark, Presto, etc.) and so are capable of acting as a Data Warehouse backend.
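
As a minimal sketch of that pattern, here is how a Spark job might read batch data directly out of such a store; the paths, bucket and "date" column are hypothetical, and the S3 read assumes the S3A connector is configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-read").getOrCreate()

# The same DataFrame API works whether the files live in HDFS or S3.
logs_hdfs = spark.read.parquet("hdfs:///data/logs/2018/")
logs_s3 = spark.read.parquet("s3a://my-bucket/data/logs/2018/")

# Aggregate over a hypothetical "date" column and print the result.
daily = logs_s3.groupBy("date").count()
daily.show()
```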

  • Overall:
    • Efficient storage of structured, semi-structured and unstructured data
    • Not designed for individual row level reads and writes
    • Not optimized for storing small files/objects
    • Data processing systems (Spark, MapReduce, etc.) can connect to them
    • Computing can scale independently of storage
  • Apache HDFS: The Hadoop Distributed File System (HDFS) provides a distributed way of storing hundreds of petabytes of data
    • Open source; PaaS and enterprise versions exist
    • Written in Java
    • First release in 2006
    • Distributed, fault-tolerant and scalable
      • Data is stored multiple times across the cluster either as full copies or utilizing erasure coding
      • Multiple master nodes allow for seamless failover
    • Stores files in directories but not designed as a mountable file system
    • Various other projects have strong support for HDFS including HBase, Spark, Hive, Hadoop MapReduce, Presto and Flink
    • Even with modern tooling (Hortonworks, Cloudera) there is still a non-trivial amount of operational knowledge needed to run your own cluster
    • Not optimized for storing large numbers of smaller files (<64 MB) such as images
      • You can bundle them together into, for example, SequenceFiles
    • Based on the Google File System paper
    • If you’re running your own hardware or need the performance then it’s a solid choice but otherwise a cloud store like S3 makes more sense
    • Commercial support provided by Hortonworks and Cloudera
  • Amazon S3: A fully managed object/file storage platform provided by Amazon 
    • Proprietary PaaS
    • Distributed, highly available (99.99%) and fault-tolerant (11 9’s)
    • Fully managed so no configuration or manual scaling is necessary
    • Can emulate a file system, including listing objects/files in "directories" (technically just key prefixes; see the listing sketch after this list)
    • Can be considered an alternative to HDFS as many projects are able to query data stored in S3 (including MapReduce, Spark, Flink, Presto, etc.)
      • HBase can use S3 as a storage backend if using Amazon EMR
    • List operations can be slow and are only eventually consistent (i.e., they may return stale data)
      • Latest release of Hadoop includes experimental metadata caching support to work around this
    • Relatively low cost and little operational overhead
    • A solid choice for storing batch data if you’re in the Amazon ecosystem
  • Azure Blob Storage: An object/file storage platform akin to S3 provided by Azure 
    • Proprietary PaaS
    • Distributed, highly available (99.99%) and fault-tolerant (11 9’s or greater depending on replication configuration)
    • Fully managed so no configuration or manual scaling is necessary
    • Strongly consistent in list operations unlike S3
    • Can be considered an alternative to HDFS as many projects are able to query data stored in Azure Blob Storage (including MapReduce, Spark, Flink, Presto, etc.)
      • HBase can use Azure Blob Storage as a backend natively
    • Relatively low cost and little operational overhead
    • A solid choice for storing batch data if you’re in the Azure ecosystem
  • Google Cloud Storage: An object/file storage platform akin to S3 provided by Google 
    • Proprietary PaaS
    • Distributed, highly available (99.9% to 99.95%) and fault-tolerant (11 9’s)
    • Fully managed so no configuration or manual scaling is necessary
    • Strongly consistent in list operations unlike S3
    • Can be considered an alternative to HDFS as many projects are able to query data stored in Google Cloud Storage (including MapReduce, Spark, Flink, Presto, etc.)
      • HBase is not supported, Google instead prefers you use the HBase interface for BigTable
    • Relatively low cost and little operational overhead
    • A solid choice for storing batch data if you’re in the Google ecosystem
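
Going back to the S3 bullet about key prefixes, here is a short boto3 sketch that lists a "directory"; the bucket and prefix names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-bucket",
    Prefix="logs/2018/",  # acts like a directory path
    Delimiter="/",        # groups deeper prefixes like subdirectories
)
for prefix in resp.get("CommonPrefixes", []):
    print("dir: ", prefix["Prefix"])
for obj in resp.get("Contents", []):
    print("file:", obj["Key"], obj["Size"])
```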

 

Columnar NoSQL

Instead of storing data as rows, these databases store data as columns or groups of columns. This approach allows for much higher performance in cases where only a subset of the columns needs to be read for a given query.
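
The benefit is easy to demonstrate with a columnar file format such as Parquet, which exploits the same principle. A minimal sketch with pyarrow, using a hypothetical file and column names, reads only the columns it asks for:

```python
import pyarrow.parquet as pq

# Only the two requested columns are fetched; the file's other
# columns are never scanned from disk.
table = pq.read_table("events.parquet", columns=["user_id", "amount"])
print(table.num_rows, table.column_names)
```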

  • Overall:
    • Efficient storage of structured data
    • Allow for key-level writes and reads in addition to bulk reads and writes
    • Data processing systems (Spark, MapReduce, etc.) can connect to them and as a result computing can scale independently of storage
    • Can act as a regular NoSQL database
  • Apache Cassandra: A masterless database that avoids any single point of failure and aims for high availability. 
    • Open source but PaaS and enterprise versions exist
    • Written in Java
    • Initially developed by Facebook and publicly released in 2008
    • Availability over consistency (AP) in general
      • Supports tunable per-query consistency, though achieving the guarantees you expect takes rigorous testing (see the driver sketch after this list)
    • Uses a limited SQL syntax for queries (no joins)
    • Requires structured data whose schema is defined ahead of time
    • No external dependencies needed (like ZooKeeper) which makes deployment relatively easy
    • Overall good performance with an original emphasis on batch write performance
    • Supports secondary indexes
    • Used by Uber and Netflix
    • Based on the Dynamo paper from Amazon
    • First class support for Spark and lack of dependencies makes for a great Spark data storage backend
    • Commercial support provided by DataStax
  • Apache HBase: A strongly consistent (CP) database built on top of HDFS and ZooKeeper.
    • Open source but enterprise versions exist
    • Written in Java
    • First release in 2007
    • Consistency over availability (CP)
    • Native support for being a MapReduce data source; Spark support is available through a third party, with built-in support coming in HBase 3.0.0 (which may be a while, judging by the 2.0.0 releases)
    • Support for Coprocessors that allow for custom code to easily run on HBase servers
    • Read performance on par with or better than Cassandra, but slower write performance
    • Dependency on HDFS and ZooKeeper means that deployment is fairly involved if you haven’t already bought into the Hadoop ecosystem
    • Secondary indexes are possible (e.g., via coprocessors or Apache Phoenix) but not built in
    • Used by Facebook among others
    • Inspired by BigTable from Google 
    • Dependency on HDFS makes it harder to justify as a batch data store versus plain HDFS
    • Commercial support provided by Hortonworks and Cloudera
  • Google BigTable: A fully managed database that aims for high consistency
    • Proprietary PaaS
    • Consistency over availability (CP)
    • Fully managed
    • HBase is inspired by BigTable and BigTable provides an open source HBase compatibility layer
      • Third party libraries (Spark, MapReduce, etc.) that have HBase support generally also have BigTable support
      • No support for custom Coprocessors
    • No support for secondary indexes
  • Amazon DynamoDB: A fully managed database that aims for high availability
    • Proprietary PaaS
    • Availability over consistency (AP) in general but has support for tunable per-query consistency
    • Fully managed
    • You do not need to define a full schema ahead of time for each table, merely the index keys
    • Like Cassandra, it’s derived from Amazon’s Dynamo approach and so shares various similarities
    • There is also support for MapReduce and Spark operations against DynamoDB tables
    • Supports secondary indexes
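
To illustrate the tunable consistency mentioned under Cassandra above, here is a hedged sketch using the DataStax Python driver; the contact point, keyspace, table and query are hypothetical:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")

# ONE favors availability and latency; QUORUM trades some of that for
# stronger consistency -- chosen per statement, not per cluster.
fast_read = SimpleStatement(
    "SELECT * FROM events WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
safe_read = SimpleStatement(
    "SELECT * FROM events WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

print(session.execute(fast_read, ["user-42"]).one())
print(session.execute(safe_read, ["user-42"]).one())

cluster.shutdown()
```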

 

Data Warehouse

These are full-featured Data Warehouses that tie together data storage and data processing into a single entity. In the upcoming Batch Processing post in this series we’ll go more into the computing performance of these solutions, which is often a key reason for using them to store data.

  • Overall:
    • Low latency and high throughput query performance but not necessarily faster than other modern batch processing solutions
    • Optimized columnar data storage
    • Limits on flexibility (data types, UDFs, data processing approaches, etc.)
    • Lock-in if used as primary data store
  • Druid: Columnar data store designed to provide low-latency analytical queries
    • Open source
    • Written in Java
    • Open sourced in 2012
    • Provides sub-second analytical/OLAP queries
    • Supports real-time ingestion of data rather than just batch ingestion
    • Provides a limited subset of SQL (only joins of a large table to small lookup tables)
    • Seamless scaling of the cluster up/down independently of storage
    • Leverages “deep” storage such as S3 or HDFS to avoid data loss if nodes go down
    • Complicated infrastructure setup involving multiple types of nodes and distributed storage (S3, HDFS, etc.)
      • Several external dependencies (S3/HDFS, ZooKeeper, an RDBMS) increase operational overhead
    • Well suited for time series data
    • Used by Airbnb, eBay, Netflix, Walmart and others
    • A solid choice for a specialized analytical/OLAP system but otherwise other options are more flexible and lower overhead (at the cost of slower queries)
  • ClickHouse: Columnar data store designed to provide low-latency analytical queries and simplicity
    • Open Source
    • Written in C++
    • Open sourced in 2016 by Yandex
    • Significantly higher performance than Druid for some workloads
    • Less scalable than Druid or other approaches
    • Leverages ZooKeeper but can run as a single-node cluster without it
  • Amazon Redshift: A fully-managed data warehouse solution that lets you efficiently store and query data using a SQL syntax. 
    • Proprietary PaaS
    • General purpose analytical store that supports full SQL syntax
    • Loading/unloading data takes time (potentially hours); see the COPY sketch after this list
    • No real-time ingestion, only batch, although micro-batches can simulate real time
    • Need to explicitly scale the cluster up/down (with write downtime for the duration)
      • Storage and computing are tied together
    • Lack of complex data types such as arrays, structs, maps or native JSON
  • Google BigQuery: A fully-managed data warehouse solution that lets you efficiently store and query data using a SQL syntax.
    • Proprietary PaaS
    • General purpose analytical store that supports full SQL syntax
    • Real time ingestion support
    • Unlike Redshift it is serverless and you do not need to manage, scale or pay for a cluster yourself
    • Supports complex data types (arrays, structs) but not native JSON
  • Azure SQL Data Warehouse
    • Proprietary PaaS
    • General purpose analytical store that supports full SQL syntax
    • No real-time ingestion, only batch, although micro-batches can simulate real time
    • Computing nodes can be scaled independently of storage
      • Can pause computing resources if not using to save cost
      • Storage in Azure Blob Storage
    • Lack of complex data types such as arrays, structs, maps or native JSON
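
To make the Redshift loading bullet concrete: Redshift speaks the PostgreSQL wire protocol, so a bulk load is just a COPY statement issued through any Postgres driver. A hedged sketch, with the cluster endpoint, table, bucket and IAM role all hypothetical:

```python
import psycopg2  # Redshift is reachable with standard PostgreSQL drivers

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    # Bulk-load gzipped CSVs from S3 in parallel across the cluster.
    cur.execute("""
        COPY events
        FROM 's3://my-bucket/data/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
        CSV GZIP
    """)
conn.close()
```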

     

Be sure to tune in for the next installments in this series, which will cover Batch Processing, Stream Processing, NoSQL and Infrastructure. If you liked this post, be sure to reach out on Twitter, comment or share it.
