Unified Online Transactions + Analytics + Probabilistic Data Platform
What technical problems will SnappyData solve?
While we have seen a flurry of data management platforms to store and analyze unprecedented volumes of data, most of them focus on static/batch data analysis. A common pattern is to ingest data into HDFS and use a SQL-on-hadoop database to optimally run OLAP queries over this large data set. We have also witnessed the emergence of distributed in-memory data platforms for OLTP (Microsoft Hekaton, Oracle in-memory, GemFire, etc). Our experience in the big data industry has shown that customers either have, or want to combine OLAP and OLTP types of processing in their workloads.
This has resulted in several composite data architectures, exemplified in the lambda architecture, that require the stitching of streaming analytics (Apache Storm, etc) with a real-time in-memory OLTP store (high writes) and SQL-on-Hadoop. The stitching of all these different platforms is an exercise that can be hard, time consuming and expensive.
What we are finding is a growing number of cases where customers would like instantaneous analysis on streaming data, written to a data store that support high writes, point updates, point queries as well as analytic queries. Basically the new data is instantly available for analytics. By leveraging existing OSS work (especially Spark) and our own expertise, we intend to offer an open source, real-time, Always-on, operational data platform that fuses streaming analytics and in-memory data management with OLTP and OLAP capabilities into a single scale-out platform.
The promise of in-memory scale-out analytics is realized optimally through a columnar memory layout. Analytic class queries that require distributed shuffling of records, however, can take tens of seconds to minutes - hardly permitting interactive analytics. Additionally, some of these applications could have hundreds of concurrent users running such queries. We are faced with the problem of delivering google-like speeds for analytic queries with modest investments in cluster infrastructure and far less complexity than today. Snappy aims to fulfill this promise of true interactive speeds.
Our technical vision in brief:
- A spark based, in-memory unified cluster that provides streaming analytics, OLTP and OLAP data management (interactive queries, point updates) leading to dramatic reduction in complexity for both developers and Ops.
- Dramatic reduction in resource requirements (CPU, memory, network) through the use of probabilistic data: statistical techniques to maintain a small fraction of the data and respond to analytic questions at Google-like speeds.
Some use cases for SnappyData?
Dynamic Ad placement:
Millions to billions of user clicks and interactions are logged to HDFS and analyzed in a offline manner to produce millions of user profiles that are placed in memory. This data combined with other CRM data like ad campaigns and user demographic information is used in real time to place the appropriate ads.
A post trade surveillance application consumes trades throughout the day at a high and often bursty rate. The application uses past trades from individual traders and analyzes current and past market movement trends to flag suspicious or fraudulent trades. This involves high write rates and streaming analytics as well as execution of low latency analytic queries on current and historical data.
A telco network needs to find the 100 customers with most dropped calls along the time dimension or the top customers with dropped calls for some time frame and for each cell tower. These simple counts quickly become expensive as customers and calls grow. Using statistical techniques these count queries could be millions to even billions of summaries (counters) continuously updated in real time.
The SnappyData Architecture:
The core principle of our architecture is fusing the Spark runtime with the GemFire XD runtime. GemFire XD is an evolution of Pivotal’s GemFire, an in-memory data base (IMDB) that goes under the open source name of Apache Geode. GemFire XD arose from the customer need to use SQL to interact with the GemFire IMDB. In short, GemFire XD can be thought of as the GemFire (Geode) IMDB with an integrated SQL execution engine (GemFire XD will soon be open sourced). Why is GemFire XD a key component of SnappyData’s in-memory OLTP+OLAP vision? GemFire XD’s row oriented data store offers transactionally consistent point updates and executes point queries very efficiently (through in-memory indexes). Its eager and synchronous replication with failover offers much desired ‘Always-on’ capabilities. Its persistence model offers a recovery mechanism that maintains disk state without encumbering disk seek latencies. It’s easy to see how these benefits support SnappyData’s OLTP requirements and complement Spark’s OLAP features.
Spark SQL has made some important advances by managing immutable DataFrames in column compressed manner in off-heap memory. We intend to use Spark SQL as the extensible dialect for our platform. By extensible, we will offer several enhanced SQL capabilities to create and manage “Row” tables, support for Update/Insert/Delete statements(i.e. offer mutability with transactional semantics), and to manage synopses in tables (more on synopses later).
Client applications can either submit Spark programs or connect using JDBC/ODBC to run interactive queries or manipulate data. As the runtime is also the Spark runtime, all the native Spark features/APIs will naturally work - MapReduce, extensive transformation APIs, efficient processing of HDFS data, Machine Learning, Streaming operators, etc. In a sense, Spark becomes the new “stored procedure” language in the SnappyData platform.
For stream processing, we rely on the micro-batching approach in spark streaming to parallely process with high throughput. We extend Spark streaming in a few important ways:
- The users will be able to use SQL to analyze streams as ‘Continuous queries’. Streams can be accessed as a native Spark DStreams or as a tables permitting arbitrary joins with other streams, and, of course, reference data in Row tables or history in column tables.
- By deeply integrating Spark with our store, we will provide “exactly once” processing guarantees.
The figure above depicts how data might flow through the runtime as it goes from its raw input state through transformations and lands into the in-memory store.
An example of how SnappyData might be used
- Initially the cluster is bootstrapped from enterprise databases. Reference data like customer profiles, prediction models, etc are loaded. Typically a lot of this data is relatively small, is often mutated and can be managed as “Row” tables.
- While applications could be batch oriented (we support this through Spark) and load data using Spark’s connectors we think the Snappy platform will be most useful when the input is streaming in nature. The Streaming application itself could normalize, filter data into records and could be written in any of the languages Spark supports. Again, Spark is the SnappyData programming model.
- Streams will be parallely processed with partitioning features that will attempt to colocate streams with reference data by partitioning both the stream and reference data on the same key (for instance, accountID). Application programs can model streams as Tables and execute SQL queries on these incoming batches (which typically will be a window of time).
- The data generated from stream processing first arrives as rows and in a buffered state. The buffer is either processed by SnappyData’s Synopsis engine which may sample the buffer with intelligence so queries may be answered with sufficient accuracy. While one will be able to execute interactive queries on approximate data it can also be used for extremely fast streaming analytics (for instance, re-compute sensor measurement trends over time by joining to history).
- Incoming data can be preserved in an approximate form (stratified samples) or stored “exact” in column or row tables. As the data goes through stages (stream queue, then row buffer) we collect enough data before storing them in columnar tables. This staging strategy allows us to get good compression ratios in the column tables.
- Interactive queries can run against these “exact” in-memory tables or “approximate” tables depending on the tolerance for error.
Synopsis engine, approximate data, stratified sampling, statistical computing….
We’ve used these terms liberally and interchangeably up to now. As a topic, they could fill up many blogs; we’ve described our usage of these terms and how these techniques fit into the SnappyData platform in another blog post: Approximate is the New Precise.
What is the role of Spark in the solution? What specific areas of Spark is SnappyData dependent on?
SnappyData will leverage several components of the Spark stack. We leverage Spark SQL’s extensible catalyst engine for columnar data management and distributed query execution, the transformations and actions offered in core spark and Spark streaming, and the streaming engine with its time windowing capabilities, connectivity to data sources, and integration with scheduling engines.
How is SnappyData enhancing the functionality in spark?
It is important to note that Spark is a computational engine not a data store. We will fuse a distributed, in-memory, always-on data store in the Spark run-time making the combined offering look more like a modern in-memory OLTP+OLAP store.
Is the plan to open source the Snappy platform? if so, does it go back into Apache Spark?
We intend to contribute back our enhancements to Spark after discussions with the appropriate Spark component owners. Additionally, our current plan calls to offer a significant portion of the Snappy platform via a Apache v2 license.
We are working towards an early beta and expect to have code available for download in the December/Early 2016 timeframe. If you haven’t visited our landing page. go to www.snappydata.io and sign up to be notified when our beta is launched. You can also follow us on twitter, facebook & linkedin. We’re on irc.freenode.net in #snappydata as well.