SnappyData v0.1 Open Sourced and Available for Download

How does one eat an elephant? One bite at a time. We all know the saying. And today, it feels like we have already taken many bites towards realizing our vision. It has been a period of intense discovery and reinforcement of our convictions.

It is an honor to present the extraordinary work of around 25 engineers toiling day and night over the last 6-9 months to make this happen. As many of you know, we incubated SnappyData within Pivotal and are fortunate to have the support and insights of our ex-colleagues. We learnt a lot in the process - listening to customers, poring over AQP research and algorithms, digging deep into every aspect of the Apache Spark runtime, and learning from our AQP mentor, Barzan Mozafari. And finally, the subtle words of wisdom from Paul Maritz and his exhortation to stay focused on the end user guided our thinking.

If you would rather just download and run code than read my rant, here are the GitHub repo, the binary bits and the ‘quick start’. Don’t forget to star the repo if you like what you see :-)

Here is a technical paper that describes our proposition and how we deeply integrate with and expand Spark functionality for real-time analytics.

We are eager for constructive feedback, or anything else you wish to share on the subject. Check out our community channels, social media and all the different ways to contact us here: Community Channels

To get a sense for the magnitude of work that has gone into the project, here are some statistics:

- 1750 commits into the SnappyData repository with 23 active contributors

- Comprehensive functionality with docs, targeting pure SQL users as well as Spark programmers. You can get going without any Spark knowledge at all.

- About 100 commits into SnappyStore, which is a fork of GemFire XD (which the same team built at VMware and Pivotal), a codebase with over 20,000 commits and 50 contributors.

- 6 meetups across the US and Asia.

- Our technical paper completed.

- Interest and active engagements with accounts in finance, telco, DoD, travel.

- Over 500 folks registered for a beta invite.

We’re looking for more committers. To find out more about contributing to SnappyData, click here.

We’ll be exhibiting at Spark Summit East February 17-18 as well as hosting a meetup on Monday that same week. Come by and chat with us!

Now, spare me a few minutes so I can explain what we have delivered and what remains to be done.

Why SnappyData?

In recent years, our customers have expressed frustration with the traditional approach of using a combination of disparate products to handle their streaming, transactional and analytical needs. The common practice of stitching heterogeneous environments together in custom ways has caused enormous production woes by increasing development complexity and total cost of ownership.

With SnappyData, an open source platform, we propose a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics).

Moreover, we find that even in-memory solutions are often incapable of delivering truly interactive analytics (i.e., a couple of seconds), when faced with large data volumes or high velocity streams. SnappyData therefore combines state-of-the-art approximate query processing techniques and a variety of data synopses to ensure interactive analytics over both streaming and stored data. Through a novel concept of high-level accuracy contracts (HAC), SnappyData is the first to offer end users an intuitive means for expressing their accuracy requirements without overwhelming them with statistical concepts.
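To make the idea concrete, here is a rough sketch of what an accuracy-contracted query could look like from a Spark program. Treat the details as illustrative: the snc handle, the airline table, and the exact spellings of the sample-table options and the WITH ERROR clause are examples rather than the definitive syntax - the docs and the quick start have the supported forms.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SnappyContext

// Illustrative sketch only - table names, option keys and clause spellings
// are examples, not the definitive syntax; see the docs / quick start.
val sc  = SparkContext.getOrCreate()   // or the sc already provided by the shell
val snc = SnappyContext(sc)            // SnappyData's extension of Spark's SQLContext

// Build a stratified sample over an existing 'airline' table (assumed to be
// loaded already) so the engine has a synopsis to answer approximate queries.
snc.sql("""
  CREATE SAMPLE TABLE airline_sample ON airline
  OPTIONS (qcs 'carrier', fraction '0.01')
""")

// Ask for an answer within 10% relative error at 95% confidence. The
// accuracy contract decides whether the sample suffices or whether the
// query must fall back to the full data set.
snc.sql("""
  SELECT carrier, avg(arrDelay) AS avgDelay
  FROM airline
  GROUP BY carrier
  WITH ERROR 0.1 CONFIDENCE 0.95
""").show()
```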

We believe the demand for mixed workloads, especially given the advent of IoT, will continue to grow. This has given rise to several composite data architectures, exemplified by the “lambda” architecture, that require multiple solutions to be stitched together - an exercise that can be hard, time consuming and expensive.

For instance, in capital markets, a real-time market surveillance application has to stream in trades at very high rates and detect abusive trading patterns (e.g., insider trading). This requires correlating large volumes of data by joining a stream with historical records, other streams, and financial reference data (which may change throughout the trading day). A triggered alert could in turn result in additional analytical queries, which need to run on both the ingested and historical data. Trades arrive on a message bus (e.g., Tibco, IBM MQ, Kafka) and are processed using a stream processor (e.g., Storm) or a homegrown application, writing state to a key-value store (e.g., Cassandra) or an in-memory store (e.g., Redis). This data is also stored in HDFS and analyzed periodically using SQL-on-Hadoop OLAP engines.
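To ground the pattern, here is what the core stream-to-reference correlation might look like as a plain Spark program (nothing SnappyData-specific). The topic, columns and the dailyNotionalLimit threshold are invented for illustration, the instrument_reference table is assumed to exist, and the Kafka source assumes the Spark-Kafka connector package is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

// Sketch: join a live trade stream with reference data and flag outsized trades.
object SurveillanceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("surveillance-sketch").getOrCreate()
    import spark.implicits._

    // Reference data (e.g. per-symbol limits) that may be updated during the trading day.
    val reference = spark.table("instrument_reference")   // columns: symbol, dailyNotionalLimit, ...

    // Trades arriving over a message bus, parsed from CSV-style messages.
    val trades = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "trades")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")
      .select(
        split($"line", ",").getItem(0).as("symbol"),
        split($"line", ",").getItem(1).cast("double").as("qty"),
        split($"line", ",").getItem(2).cast("double").as("price"))

    // Stream-to-static join plus a simple abuse heuristic.
    val alerts = trades
      .join(reference, "symbol")
      .where($"qty" * $"price" > $"dailyNotionalLimit")

    alerts.writeStream.format("console").start().awaitTermination()
  }
}
```

Note that the sketch only covers the correlation step; in the architecture described above, the key-value store, the HDFS copy and the SQL-on-Hadoop engine remain separate systems to deploy and keep in sync.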

This heterogeneous architecture, which is far too common among our customers, has several drawbacks that significantly increase the total cost of ownership for these companies. We are attempting to significantly improve how the data gets processed and analyzed in the real time layer of such architectures.

So, how are we solving it?

"Spark inside an in-memory Hybrid DB"

We rely on Spark as the computational engine for OLAP as well as the programming foundation, with its rich and ever-expanding API and ecosystem - not only can you run a MapReduce job, but, more importantly, you can combine stream processing, rich analytic processing and SQL through a single unified, concise API. We primarily extended its DataFrame/SQL engine so that Tables (or DataFrames) can seamlessly be made persistent, support mutations and transactions (OLTP), speak a more compliant SQL, and offer high availability (not just recovery based on replaying data from the source). And more.
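To give a flavor of what that looks like in practice, here is a minimal sketch, reusing the snc handle from the earlier example. The row provider and the persistence option spelling are illustrative; the getting started examples show the exact forms.

```scala
// Illustrative sketch, reusing the SnappyContext handle (snc) from above.
// A mutable, persistent row table managed by the store, not just a cached RDD.
snc.sql("""
  CREATE TABLE positions (
    account VARCHAR(32),
    symbol  VARCHAR(16),
    qty     INT
  ) USING row OPTIONS (PERSISTENT 'sync')
""")

// OLTP-style mutations expressed as plain SQL ...
snc.sql("INSERT INTO positions VALUES ('ACCT-1', 'IBM', 100)")
snc.sql("UPDATE positions SET qty = qty + 50 WHERE account = 'ACCT-1' AND symbol = 'IBM'")

// ... while the same table is visible to Spark as a DataFrame for analytics.
snc.table("positions").groupBy("symbol").sum("qty").show()
```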

Well, put more simply, we integrated Spark with a highly concurrent, low latency in-memory database.

In reality, a distributed database of this scale would take years for a highly qualified team to build and mature. So, what did we do? Some of you guessed it - we integrated GemFire XD deeply with Spark’s execution engine (turns out Spark’s Catalyst design is quite extensible).

Now, if you are among those who closely watch products in the “Big Deluge” (aka Big Data), you will notice that many data products claim integration with Spark - store and retrieve RDDs. This is likely a good approach for batch analytics, but it can still be quite difficult in real-time scenarios where apps have to interface with multiple products: ingest in one, mutate in another, analyze in a third, etc. It is inefficient, costly to develop and expensive to debug.

What we did could be controversial - everything runs in the same cluster. No two (maybe three) products to install, configure, tune or debug. There is no impedance mismatch in the programming paradigms - you pretty much just run Spark code or plain SQL.

Moreover, with tighter control over where data gets placed, we can dramatically cut down on the single most expensive aspect of distributed data analytics - shuffling data around on the network, serializing and copying it. For instance, an incoming stream can be partitioned based on where its related data is stored in the cluster.
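As a sketch of what that control looks like (again reusing snc, with illustrative option spellings such as PARTITION_BY and COLOCATE_WITH), related tables can be partitioned on the same key and colocated so that joins on that key stay node-local:

```scala
// Illustrative sketch only: option names and providers are examples;
// check the docs for the exact spellings in your release.
snc.sql("""
  CREATE TABLE trades (
    symbol VARCHAR(16), qty INT, price DOUBLE
  ) USING column OPTIONS (PARTITION_BY 'symbol')
""")

// Reference data partitioned on the same key and colocated with trades,
// so matching partitions of both tables live on the same nodes.
snc.sql("""
  CREATE TABLE instruments (
    symbol VARCHAR(16), sector VARCHAR(32)
  ) USING column OPTIONS (PARTITION_BY 'symbol', COLOCATE_WITH 'trades')
""")

// A join on the partitioning key can then be executed without shuffling
// rows across the network.
snc.sql("""
  SELECT i.sector, sum(t.qty * t.price) AS notional
  FROM trades t JOIN instruments i ON t.symbol = i.symbol
  GROUP BY i.sector
""").show()
```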

To make Spark operate in a highly concurrent environment, we had to orchestrate when client tasks are sent to a scheduler versus executed directly, make sure the Spark driver is fully HA, and, most importantly, allow the Spark execution cluster to be shared across users and apps. What we do is more in line with what you would expect from a database: data, and the operations on that data, do not require marshaling the in-memory state into a different cluster. The graphic below depicts the runtime architecture differences.

While we promote a collocated design for real-time apps, we also permit separating the compute and storage nodes, all within the same cluster.

Here is our paper, which explains the architecture in much more detail. Also note our benchmarks, where we compared SnappyData’s performance to native Spark caching using TPC-H, and also ran the YCSB benchmark on the same cluster against a commercial in-memory SQL product.

I invite you to download our initial offering and go through our getting started examples, available both as SQL and as Spark programs.

Be gentle. It is still just our first release.

The Spark Database

SnappyData is Spark 2.0 compatible and open source. Download now