Making Apache Spark the most versatile, fast data platform ever
We released SnappyData 1.0 GA version this week. We couldn't be more proud of our team for their effort, dedication and commitment. It has been a long, enduring test for most in the team having worked on distributed in-memory computing for so many years going back to our days in VMWare-Pivotal. Just in the last year, the team closed about 1000 JIRAs, improved performance 5-10 fold while supporting several customers and the community. The project roughly added 200K source lines and another 70K lines of test code.
In this post, I will focus on the broader audience still trying to grasp SnappyData and its positioning with respect to Spark and other fast data systems. I am hoping you will be intrigued enough to give SnappyData a try and star our Git repository.
While Apache Spark is a general purpose engine for both real-time and batch big data processing, its core is designed for high throughput and batch processing. It cannot process events at a time, do point reads/writes or manage mutable state. Spark has to rely on an external store for sharing data, low latency access, updates and high concurrency. And, when coupled with external stores its in-memory performance is heavily impacted by the frequent data transfers required from such stores into Spark’s memory. The schematic below captures this challenge.
Schematic 1: Spark’s runtime architecture
Schematic 2: Spark enhanced with Hybrid (Row + Column) store
SnappyData's mission, at its core, is to extend Spark so it also becomes the best data platform for real time computing.
SnappyData adds mutability with transactional consistency into Spark, permits data sharing across applications and allows a mix of low latency operations (e.g. a KV read/write operation) with high latency ones (a expensive aggregation query or ML training job).
SnappyData introduces hybrid in-memory data management into Spark. SnappyData's Spark++ clusters can now analyze streams, manage transactional and historical state for advanced insight using Spark as its programming model. The single cluster strategy for compute and data provides the best possible performance while avoiding expensive stitching of complex distributed systems (the norm today). The schematic above depicts this fusion of Spark with a hybrid in-memory DB. Applications can submit Spark jobs to be executed inside the cluster achieving up to 3 orders of magnitude performance gains compared to a design where Spark is purely a computational layer. SnappyData also provides a very high performance connector for Spark so any Spark cluster can connect to SnappyData as a store. In this “smart connector” mode, unlike other connectors, SnappyData uses the same columnar format as Spark for data movement along with several other optimizations making it significantly faster than every store we have tried in the market today.
The details of how Snappydata is architected is described in this CIDR paper. The schematic below captures its key components and the eco-system.
Snappydata core, Spark Facade and eco-system. The components in the center and highlighted in light green are some of the SnappyData extensions into the Spark runtime.
The following sections try to reason why we think SnappyData is the first platform that offers the speed, scale and versatility required to deliver true interactive analytics on live, big data sets.
Deep insight on Live data
We have seen a flurry of stream processing engines (Google data flow, Flink, Storm, IBM Infosphere to name a few) all aimed at capturing perishable insight. Insights obtained as events happen and, if not acted upon immediately, lose their value.
They all support programming constructs that can use either custom application logic or SQL to detect a condition or pattern within the stream. For instance, finding the most popular products selling now, top-K URLs, etc can all be continuously captured as KPIs and made accessible to any application. For deeper insight, you often need to correlate current patterns to historical patterns or relate current patterns to other contextual data sets (e.g. Are sudden changes in temperature and pressure correlated to previous known patterns in a manufacturing device? Did the environment play a role?). Such insight, once presented to users, solicits other, deeper questions.
Often this requires large scale distributed joins, integration with disparate data sources, evaluating on incremental training of ML/DL models, and even permitting instant visualization/data exploration tools that pose ad-hoc questions on all of this data. While some of the existing tools would permit joining a stream to related data sets, what we find is that these related data sets are managed in enterprise repositories that are themselves large, diverse (NoSQL, SQL, text, etc) and also constantly changing. Imagine a telco operator placing location sensitive offers/ads on mobile devices that require offers and subscriber profiles from CRM systems or from partner systems. You have to combine a high velocity CDR (call data record) stream with live data sets that resides in CRM systems.
Trying to execute a real time join with CRM systems in real time is not possible. What you need is a engine that supports replicating changes in the CRM system into a system that also manages the stream state (CDC). Moreover, this dependent state itself can be large. Most of the current solutions fall short. Streaming systems focus on how to manage streams and offer primitive state management.
True analytics on live data requires a different design center that can consume any "live" data set in the enterprise, not just incoming streams.
SnappyData aims to offer true deep insight on any live data - event streams (e.g. sensor streams), trapping continuous changes in enterprise databases (e.g. MySQL, Cassandra, etc), historical data in-memory or even data sets in remote sources. For instance, you can run a continuous or interactive query that combines windowed streams, reference data and even large data sets in S3/HDFS. You can even use probabilistic data structures to condense extremely large data sets into main-memory and make instant decisions using Approximate query processing. The SnappyData design center is more like a highly scalable MPP database that runs in-memory and offers streaming support. Scalable, high performance state management with native streaming support.
Figure 4 shows what the current “state-of-the-art” is for a streaming system. Figure 5 depicts what a SnappyData-enabled system might look like.
Figure 4: Challenging to run complex analytics with Streaming systems
Figure 5: SnappyData’s architecture for LIVE analytics
Don't all modern Business Intelligence tools support "Live" analytics?
While there are several BI tools in the market that support live analytics by connecting directly to the source, most don't scale or perform. The prolific response in the BI tools community has been to pre-aggregate or generate multi-dimensional cubes, cache these in-memory and allow the BI visualizations to be driven from this cache. Unfortunately, this doesn't work for two reasons:
- These caches are read-only, take time to construct and don't provide access to the live data we expect
- Increasingly, analytics requires working with many data sources, fluid NoSQL data and too many dimensions. It is far too complex and time consuming to generate multi-dimensional cubes.
Figure 6 captures the challenges in existing tools for business intelligence.
Figure 6: Challenges in business intelligence tools
SnappyData manages data in distributed memory offering extreme performance through columnar data management, code generation, vectorization and statistical data structures. And, natively supports all the Spark data types: nested objects, JSON, text, and of course, structured Spark SQL types.
The net effect is to enable access to live data in streams, tables and external data sources to any modern BI tool. Our goal is to offer interactive analytic performance even for live big data across many concurrent users.
Why not Spark itself?
If you are Spark versed you might be wondering why this isn't necessarily possible in Spark. All in-memory state in Spark is immutable thereby requiring applications to relegate mutating state to external data stores like Cassandra. All analytical queries require repeated copying and even complex deserialization making analytical queries very expensive to execute. Furthermore, all queries in Spark are scheduled as jobs, often consuming all available CPU executing one query at a time, and hence offering low concurrency in query processing. In analytics, workloads are often a mix of expensive scans/aggregations or drill down questions that look for pinpointed data sets. Unlike Spark, SnappyData distinguishes low latency queries from high latency ones and ensures that application requests are handled with high concurrency.
Working with heterogeneous data
Live data often arrives in a range of formats - text, XML, JSON, custom objects in streams. Data can be self-describing, nested, composed as a graph and not compliant to a pre-defined schema.
SnappyData capitalizes on Spark’s ability to connect to virtually any data source and infer its schema. Here is a code snippet to read a collection of JSON documents from Mongodb and store in memory as a column table in SnappyData. Note that there was no need to specify the schema for the table. Internally, each JSON document in the collection is introspected for its schema. All these individual schema structures are merged to produce the final schema for the table.
Snappydata applications can connect, transform and import/export data from S3, CSV, Parquet, Hadoop, NoSQL stores like Cassandra, Elastic, Hbase, all relational databases, object data grids, and more. Essentially, the data model in SnappyData is the the same as Spark.
The deep amalgamation with Spark truly makes SnappyData the most versatile in-memory database available in the market today, or so we surmise.
It is difficult to find big data stores that don’t claim interactive speeds. The common myth is that if data is managed in memory combined with enough CPU (and even GPU), queries will execute at interactive speeds. Practitioners will note that this is often not the case. For instance, you may be surprised to find that Spark executes queries faster on parquet files than on its in-memory cache (see charts below). This has much to do with the layout of the data in-memory and if the code is written keeping in mind modern day multi-core CPU architectures.
SnappyData heavily leverages columnar storage in-memory, using row storage when appropriate, co-locates related data sets to avoid shuffles and a range of other optimizations as described in an earlier blog.
Here are the results from a simple performance benchmark on a macbook pro (4 cores) comparing Spark caching, Spark+Cassandra, Spark + Kudu. We measure the load times for 100 million records as well as the time for both analytic and point queries. We use a single table with 2 columns ( id, symbol). We made sure all comparisons were fair and went through basic documented tuning options.
Note: Lower is better
Note: Snappydata and Spark did not define id as a key in above test.
To summarize, Snappydata is a horizontally scalable cluster that offers high concurrency, extreme write speeds, supports built-in streaming analytics, transactions, Spark transformations, Spark SQL, probabilistic data structures for interactive speeds and data reduction, parallel connectors to virtually every modern data store, extremely high performance analytic querying, run ML/DL training or scoring and supports modeling data as SQL, Objects, JSON or even graph.