SnappyData’s Apache Spark-based, Unified In-Memory Cluster will augment TIBCO’s Connected Intelligence Platform, Increasing Volume, Speed and Real-time Capabilities

We are really excited about what lies ahead, but first we want to take a moment to reflect on our journey. It started with GemFire, an in-memory data grid targeting enterprises looking to harness the power of on-premises grid infrastructure to build a new generation of sub-second transactional systems that could scale to thousands of users. VMware acquired GemFire in 2010.

Years of customer experience with GemFire led us to new big data opportunities focused on our customers’ desire to leverage the benefits of in-memory performance, scale-out, real-time updates and access to ever-increasing amounts of data. As we started to work with customers, we soon realized that they had to deal with complex architectures - a range of Hadoop products, streaming, in-memory technology like GemFire and MPP databases. Apache Spark was in its infancy but was quite appealing as a general-purpose distributed computational framework for big data. We experimented with combining Spark and our in-memory technology, validated our ideas with Pivotal customers and decided to launch SnappyData as an independent company.

The core vision for SnappyData was driven by the realization that true, interactive analytics on big data was a myth. Could we speed things up and simplify the architecture along the way? We believed that the ability to perform streaming analytics, interactive analytics and predictions in a single cluster would create significant, sustainable business advantages for mainstream enterprises. Over the past three years, we have executed on this vision, jumping over many hurdles along the way. Interestingly, as we rounded off 2018, the wider market began to align around the core value proposition that we articulated in a SIGMOD paper three years ago.

All this, of course, wouldn’t have been possible without the great support of the executives at Pivotal (our former employer) and the phenomenal GemFire engineers who came along with us on this journey. We will be eternally grateful for their support.

A word on TIBCO and its comprehensive stack for modern data analytics

For those of us who have been around enterprise tech for a while, TIBCO is recognized as the pioneer in the real-time data space with its messaging and integration products.

In recent years, the company has made huge investments to develop one of the most comprehensive, enterprise-ready analytics stacks on the market. Meanwhile, a deluge of big data analytics platforms centered around open source technologies has entered the market. Most provide a thin veneer of services on top of popular open source projects, primarily targeting big data engineers rather than data analysts and scientists (take Amazon EMR, HDInsight or pick your Hadoop vendor of choice).

This purely developer-centric approach, while attractive for highly customized deployments, can be complex, cumbersome and very expensive. From our standpoint, TIBCO’s approach of using a self-service, AI-driven visualization paradigm across the entire analytic pipeline is quite appealing; it is simpler, permits faster iterations to derive insight and is cost-effective. Whether you are exploring data to understand relationships or developing a complete machine learning pipeline that requires extensive collaboration among data engineers and scientists, TIBCO's platform enables all this through visual tools with a great user experience. TIBCO's comprehensive stack for analytics is depicted in the schematic below:

So, where do we fit in?

SnappyData's mission has been to simplify the overly complex big data architectures we see today, created by stitching together batch-oriented products (Hadoop, MPP databases) and real-time products (operational databases like MongoDB, HBase and Cassandra, messaging like Kafka, and point streaming solutions). We reduce this complexity by fusing Spark with a hybrid in-memory database capable of both analytic and real-time workloads, creating a single unified platform. While leveraging the flexible and increasingly ubiquitous Spark API and framework, we meet the pressing need to manage data in memory with consistency, higher concurrency and much better performance.

While we think that Spark's dominance will continue to grow, our experience provides a glimpse into the challenges to wide adoption. Spark and big data technologies in general remain challenging for many: their primary appeal is to engineers who can grasp the intricacies of distributed computing and managing distributed data. For everyone else, Spark still has a steep learning curve. It is an easy and productive tool only for the trained eye.

Now, with integrations into TIBCO platform products, it will be much easier for a diverse audience of data analysts, engineers and scientists to get the benefits of in-memory distributed analytics, higher performance and the flexibility of Spark without first having to master Spark.

For instance, we envision improving the Spotfire experience by delivering interactive queries on big data through transparent augmentation of its built-in in-memory store. You will be able to use its visual tools to prepare data from big data sources at greater speed by parallelizing the computations within SnappyData, and to publish the data sets for interactive analytics into Snappy Column, Row or Sampling tables. You can blend data from different sources, change data types, pivot, group or even infer data relationships, all using Spotfire's data canvas, which records all the data wrangling steps for traceability and auditing.

We also envision enhancing Spotfire real-time streaming by ingesting streams into SnappyData for more scalable stream analytics, interactive queries that can blend streams with historical or reference data, and much more. The combination will enable more insightful and actionable questions, such as: What is happening now? How does it correlate with what has happened? And what action should I take?

While much remains to be done, we have already integrated SnappyData with Spotfire by building a native connector and have started to work closely with the TIBCO Data Science team to explore how SnappyData can be an optional runtime - one that is more performant, flexible and capable than Apache Spark. With its support for higher concurrency, scientists will be able to explore large data sets at interactive speeds using SQL or Spark APIs, manage prepared datasets in SnappyData and be able to run operators in the machine learning pipeline using parallel computations that seamlessly access the prepared datasets. It will also be easier to retrain or continuously test models as new data arrives to deal with model drift issues.

Read more about the announcement on the TIBCO website.

Read the blog post by Mark Palmer, SVP of TIBCO Analytics.
