Apache Spark is a general-purpose engine for both real-time and batch big data processing. Spark jobs can cache read-only state in memory, but the engine is designed for batch processing: it cannot mutate state (updates/deletes), share state across many users or applications (other than through Hive), or support high concurrency. To support real-time applications that require these features, Spark is increasingly paired with memory-oriented real-time stores like Apache Kudu, Cassandra, Redis, Alluxio, Elastic, etc. And, of course, SnappyData.

In this blog, we evaluate the performance of the “faster” store options for Spark: Alluxio, Apache Cassandra, Apache Kudu, Apache Spark’s native caching, and SnappyData (in both Embedded Mode and Smart Connector Mode). To achieve the best possible performance and a fair comparison, we ensured that all the data was in fact provisioned in memory across all products. We also attempted to include Redis but were unable to query the large dataset chosen for this benchmark.

The benchmark itself is relatively simple, so we are not challenged by the data-type and feature idiosyncrasies of the various products. We use the FAA-published On-Time Arrival data for all domestic flights in the United States (128 million rows), and measure time to load, as well as latencies for simple analytic queries, point queries, and update queries.

We invite you to download the benchmark and try it out.

Summary of the Key Findings  

  • - While Spark’s built-in in-memory caching is convenient, it is not the fastest option. What you might find interesting is that querying Spark’s Parquet storage is faster than querying its in-process cache. Performance is not just about keeping state in memory; the storage format and the optimizations in the engine matter far more.
  • - SnappyData’s embedded mode (store embedded within the Spark execution engine) is faster than all the other products compared, as described below.
  • - SnappyData in embedded mode avoids unnecessary copying of data from external processes and optimizes Spark’s Catalyst engine in a number of ways (the “Why?” section at the end covers how SnappyData achieves this performance gain).
  • - SnappyData’s Smart Connector mode (store cluster isolated from the Spark cluster) is also much faster than the other products compared, as described below.
  • - We observed Alluxio to easily beat all the other store options (except SnappyData). We attribute this primarily to its block-oriented storage layer; it also appears to have been tuned to work natively with Spark. At the same time, it does not support data mutation, which several of the other stores do.
  • - Cassandra, while likely the most popular storage option used with Spark, was the slowest. This is primarily attributable to how Cassandra manages rows internally: it is a row-oriented database and lacks the optimizations commonly seen in columnar databases. Analytical queries tend to scan all rows, which results in excessive pointer chasing - very expensive on modern multi-core architectures. Moreover, the data-format differences between Cassandra and Spark result in high serialization and deserialization overhead.
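
The cost of row-at-a-time access can be sketched in plain Python. This toy model (not Cassandra's or SnappyData's actual storage code) sums one field over rows stored as per-row objects versus the same data laid out as one contiguous column; the row layout forces a pointer dereference per row, while the column layout is a tight scan.

```python
# Toy sketch: row-oriented vs column-oriented scan of the same data.
# (Illustrative only; not how Cassandra or SnappyData actually store data.)
from array import array

n = 1_000_000
# Row layout: one dict per row; scanning a single field touches every row object.
rows = [{"flight_id": i, "dep_delay": i % 60, "carrier": "AA"} for i in range(n)]
# Column layout: the dep_delay column stored contiguously.
dep_delay_col = array("i", (i % 60 for i in range(n)))

row_sum = sum(r["dep_delay"] for r in rows)  # pointer chase per row
col_sum = sum(dep_delay_col)                 # tight scan over one array

assert row_sum == col_sum
```

In a real columnar engine the column scan is further compressed, vectorized, and compiled, widening the gap.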

Experiment Description

For this experiment we used:

  • - Machine: A single machine with 16 cores and 112 GB RAM.
  • - Dataset: On-Time Arrival performance for airlines (1995-2015), ~128 million records, published by the FAA (Federal Aviation Administration).

The dataset is ~18-20 GB in comma-separated values (CSV) format (1.6 GB as Parquet).

You can refer to the Product Runtime Architecture section for a logical overview of the products, and the configuration used for setting up the cluster for each product.

We made sure that all comparisons were fair and went through recommended tuning options for each product.

For Cassandra, we used what we understand to be the optimal configuration parameters and have published the best results obtained for a single-node cluster. We used the default heap size settings and set row_cache_size_in_mb to 20480 (20 GB).
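
For reference, this corresponds to the following cassandra.yaml fragment (an excerpt; all other settings were left at their defaults):

```yaml
# cassandra.yaml (excerpt): give the row cache 20 GB.
# Note: the row cache must also be enabled per table via its caching options.
row_cache_size_in_mb: 20480
```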

Load Performance

Parquet data is loaded into a Spark DataFrame, which is then written into tables in each respective store. The following chart shows the load time in milliseconds for each product.

Results

In SnappyData embedded mode, the load performance is approximately:

  • - 2x faster than Alluxio
  • - 8x faster than Kudu
  • - 60x faster than Cassandra

In SnappyData Smart Connector Mode, the load performance is approximately:

  • - 2x faster than Alluxio
  • - 8x faster than Kudu
  • - 60x faster than Cassandra

Analytical Query Performance

The following five analytic queries were used to measure SnappyData’s OLAP performance against the mentioned datastores:
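
The five queries themselves are published with the benchmark (see the GitHub repository linked below); they are aggregations over the airline schema. Purely to illustrate the query shape, here is a representative aggregation (column names assumed), run against a toy in-memory SQLite table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE airline (Year INT, UniqueCarrier TEXT, ArrDelay INT)")
con.executemany("INSERT INTO airline VALUES (?, ?, ?)",
                [(2014, "AA", 10), (2014, "AA", 20), (2015, "DL", -5)])

# Representative analytic query: average arrival delay per carrier per year.
rows = con.execute("""
    SELECT UniqueCarrier, Year, AVG(ArrDelay)
    FROM airline
    GROUP BY UniqueCarrier, Year
    ORDER BY AVG(ArrDelay) DESC
""").fetchall()
print(rows)  # [('AA', 2014, 15.0), ('DL', 2015, -5.0)]
```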

Response time in milliseconds for each query on Snappy, Spark & Alluxio

In SnappyData embedded mode, the queries are approximately:

  • - 6x-15x faster than Spark
  • - 3x-6x faster than Alluxio

In SnappyData Smart Connector Mode, the queries are approximately:

  • - 3x-6x faster than Spark
  • - 2x-3x faster than Alluxio

Response time in milliseconds for each query on Snappy, Kudu & Cassandra

In SnappyData embedded mode, the queries are approximately:

  • - 80x-120x faster than Kudu
  • - 400x-1800x faster than Cassandra

In SnappyData Smart Connector Mode, the queries are approximately:

  • - 20x-65x faster than Kudu
  • - 200x-700x faster than Cassandra

As expected, embedded mode is faster in every scenario. Smart Connector mode is faster as well, though by a smaller margin. In both modes, the gap against Cassandra is striking; it demonstrates how heavily SnappyData has been optimized for analytics workloads (though it can simultaneously serve transactional workloads). We discuss why SnappyData performs so well at the end of this post. Next, we look at how each datastore performs on point lookups.

Point Lookup Query Performance

The following two queries were used to measure performance of point lookup queries.
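
A point lookup fetches a single row (or a handful) by key, so it stresses indexing and per-row access paths rather than scan throughput. A representative example of the query shape (schema and key columns assumed), again on a toy SQLite table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE airline (FlightNum INT, FlightDate TEXT, ArrDelay INT)")
con.executemany("INSERT INTO airline VALUES (?, ?, ?)",
                [(1173, "2015-01-01", 4), (2457, "2015-01-01", -6)])

# Representative point lookup: fetch a single flight by its key columns.
row = con.execute(
    "SELECT ArrDelay FROM airline WHERE FlightNum = ? AND FlightDate = ?",
    (1173, "2015-01-01")).fetchone()
print(row)  # (4,)
```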

Response time in milliseconds for each query on Snappy, Spark & Alluxio

In SnappyData embedded mode, the queries are approximately:

  • - 30x-60x faster than Spark
  • - 15x-30x faster than Alluxio

In SnappyData Smart Connector Mode, the queries are approximately:

  • - 2-3x faster than Spark
  • - Equivalent with Alluxio

Response time in milliseconds for each query on Snappy, Cassandra & Kudu

In SnappyData embedded mode, the queries are approximately:

  • - 10x-30x faster than Kudu
  • - 3000x-9000x faster than Cassandra

In SnappyData Smart Connector Mode, the queries are approximately:

  • - Slightly slower than Kudu
  • - 150x - 300x faster than Cassandra

The performance benefits are less pronounced for point lookups than for analytical queries, particularly against Alluxio and Kudu. Embedded mode is still faster than every other product, while Smart Connector mode matches Alluxio’s performance and is slightly beaten by Kudu. That said, the gains over Cassandra remain significant.

Smart Connector mode’s weaker showing against Kudu stems from unnecessary data movement caused by a lack of filter pushdown; this is being addressed for a future release.

Now that we’ve looked at load times, analytical queries and point lookups, what about point updates? In the next test, we ask whether data can be mutated in each store and measure the performance of those mutations.

Data Mutation

Which of the aforementioned datastores support data mutation?

  • - Spark: updates not supported
  • - SnappyData: updates supported
  • - Alluxio: updates not supported
  • - Kudu: updates supported
  • - Cassandra: updates supported

Point Update Query Performance

The following two operations were performed to measure the performance of point updates (mutations).
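
A point update mutates a single row in place by key. A representative example of the operation shape (schema assumed), on a toy SQLite table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE airline (FlightNum INT, ArrDelay INT)")
con.execute("INSERT INTO airline VALUES (1173, 4)")

# Representative point mutation: update one row in place by key.
con.execute("UPDATE airline SET ArrDelay = ArrDelay + 10 WHERE FlightNum = 1173")
updated = con.execute(
    "SELECT ArrDelay FROM airline WHERE FlightNum = 1173").fetchone()
print(updated)  # (14,)
```

Spark and Alluxio cannot express this operation at all; their tables must be rewritten wholesale.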

Response time in milliseconds for each query on Snappy, Cassandra & Kudu

In SnappyData embedded mode, the queries are approximately:

  • - 6x-11x faster than Kudu
  • - 300x-400x faster than Cassandra

In SnappyData Smart Connector Mode, the queries are approximately:

  • - 6x-8x faster than Kudu
  • - 250x-300x faster than Cassandra

Conclusion

The bullets below summarize the comparison results: SnappyData is approximately one to three orders of magnitude faster than the other stores for Spark.

SnappyData Embedded Mode:

  • - Load: 2x-60x
  • - Analytical: 3x-1800x
  • - Point lookup: 10x-9000x
  • - Mutation: 6x-400x

SnappyData Smart Connector Mode:

  • - Load: 2x-60x
  • - Analytical: 2x-700x
  • - Point lookup: 0.7x-300x
  • - Mutation: 6x-300x

Why?

The most innovative idea SnappyData brings to the Spark database ecosystem is placing a memory-optimized datastore in the same compute cluster as Spark. In a SnappyData cluster, Spark’s compute components (its driver and executors) and GemFire - a memory-optimized data management system - are deeply integrated. Unifying data processing and data management in one system provides significant performance benefits over any system where the two interact over a connector.

In addition, SnappyData provides several other performance enhancements:

  • - Maximized use of code generation
  • - Plan caching
  • - Optimized scans
  • - Intelligent decoding, and skipping of blocks of data during scans based on predicates
  • - Partition-aware execution
  • - Off-heap memory usage (for both storage and execution, resulting in less garbage generation)
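
One of these techniques, skipping blocks of data based on predicates, can be sketched in plain Python. This is a toy model (not SnappyData's implementation): each column block keeps min/max statistics built at write time, and a scan consults them to skip blocks that cannot contain matching rows.

```python
# Toy model of predicate-based block skipping (not SnappyData's actual code).
# A column is stored as fixed-size blocks, each carrying min/max statistics.
BLOCK = 4
values = [3, 7, 2, 9,   41, 45, 42, 48,   5, 1, 8, 6]
blocks = [values[i:i + BLOCK] for i in range(0, len(values), BLOCK)]
stats = [(min(b), max(b)) for b in blocks]  # built once at write time

def scan_greater_than(threshold):
    """Return matching values, skipping blocks whose max rules them out."""
    out, scanned = [], 0
    for (_, hi), block in zip(stats, blocks):
        if hi <= threshold:  # no row in this block can match the predicate
            continue
        scanned += 1
        out.extend(v for v in block if v > threshold)
    return out, scanned

matches, blocks_scanned = scan_greater_than(40)
print(matches, blocks_scanned)  # [41, 45, 42, 48] 1
```

Only one of the three blocks is actually decoded; for selective predicates over large column tables, this kind of skipping saves most of the scan work.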

These have contributed significant improvements in query latencies and ingestion speeds. More detail on these improvements is provided on this page.

SnappyData is available for download.

Run this benchmark yourself

We encourage you to run this benchmark yourself. All the scripts and information required to reproduce this benchmark are published in the benchmark GitHub repository.

Additional Performance Links

The Spark Database

SnappyData is Spark 2.0 compatible and open source. Download now