We have shown multiple times in the past how deeply integrating an in-memory database with Apache Spark delivers myriad performance benefits. From Joining A Billion Rows to Loading and Running Analytical Queries to Streaming Ingestion and Query Execution, SnappyData has consistently outperformed its counterparts. However, we have never before published results from an industry-standard, third-party benchmark. Today we release our results for the TPC-H benchmark, comparing SnappyData to Apache Spark.
First, a bit about SnappyData. SnappyData is a hybrid analytics platform that fuses Apache Spark with a robust in-memory database. Thanks to its performance and concurrency benefits, SnappyData has been adopted around the world to power visualization dashboards, streaming IoT analytics, real-time ETL scenarios and more. It is offered in both an enterprise version and an open source version.
Benchmarking a hybrid system via a third party is difficult. There are many promising newer benchmarks, such as TPCx-BB (which runs SQL queries in Hadoop/Spark on structured data and ML algorithms on semi-structured/unstructured data) and HTAPBench (which uses JDBC to insert data while performing a mix of OLAP and OLTP queries), but none are quite as well known or standard as TPC-H. TPC-H has been carefully crafted over nearly two decades to present several technical challenges to databases that execute SQL. Its 22 queries exercise six common choke points for a database: aggregation performance, join performance, data access locality, expression calculation, correlated subqueries and parallel execution. As such, outperforming other systems on the TPC-H benchmark is an important milestone for any database vendor.
All experiments were run in the Azure cloud. The cluster had four data servers, each with 256 GB of RAM and 32 cores.
Our initial Apache Spark results were 2x slower than its tuned results, with high variance. We found we could eliminate many expensive shuffles across executor nodes by setting "spark.locality.wait" and "spark.locality.wait.process" to 30s, and by increasing "spark.executor.cores" to 48 on a cluster with 128 cores total, oversubscribing so that the number of cores exceeds the number of data partitions. SnappyData also benefits from this tuning, though its default parameters work quite well. We also found that, in SnappyData, running more server processes with a smaller heap/off-heap outperformed fewer servers with a bigger heap/off-heap. SnappyData's memory requirement is much lower than Apache Spark's, making it easier to scale out to multiple servers/executors on a machine.
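For readers who want to reproduce this tuning, it can be sketched in PySpark as follows. This is only a sketch (the application name is hypothetical); the configuration keys and values are the ones described above:

```python
# Sketch of the Spark tuning described above (assumes PySpark is installed).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tpch-benchmark")  # hypothetical app name
         # Wait up to 30s for a data-local slot instead of falling back to a
         # remote executor, avoiding expensive shuffles across executor nodes.
         .config("spark.locality.wait", "30s")
         .config("spark.locality.wait.process", "30s")
         # Oversubscribe cores so the total core count exceeds the number of
         # data partitions (48 per executor on a 128-core cluster).
         .config("spark.executor.cores", "48")
         .getOrCreate())
```

The same settings can equally be passed on the command line via `--conf` flags to `spark-submit`.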
As described above, TPC-H consists of 22 queries. To give a general idea of how the systems perform overall, we present two summary metrics: the geometric mean (GeoMean) of the per-query times and the total running time. GeoMean is a measure of central tendency that, like the median, damps the influence of outliers. Total running time sums the average execution time of each query.
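Concretely, the GeoMean of n query times is the n-th root of their product, or equivalently the exponential of the mean of their logarithms. The sketch below uses made-up timings, not our benchmark numbers:

```python
import math

def geomean(times):
    # Geometric mean: exp of the average log time,
    # i.e. the n-th root of the product of n values.
    return math.exp(sum(math.log(t) for t in times) / len(times))

# Hypothetical per-query timings in seconds (illustration only).
spark_times  = [8.0, 16.0, 32.0]
snappy_times = [1.0,  2.0,  4.0]

speedup = geomean(spark_times) / geomean(snappy_times)  # 16.0 / 2.0 = 8.0
total_speedup = sum(spark_times) / sum(snappy_times)    # 56.0 / 7.0 = 8.0
```

Because GeoMean multiplies rather than sums, a single very slow query cannot dominate the metric the way it can dominate total running time.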
The GeoMean was found to be 8x lower for SnappyData than Spark. In short, SnappyData queries are 8x faster on average:
The total running time of all 22 queries is 7x to 8x lower for SnappyData than Spark. In short, SnappyData runs all 22 queries in less than a seventh of the time Spark takes:
SnappyData was faster than Apache Spark across all the TPC-H queries. Since individual queries test the database differently, the speedups span a large range. Many queries were an order of magnitude faster; Query 6 was two orders of magnitude faster. We have elided the individual query results for brevity, but you can view them as part of our larger benchmark report here.
Why is SnappyData faster?
SnappyData implements a number of performance optimizations that enable it to perform better than Apache Spark.
- Much more efficient column table scans than Spark's cache (10x to 20x faster in single-threaded queries).
- 4x to 10x faster hash group-by aggregations than Apache Spark. Similarly, hash joins with reference tables are faster in SnappyData than Spark's broadcast join. Further, SnappyData combines partial and final results into a single hash aggregation when the grouping keys match, or are a superset of, a table's partitioning keys.
- SnappyData can collocate tables on the same node to avoid shuffles, which improved a large number of cases.
- SnappyData's hash joins outperformed Spark's sort-merge join in several cases.
- SnappyData estimates data sizes more accurately than Spark, so broadcasts can be used more extensively.
- The dramatic speedup in Query 6 was primarily due to SnappyData's more efficient handling of DATE string comparisons.
- Query 19 initially ran slower in SnappyData due to a bug. Once resolved, the fix enabled better handling of common sub-expressions, and a filter pushdown produced a more optimized query plan.
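To illustrate the partial+final aggregation point above: when a table's partitioning key equals (or is contained in) the grouping key, every group lives entirely within one partition, so the per-partition hash aggregates are already final and the usual second merge pass, with its shuffle, is unnecessary. A minimal pure-Python sketch of the idea, with toy data and not SnappyData's actual internals:

```python
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    # Route each row to a partition by hashing its grouping key.
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def local_hash_aggregate(rows, key, val):
    # A single hash-aggregation pass over one partition (SUM ... GROUP BY key).
    acc = defaultdict(int)
    for row in rows:
        acc[row[key]] += row[val]
    return dict(acc)

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2},
        {"k": "a", "v": 3}, {"k": "b", "v": 4}]
parts = hash_partition(rows, "k", 2)

# Because the partitioning key equals the grouping key, each group is
# confined to a single partition: the local aggregates are already final,
# so no second (merge) aggregation or shuffle is needed.
final = {}
for part in parts:
    final.update(local_hash_aggregate(part, "k", "v"))
```

If the grouping key did not align with the partitioning key, rows for one group could land in several partitions, and the partial results would have to be shuffled and merged in a second aggregation step.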
So there you have it: SnappyData runs all 22 TPC-H queries faster than Apache Spark, executing the full suite roughly 8x faster overall. We made the design decision nearly three years ago to integrate Spark deeply with an in-memory database because of the countless benefits Spark offers as an analytics platform; we knew our integration would become a "Spark++," particularly in the realms of performance, concurrency and availability. Establishing SnappyData's superiority on the TPC-H benchmark is another step in a series of steps demonstrating its performance. What about concurrency? Stay tuned for our forthcoming blog on what happens when thousands of point lookups are executed simultaneously with multiple concurrent TPC-H analytics queries.