SnappyData Performance

20x faster than Apache Spark

A simple analytic query that scans a 100-million-row column table shows SnappyData outperforming Apache Spark by 12-20x when both products hold all the data in memory. The test is a like-for-like comparison of query run times on the two products, and it showcases the code generation and vectorization improvements in SnappyData that account for the performance difference.

You can try out this example yourself by following the instructions in our quick start guide.

Additionally, SnappyData supports co-locating tables, including rebalancing co-located tables in a running system. Joins on co-located tables avoid unnecessary shuffles, which improves performance dramatically. Coming soon: our TPC-H benchmark, which showcases our performance with fewer shuffles.
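To illustrate why co-located tables avoid a shuffle, here is a minimal sketch in plain Scala. It is not SnappyData's implementation: the `Customer` and `Order` tables, the bucket count, and the helper names are all hypothetical. The only assumption that matters is that both tables are hash-partitioned on the join key with the same function, so bucket i of one table only ever needs bucket i of the other.

```scala
// Illustrative only: hypothetical tables and helpers, not SnappyData APIs.
case class Customer(id: Long, name: String)
case class Order(orderId: Long, customerId: Long, amount: Double)

val numBuckets = 4

// Both tables share the same partitioning function on the join key.
def bucketOf(key: Long): Int = Math.floorMod(key, numBuckets.toLong).toInt

def partition[T](rows: Seq[T])(key: T => Long): Map[Int, Seq[T]] =
  rows.groupBy(r => bucketOf(key(r)))

// Join bucket b of one table only with bucket b of the other.
// No row ever has to move to a different bucket, so there is no shuffle.
def colocatedJoin(
    customers: Map[Int, Seq[Customer]],
    orders: Map[Int, Seq[Order]]): Seq[(String, Double)] =
  (0 until numBuckets).flatMap { b =>
    val nameById = customers.getOrElse(b, Seq.empty).map(c => c.id -> c.name).toMap
    orders.getOrElse(b, Seq.empty).flatMap { o =>
      nameById.get(o.customerId).map(name => (name, o.amount))
    }
  }

val customersByBucket =
  partition(Seq(Customer(1, "a"), Customer(2, "b"), Customer(5, "c")))(_.id)
val ordersByBucket =
  partition(Seq(Order(10, 1, 5.0), Order(11, 5, 7.5), Order(12, 9, 1.0)))(_.customerId)

val joined = colocatedJoin(customersByBucket, ordersByBucket)
```

If the two tables were partitioned differently, one side would first have to be redistributed by the join key, which is the shuffle that co-location removes.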

454x faster than Cassandra

Since Spark is not a database, you have two options for working with changing data in Spark: either load a snapshot of the data into Spark, run queries against it, and repeat the process periodically, or use the database's Spark connector to run a Spark SQL query against the database. Both have disadvantages. In the first case, you are working with stale data that is only as current as the last load. In the second, the connector has to move data from the database cluster to Spark, which is inefficient. In this exercise, we ran the same benchmark against Spark with Cassandra.

We used Cassandra version 3.9 and the latest version of the Spark connector for the test, with a replication factor of 1 and Cassandra co-located with Spark. The cost of executing the query through the Spark Cassandra connector is much higher than running the same workload in SnappyData. The elimination of unnecessary deserialization and data movement in SnappyData, coupled with higher parallelism, makes SnappyData a superior choice for running Spark workloads on changing or static data sets.

Steps to run the Spark-Cassandra test


                    // Start a single Cassandra server (release 3.9), then use cqlsh to
                    // create a keyspace called 'test' with a single table 'casstable':

                    CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
                    CREATE TABLE test.casstable (id bigint PRIMARY KEY, sym varchar);

                    // In the Spark shell, do the following:

                    import org.apache.spark.sql.cassandra._

                    // Generate 100 million rows with 100 distinct 'sym' values.
                    val testDF = spark.range(100000000)
                      .selectExpr("id", "concat('sym', cast((id % 100) as STRING)) as sym")

                    // Time the insert (arguments assumed to be: name, timed runs, warm-up runs).
                    benchmark("CASS insert perf", 1, 0) {
                      testDF.write.format("org.apache.spark.sql.cassandra")
                        .options(Map("table" -> "casstable", "keyspace" -> "test")).save()
                    }

                    // This loads the 100 million records into the table. It took about 1000 seconds.

                    // ------- To test analytic query performance -------

                    val df = spark.read.format("org.apache.spark.sql.cassandra")
                      .options(Map("table" -> "casstable", "keyspace" -> "test")).load()
                    df.createOrReplaceTempView("cassTable")
                    benchmark("CASS QUERY perf") {
                      spark.sql("select sym, avg(id) from cassTable group by sym").collect()
                    }

                

The benchmark function can be found here
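For readers who want to run the steps without looking up the original helper, a minimal sketch of a compatible `benchmark` function follows. It is a hypothetical stand-in, not necessarily the function used for the published numbers: the parameter names, the default values for `times` and `warmups`, and the returned average are assumptions chosen to match the two call shapes used above.

```scala
// Hypothetical stand-in for the benchmark helper referenced above.
// Parameter names and default values are assumptions.
def benchmark(name: String, times: Int = 10, warmups: Int = 6)(f: => Unit): Double = {
  // Warm-up runs give the JIT a chance to compile hot paths before timing starts.
  for (_ <- 1 to warmups) f

  val start = System.nanoTime()
  for (_ <- 1 to times) f
  val avgMillis = (System.nanoTime() - start) / (times * 1e6)

  println(f"Average time taken in $name for $times runs: $avgMillis%.2f ms")
  avgMillis
}
```

With this shape, `benchmark("CASS insert perf", 1, 0) { ... }` runs the block once with no warm-up, while `benchmark("CASS QUERY perf") { ... }` uses the default counts.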

How SnappyData improves upon Spark Performance

Apache Spark’s optimizations are designed for disparate data sources, which tend to be mostly external, such as HDFS or Alluxio. For better response times on queries over non-changing data sets, Spark recommends caching data from external sources as cached tables in Spark and then running queries against these cached structures, where the data is stored in an optimized column format. While this dramatically improves performance, we found a number of areas for further improvement. For instance, when scanning columns managed as byte arrays, the values for each row are copied into an "UnsafeRow" object and then read back from that row, which breaks vectorization and introduces a lot of expensive copying.
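The difference between the two scan styles can be sketched in plain Scala. This is an illustration only, not Spark or SnappyData code: `RowCopy` is a hypothetical stand-in for a per-row object such as "UnsafeRow", and plain arrays stand in for a cached column batch.

```scala
// Illustrative only: plain arrays stand in for a cached column batch.
val idColumn: Array[Long] = Array.tabulate(1000)(_.toLong)
val priceColumn: Array[Double] = Array.tabulate(1000)(i => i * 0.5)

// Hypothetical stand-in for a per-row object such as UnsafeRow.
final case class RowCopy(id: Long, price: Double)

// Row-at-a-time: every row is copied into an object before its values are read.
def sumViaRows(): Double = {
  var total = 0.0
  var i = 0
  while (i < idColumn.length) {
    val row = RowCopy(idColumn(i), priceColumn(i)) // extra copy per row
    total += row.price
    i += 1
  }
  total
}

// Column-at-a-time: a tight loop over one primitive array, with no per-row object.
def sumViaColumn(): Double = {
  var total = 0.0
  var i = 0
  while (i < priceColumn.length) {
    total += priceColumn(i)
    i += 1
  }
  total
}
```

Both functions return the same sum; the second touches only the column it needs and leaves the loop in a form that is easier for the JIT to optimize.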

Addressing these inefficiencies, however, is not easy, because the data in the column store may have been compressed using a number of different algorithms, such as dictionary encoding and run-length encoding. SnappyData has implemented alternate decoders for these, so it can get the full benefit of code generation and vector processing.
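As a rough illustration of why encoding-aware operators help, the sketch below runs a group-by average (the same shape as `select sym, avg(id) ... group by sym`) directly on dictionary codes. The encoding here is hand-rolled for the example and is an assumption; it does not reflect SnappyData's actual column format or decoders.

```scala
// Illustrative only: a hand-rolled dictionary encoding, not SnappyData's column format.
val symColumn: Array[String] = Array.tabulate(1000)(i => "sym" + (i % 10))
val idValues: Array[Long] = Array.tabulate(1000)(_.toLong)

// Encode: each distinct string gets a small integer code.
val dictionary: Array[String] = symColumn.distinct
val codeOf: Map[String, Int] = dictionary.zipWithIndex.toMap
val codes: Array[Int] = symColumn.map(codeOf)

// Group-by average computed on the integer codes.
// Strings are touched once per group at the end, not once per row.
def avgIdBySym(): Map[String, Double] = {
  val sums = new Array[Long](dictionary.length)
  val counts = new Array[Long](dictionary.length)
  var i = 0
  while (i < codes.length) {
    val c = codes(i)
    sums(c) += idValues(i)
    counts(c) += 1
    i += 1
  }
  dictionary.indices.map(c => dictionary(c) -> (sums(c).toDouble / counts(c))).toMap
}
```

The inner loop works only on primitive arrays indexed by code, so no string hashing or comparison happens per row.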

Significant Optimizations Include:

  • SnappyData's storage layer now allows co-location of partitioned tables. This information is used to eliminate the most expensive portions of many joins (such as the shuffle) and turn them into co-located one-to-one joins. This release significantly enhances the physical plan strategy phase to aggressively eliminate data shuffle and movement in many more cases.
  • Plan caching is supported to avoid repeating the query parsing, analysis, optimization, strategy and preparation phases.
  • Spark has been changed to broadcast data with the tasks themselves to significantly reduce task latencies. A set of tasks scheduled from the driver to an executor is grouped as a single TaskSet message, with common task data sent only once instead of in separate messages for each task. Task data is also compressed to further reduce network usage.
  • An efficient pooled version of the Kryo serializer has been added and is now used by default for data, closures and lower-level netty messaging. This, together with the improvements mentioned in the previous point, significantly reduces the overall latency of short tasks (from close to 100 ms down to a few ms). These changes also reduce the CPU consumed by the driver, enhancing the concurrency of tasks, especially shorter ones.
  • Enhancements in column-level statistics allow column batches to be skipped based on query predicates where possible. For example, time-range queries can now scan only the batches that fall in the requested range and skip the others, providing a huge boost to such queries on time-series data.
  • Alternate hash aggregation and hash join operators have been added that are finely optimized for SnappyData storage, making full use of the storage layout with vectorization and dictionary encoding to provide an order-of-magnitude performance advantage over Spark's default implementations.
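The batch-skipping idea from the list above can be sketched in plain Scala. This is a simplified illustration under assumptions: the `ColumnBatch` layout, its min/max fields, and the helper names are hypothetical and do not reflect SnappyData's storage format.

```scala
// Illustrative only: hypothetical batch layout, not SnappyData's storage format.
final case class ColumnBatch(
    minTs: Long,
    maxTs: Long,
    timestamps: Array[Long],
    values: Array[Double])

// Record min/max statistics once, when the batch is built.
def makeBatch(ts: Array[Long], vs: Array[Double]): ColumnBatch =
  ColumnBatch(ts.min, ts.max, ts, vs)

// Sum values with from <= timestamp <= to.
// Returns the sum and the number of batches that actually had to be scanned.
def sumInRange(batches: Seq[ColumnBatch], from: Long, to: Long): (Double, Int) = {
  var total = 0.0
  var scanned = 0
  // Batches whose [minTs, maxTs] range cannot overlap the predicate are skipped.
  for (b <- batches if b.maxTs >= from && b.minTs <= to) {
    scanned += 1
    var i = 0
    while (i < b.timestamps.length) {
      if (b.timestamps(i) >= from && b.timestamps(i) <= to) total += b.values(i)
      i += 1
    }
  }
  (total, scanned)
}

// Ten batches of 100 rows each, with timestamps 0..999 in order and value 1.0 per row.
val batches: Seq[ColumnBatch] = (0 until 10).map { b =>
  val ts = Array.tabulate(100)(i => (b * 100 + i).toLong)
  makeBatch(ts, ts.map(_ => 1.0))
}

val (total, scanned) = sumInRange(batches, 250L, 349L)
```

In this example only the two batches that overlap the requested range are scanned; the other eight are rejected by comparing two numbers each. The benefit depends on the data being roughly ordered by the filtered column, as is typical for time-series data.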

See a benchmark

SnappyData, MemSQL-Spark & Cassandra-Spark: A Benchmark