What SnappyData, Spark, and in-memory database topics got the most attention on our blog in 2017? If one thing is clear, it’s that our readers love performance blogs. Of the top five blogs, four were benchmark or performance related. I suspect that if our December 2017 benchmark blog on SnappyData vs Kudu, Cassandra & Alluxio had been written earlier, it would have edged out the one non-performance blog in the top five.

The most impressive result in 2017 came from our Lead Engineer Sumedh Wale’s blog: Joining a Billion Rows 20x Faster than Apache Spark. Nearly 1 in every 2 of you who came to our site touched this blog in some way, which added up to a number of page views that blew us away. Many of you landed on it as your first interaction with SnappyData, and many found it through cross links on the site. Overall, you made it very obvious what type of content you want from us, so you can expect more of it in 2018.

Our engineering team puts a lot of time into creating blog posts that share the knowledge we’ve gained from turning Spark into a database. We were amazed to see how much traffic the blog accounted for in 2017; you gave us a great sense of accomplishment. That said, we know this is just the tip of the iceberg, and we are committing to an even stronger year of quality content in 2018.

Without further ado, here are the five most viewed blogs of 2017, ranked by unique pageviews. See if there are any top blogs you missed and take a look at how SnappyData is unifying Data Theory and Practice.

1.) Joining a Billion Rows 20x Faster than Apache Spark

By Sumedh Wale. This blog draws on a billion-row join that Databricks performed earlier in the year on vanilla Spark. We wondered: how would the exact same join fare on SnappyData? It turns out SnappyData ran it 20 times faster.

2.) SnappyData, MemSQL-Spark & Cassandra-Spark: A Performance Benchmark

By Yogesh Mahajan. This blog compares SnappyData to MemSQL-Spark & Cassandra-Spark in the context of an ad analytics use case. It tests ingestion performance and concurrent query workload performance. It includes an associated GitHub repo and screencast so you can see how we did it or try it yourself.

3.) Your Father’s Database: What if Apache Spark was also the Database?

By Pierce Lamb. The only non-performance blog of the bunch, this one responds to a Databricks blog post whose author claims that Spark is not your father’s database. The Databricks post offers advice on how to avoid getting stuck on traditional SQL database-type use cases in Spark. This blog demonstrates that when Spark is deeply integrated with an in-memory database, it can look, feel and act exactly like your father’s traditional SQL database.

4.) Running Spark SQL CERN Queries 5x Faster on SnappyData

By Sudhir Menon. Luca Canali of CERN wrote a blog testing Spark 2.0 on queries used inside CERN. Once again we wondered: how would these queries fare on SnappyData? Long story short, the queries ran about 5x faster than on Spark.

5.) How Mutable DataFrames Improve Join Performance in Spark SQL

By Sudhir Menon. The inspiration for this blog was an email to the Spark mailing list. The user had a Spark Structured Streaming application that he was joining to a batch DataFrame that was updated occasionally. Because DataFrames are fundamentally immutable, he had to restart the streaming app every time the batch DataFrame was updated. SnappyData can mutate data inside DataFrames in both row and column format, much like a classic SQL database. This ability leads to big performance gains over the vanilla Spark scenario, where the batch DataFrame has to be rebuilt and the streaming app restarted.
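To make the difference concrete, here is a minimal conceptual sketch in plain Python. It does not use Spark or SnappyData APIs; the `join_batch` helper, the campaign names, and the dictionaries standing in for tables are all hypothetical. It only illustrates why a stream joined against a frozen snapshot misses later updates until it is rebuilt, while a stream joined against a mutable table sees them immediately.

```python
from types import MappingProxyType


def join_batch(events, lookup):
    """Inner-join a micro-batch of (key, value) events against a lookup table."""
    return [(key, value, lookup[key]) for key, value in events if key in lookup]


# Reference data that is updated occasionally (hypothetical campaign names).
reference = {1: "spring_sale"}

# Immutable style: the stream captures a read-only copy when it starts.
snapshot = MappingProxyType(dict(reference))

# Mutable style: the stream keeps a handle to the live table.
live_table = reference

batch_1 = [(1, 10), (2, 20)]
frozen_before = join_batch(batch_1, snapshot)
live_before = join_batch(batch_1, live_table)

# The reference data changes while the stream is running.
reference[2] = "summer_sale"

batch_2 = [(1, 11), (2, 21)]
frozen_after = join_batch(batch_2, snapshot)   # still misses key 2
live_after = join_batch(batch_2, live_table)   # picks up key 2 right away

# The frozen side only catches up after a rebuild (the "restart").
snapshot = MappingProxyType(dict(reference))
frozen_restarted = join_batch(batch_2, snapshot)
```

In this toy model the rebuild is a single line, but it stands in for the cost described above: rebuilding the batch side and restarting the streaming job on every update.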

SnappyData, the Apache Spark database, is Spark 2.0 compatible, open source, and available for download.