The SnappyData Blog

  • Why every Spark developer should care about Kubernetes

Jags Ramnarayan, Amogh Shetkar, Shirish Deshmukh

Kubernetes is all the rage right now, but why should Spark developers care about it? How can Kubernetes help with the problems Spark developers run into when provisioning, deploying, scaling, monitoring, and transitioning Spark clusters? Find out more inside.

  • Real-Time Streaming ETL with SnappyData

Sudhir Menon

In this blog we introduce the rationale for real-time streaming ETL and the advantages of the SnappyData approach. We also compare it with older approaches to ETL and show how it overcomes their limitations. SnappyData's ETL tool is currently under development and will be GA later this year.

  • Making Apache Spark the most versatile, fast data platform ever

Jags Ramnarayan

SnappyData's 1.0 version is now generally available. In the last year, the team closed about 1000 JIRA tickets and improved performance 5-10 fold while supporting several customers and the community. The project added roughly 200K lines of source code and another 70K lines of test code. Learn more in this blog.

  • How Mutable DataFrames improve join performance in Spark SQL

Sudhir Menon

In this blog we showcase a credit card fraud detection example in which performance is limited when a vanilla Spark solution joins a streaming DataFrame with a static DataFrame. We then demonstrate how Mutable DataFrames in SnappyData improve performance. Code examples are provided.
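The stream-static join pattern the post benchmarks can be sketched in plain Python. This is a conceptual stand-in, not SnappyData or Spark API code; the names (`Transaction`, `card_holders`, `join_batch`) are illustrative only:

```python
# Conceptual sketch of a stream-static join for fraud detection:
# each streaming micro-batch of transactions is joined against a static
# (or, with SnappyData, mutable) reference table of card holders.
from dataclasses import dataclass

@dataclass
class Transaction:
    card_id: int
    amount: float

# Static reference table: card_id -> risk score
# (in Spark this would be a static DataFrame)
card_holders = {1: 0.1, 2: 0.9, 3: 0.5}

def join_batch(batch, reference):
    """Hash-join one streaming micro-batch against the static table."""
    return [
        (txn.card_id, txn.amount, reference[txn.card_id])
        for txn in batch
        if txn.card_id in reference
    ]

batch = [Transaction(1, 42.0), Transaction(2, 999.0), Transaction(7, 5.0)]
joined = join_batch(batch, card_holders)
# card_id 7 has no match and is dropped; 1 and 2 join successfully
```

A mutable reference table means `card_holders` can be updated between micro-batches without rebuilding the join side, which is the gap the post says vanilla Spark leaves open.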

  • Running Spark SQL CERN queries 5x faster on SnappyData

Sudhir Menon

In a recent blog post, Luca Canali from CERN tested the performance improvement between Spark 1.6 and Spark 2.0 using a Spark SQL join with two conditions. CERN measured a 7x performance improvement from Spark 1.6 to 2.0. We ran the same query on equivalent hardware and measured a further 5x improvement from Spark 2.0 to SnappyData. Learn more inside.
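For readers unfamiliar with the shape of such a query, here is a minimal sketch of a join with two conditions. This uses sqlite3 as a stand-in for Spark SQL, and the tables and columns are made up for illustration, not taken from the CERN benchmark:

```python
# Sketch of a two-condition join: rows match only when BOTH
# predicates in the ON clause hold.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(id INTEGER, grp INTEGER);
    CREATE TABLE t2(id INTEGER, grp INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO t2 VALUES (1, 10), (2, 99), (4, 30);
""")
rows = conn.execute("""
    SELECT t1.id, t1.grp
    FROM t1 JOIN t2
      ON t1.id = t2.id AND t1.grp = t2.grp  -- two join conditions
""").fetchall()
# Only id=1 satisfies both conditions: id=2 fails on grp, id=3 on id.
```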

  • Joining a billion rows 20x faster than Apache Spark

Sumedh Wale

One of Databricks' best-known blog posts describes joining a billion rows in a second on a laptop. Since this is a fairly easy benchmark to replicate, we thought, why not try it on SnappyData and see what happens? We found that for joining two columns with a billion rows, SnappyData is nearly 20x faster.
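The shape of the benchmark can be sketched in plain Python: join two integer key columns and count the matching rows. This is a scaled-down illustration only; the actual benchmarks run the join over a billion rows in Spark and SnappyData, not in pure Python:

```python
# Scaled-down sketch of a "join two key columns and count" benchmark.
# n is kept small here so the sketch runs in seconds; the blogs use 10**9.
n = 1_000_000
left = range(n)                  # first column of join keys
right = set(range(0, n, 2))      # second column: only the even keys
matches = sum(1 for key in left if key in right)  # hash-join, then count
# Exactly the even keys find a partner, so matches == n // 2.
```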