Your Father's Database: What if Apache Spark were also the Database?

In April, Databricks hosted a webinar titled “Not your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture.” The talk was also given at Spark Summit East.

As a solutions engineer for Databricks, the presenter is very familiar with the ways enterprises have attempted to use Spark. She says that every Spark user she talks to starts out with essentially the same architecture: Spark loading files from HDFS into memory and querying them with SQL.

In short, everyone starts out wanting to use Spark like a SQL database. However, she notes, this architecture performs nothing like one: loading HDFS files into memory and processing them will almost always be slower than processing the same data inside a traditional SQL database. And so the ecosystem has responded by introducing connectors from Spark to more traditional datastores.

And these connectors have worked pretty well at giving Spark the ability to read from and write to these databases. However, having isolated database processes connected to isolated Spark processes introduces its own inefficiencies. For example, an extra serialization/deserialization step has to be performed every time the database is hit, which can sometimes be the biggest blow to execution performance. Further, a connector does not fully solve Spark’s concurrency problem: executors still live and die with jobs. Along the same lines, the Spark driver is not highly available, so if it fails, the job fails.
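To make the connector pattern concrete, here is a minimal sketch in Scala of Spark reading from and writing back to an external SQL database over JDBC. It assumes a spark-shell session (where `spark` is predefined); the host, table names, and credentials are placeholders, not references to any real system:

```scala
// Read a table from an external database through the JDBC connector.
// Every row is serialized over the wire and deserialized into Spark's
// internal format before any processing can start.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")   // placeholder URL
  .option("dbtable", "orders")
  .option("user", "report_user")
  .option("password", "secret")
  .load()

// Aggregate in Spark...
val dailyTotals = orders.groupBy("order_date").sum("amount")

// ...and pay the serialization cost again writing the result back out.
dailyTotals.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")
  .option("dbtable", "daily_totals")
  .option("user", "report_user")
  .option("password", "secret")
  .mode("append")
  .save()
```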

In this sense, a connector provides a temporary solution, but it doesn’t really give us what we want: a fast, highly concurrent, highly available architecture for executing Spark jobs, particularly with the use of SQL. What are some other ways enterprises try to use Spark that are natural for a SQL database but inefficient in Spark?

Random access, a common pattern for any SQL database, will be unequivocally slower in Spark: it has to churn through files to find your row.
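A small, hedged illustration of that difference (paths and column names are made up; `spark` is again the spark-shell session): a single-row lookup that a SQL database would answer from a primary-key index becomes a filtered scan over files in Spark:

```scala
// A point lookup over Parquet files on HDFS: with no index to consult,
// Spark plans this as a scan (with partition/row-group pruning at best).
val users = spark.read.parquet("hdfs:///data/users")   // illustrative path
users.createOrReplaceTempView("users")

spark.sql("SELECT * FROM users WHERE user_id = 42").show()
```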

Frequent inserts + querying the new inserts. This pattern will be slow for the same reason as above: instead of mutating data in place, Spark has to create new files that then have to be loaded and churned through at query time.
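A sketch of that pattern under the same assumptions (the schema and paths are illustrative): each small insert lands as new files, and the follow-up query re-reads the whole, ever-growing file set:

```scala
import spark.implicits._

// A micro-batch of new rows; with file-backed Spark this becomes new files
// under the table's directory rather than an in-place insert.
val newEvents = Seq((101, "click", "2016-06-01 12:00:00"))
  .toDF("user_id", "action", "event_time")

newEvents.write.mode("append").parquet("hdfs:///data/events")

// Querying the fresh rows still means listing and scanning all of the
// files written so far, including the many small ones just appended.
spark.read.parquet("hdfs:///data/events")
  .where("event_time >= '2016-06-01 12:00:00'")
  .count()
```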

Attempting to analyze changes to one’s production SQL database as they happen. The presenter notes that many customers she’s spoken with start by analyzing nightly batches of their SQL database. Over time, they want to analyze what’s happening in real time: they want to send the same UPDATE statements to both their live SQL database and to the Spark cluster so they can run analyses as their data changes. However, as she notes, UPDATEs are simply not supported in Spark.
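For illustration (again with made-up table and column names), the statement that works against the live OLTP database is simply rejected by Spark’s SQL parser:

```scala
spark.read.parquet("hdfs:///data/users").createOrReplaceTempView("users")

// Valid against the production SQL database, but Spark SQL (1.x / 2.0)
// has no UPDATE statement: this line throws a parse exception instead
// of mutating any rows.
spark.sql("UPDATE users SET email = 'new@example.com' WHERE user_id = 42")
```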

Serving concurrent requests. Here, the presenter notes that Spark will be slow when serving concurrent requests for external reports (most likely scan or aggregation OLAP queries). Spark executes job requests sequentially, so if many clients submit jobs simultaneously, they will start to pile up and perform slowly.

Searching the database. Again, like some of the earlier issues discussed, Spark will go out to the file system and churn through each file to deliver results to search queries.
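A last small sketch of the same problem: a text search that an indexed database could answer quickly is, in Spark, another full scan of every file (names are illustrative):

```scala
spark.read.parquet("hdfs:///data/logs").createOrReplaceTempView("logs")

// No secondary or full-text index exists over the files, so this LIKE
// predicate forces Spark to read and filter every row.
spark.sql("SELECT * FROM logs WHERE message LIKE '%timeout%'").show()
```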

In summary, Spark is bad at:
- Random Access
- Frequent Inserts
- Performing UPDATEs
- Concurrency
- Searching

So when the customers she encounters in the field try to use Spark with the patterns above, it sounds like what they want is a Spark that acts “like their father’s database.” A fair question worth asking is, “Why are people trying to use Spark as a traditional database when there are already several such databases out there?”

The answer lies in the fact that customers have for years been trying to work with mixed workloads, combining traditional database functions with newer data processing techniques like streaming, elastic batch processing, and, most recently, machine learning. Before Apache Spark became the de facto platform for in-memory big data processing, and before every database built a connector to Spark, there was no unified platform that let you combine these newer workloads with traditional database workloads. Combining them lets end users build mission-critical applications that do OLTP, run SQL and MapReduce-style analytics, and handle streaming and machine learning on a single platform.

As it turns out, Spark can now be used just like our father’s database. SnappyData tightly integrates Spark with a distributed, scale-out SQL database (GemFireXD), so that using Spark as if it were also a database is far easier and faster than using Spark with HDFS or with a connector to a separate database. SnappyData fuses the Spark executors and the data servers into the same JVM processes, which brings a host of benefits, including addressing every issue raised in the talk. SnappyData can perform:
- Fast random access
- Fast querying of rapid inserts
- UPDATEs and mutations on data
- Extremely high concurrency
- Fast searching with indexes

Moreover, all of these benefits are accessed in almost exactly the same way you would use plain Spark SQL: we initialize a SnappyContext by passing it the SparkContext, and from that object we call a .sql method with strings of SQL; everything returned is just a DataFrame. Additionally, because of that tight coupling within the same JVM, SnappyData avoids the extra serialization of going over a connector whenever you hit the database, so it should consistently outperform a connector-based setup. Finally, we’ve extended Spark to make executors long-running and to make the driver node highly available, which keeps the entire Spark system inside SnappyData highly available. These are just a few of the benefits we’ve added to the Spark ecosystem.
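As a rough sketch of the SnappyContext flow described above (assuming a running SnappyData cluster and a shell where `sc` is the SparkContext; the table name, schema, and DDL options are illustrative rather than a complete reference):

```scala
import org.apache.spark.sql.SnappyContext

// SnappyContext wraps the existing SparkContext; SQL goes through the
// same .sql(...) entry point and comes back as ordinary DataFrames.
val snc = SnappyContext(sc)

// A row table hosted in the same JVMs as the Spark executors, so reads
// and writes skip the connector's extra serialization hop.
snc.sql("CREATE TABLE users (user_id INT PRIMARY KEY, email VARCHAR(100)) USING row")

snc.sql("INSERT INTO users VALUES (42, 'old@example.com')")
snc.sql("UPDATE users SET email = 'new@example.com' WHERE user_id = 42")  // mutations work

val row = snc.sql("SELECT * FROM users WHERE user_id = 42")  // fast point lookup
row.show()
```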

With SnappyData, Spark behaves just like a traditional SQL database. So if you’re trying to use Spark in one of the ways mentioned above, or you’ve wired Spark up to several other products to work around those issues, consider trying SnappyData, which can lower your complexity. SnappyData is available for download.

Learn more and chat about SnappyData on any of our community channels:

Stackoverflow
Slack
Mailing List
Gitter
Reddit
JIRA

Other important links:

SnappyData code example
SnappyData screencasts
SnappyData twitter
SnappyData technical paper

The Spark Database

SnappyData is Spark 2.0 compatible and open source. Download now