Unify Data Sources
Unified Data Access
Through its colocation with Spark, SnappyData provides efficient loading and synchronization across disparate sources such as Parquet, Hive, JSON/XML document databases, NoSQL databases, relational databases, stream sources, and text files. Check out Spark Packages for an up-to-date list of available connector libraries.
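As a minimal sketch of this unified access, the snippet below reads three of these source formats through the same DataFrame abstraction; the file paths and the `event_id` join column are illustrative, not part of any real dataset:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("UnifiedDataAccess")
  .getOrCreate()

// Each format is loaded through the same DataFrameReader interface
val parquetDF = spark.read.parquet("/data/events.parquet")   // columnar files
val jsonDF    = spark.read.json("/data/events.json")         // semi-structured documents
val csvDF     = spark.read
  .option("header", "true")
  .csv("/data/events.csv")                                   // delimited text

// Once loaded, all three can be queried and combined uniformly
parquetDF.join(jsonDF, "event_id").show()
```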
Faster, easier than traditional ETL
While most databases on the market today offer integrations with external data sources, they are limited to structured relational sources and often require ETL tools to transform and cleanse the data before it is loaded.
Through Spark, SnappyData tables can be populated directly from a wide variety of structured as well as semi-structured sources. Spark fundamentally changes the traditional ETL-based architecture in two ways:
- Parallel access: Most modern data platforms are distributed in nature and offer parallelized access. SnappyData computes the optimal concurrency based on the available partitions in the target table and loads from the underlying data source in parallel.
- Rich built-in transformation APIs: Spark ships with a rich set of transformations (map, filter, flatMap, etc.) that work with any of its data sources. Support for lambda functions and access to the library from a variety of programming languages make it easy for developers and data scientists alike.
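The two points above can be sketched together in a few lines of Spark code: a semi-structured source is read, cleansed with built-in transformations, and written out, with Spark parallelizing each step across the source's partitions. The input path and column names here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("InlineTransform").getOrCreate()

// Read a semi-structured source; Spark infers the schema from the JSON records
val raw = spark.read.json("/data/orders.json")

// Filter and transform in place of a separate ETL stage; each transformation
// runs in parallel across the partitions of the underlying source
val cleaned = raw
  .filter(col("amount") > 0)                       // drop invalid rows
  .withColumn("amount_usd", col("amount") * lit(1.1))
  .select("order_id", "amount_usd")

cleaned.write.parquet("/data/orders_clean.parquet")
```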
Unification through the Data Sources API
The Data Sources API provides a pluggable mechanism for accessing structured data in Spark SQL.
Data sources can be more than just simple pipes that convert data and pull it into Spark. The tight optimizer integration provided by this API means that filtering and column pruning can be pushed all the way down to the data source in many cases. Such integrated optimizations can vastly reduce the amount of data that needs to be processed and thus can significantly speed up processing.
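One way to see these pushed-down optimizations, assuming a Parquet source with hypothetical `region` and `revenue` columns, is to inspect the physical plan: the scan node reports the pushed filter and a pruned read schema, confirming that only the needed columns and rows are fetched from the source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("Pushdown").getOrCreate()

val df = spark.read.parquet("/data/sales.parquet")

// The filter and the two referenced columns are pushed into the Parquet scan;
// explain() shows them under PushedFilters and ReadSchema in the plan output
df.filter(col("region") === "EMEA")
  .select("region", "revenue")
  .explain()
```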
Using a data source is as easy as referencing it from SQL (or your favorite Spark language):
CREATE EXTERNAL TABLE MyParquetData USING parquet OPTIONS (path "parquet folder")

-- Import the external data into any SnappyData table:
CREATE TABLE SnappyTable USING column AS (SELECT * FROM MyParquetData)

-- Execute federated queries across in-memory and external tables:
SELECT * FROM SnappyTable t1, MyParquetData t2 WHERE <join condition on t1, t2>
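The same flow can be driven programmatically. The sketch below assumes a SnappyData cluster is reachable, and the Parquet path and `id` join column are illustrative:

```scala
import org.apache.spark.sql.{SnappySession, SparkSession}

val spark = SparkSession.builder().appName("SnappyFederated").getOrCreate()
val snappy = new SnappySession(spark.sparkContext)

// Register the Parquet folder as an external table
snappy.sql("CREATE EXTERNAL TABLE MyParquetData USING parquet OPTIONS (path '/data/parquet')")

// Materialize it into an in-memory SnappyData column table
snappy.sql("CREATE TABLE SnappyTable USING column AS (SELECT * FROM MyParquetData)")

// Federated join between the in-memory and external tables
snappy.sql("SELECT * FROM SnappyTable t1 JOIN MyParquetData t2 ON t1.id = t2.id").show()
```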
Another strength of the Data Sources API is that it gives users the ability to manipulate data in all of the languages that Spark supports, regardless of how the data is sourced. For instance, data sources that are implemented in Scala can be used by pySpark users without any extra effort. Furthermore, Spark SQL makes it easy to join data from different data sources using a single interface.