ETL has always been a big part of prepping data for analytics. Its origins go back to the development of relational databases, which were optimized in normal form for the reads and writes of transactional systems. Once data rolled off the transactional system, it had to be decomposed into a form more suitable for read-only analytics. The ETL process became a popular concept in the 1970s. Data extraction pulls data from the source data store; data transformation converts the data to the format or structure needed for querying and analysis; data loading writes the transformed data into its final analytics destination, such as an operational data store, data mart, or data warehouse. These processes were batch oriented and complex; chewing gum and baling wire were very much the order of the day. Latency was never a consideration because these processes were always monolithic and offline.
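The three phases can be sketched in a few lines. This is a purely illustrative toy, not any vendor's tool: the table names and the Fahrenheit-to-Celsius transform are hypothetical, and SQLite stands in for both the transactional source and the warehouse.

```python
import sqlite3

def extract(conn):
    # Extract: pull raw rows from the transactional source.
    return conn.execute("SELECT city, temp_f FROM readings").fetchall()

def transform(rows):
    # Transform: reshape the data for analytics (convert F to C).
    return [(city, round((f - 32) * 5 / 9, 1)) for city, f in rows]

def load(conn, rows):
    # Load: bulk-insert into the analytics destination.
    conn.executemany("INSERT INTO warehouse_readings VALUES (?, ?)", rows)

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (city TEXT, temp_f REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?)",
                   [("Oslo", 32.0), ("Cairo", 95.0)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE warehouse_readings (city TEXT, temp_c REAL)")

load(target, transform(extract(source)))
print(target.execute("SELECT * FROM warehouse_readings").fetchall())
# [('Oslo', 0.0), ('Cairo', 35.0)]
```

Even at this scale the shape of the problem is visible: the whole job runs offline, end to end, before any query can see the data.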
As database products proliferated, so did the list of ETL products that serviced the market. Depending on whether you were an Oracle, IBM, or Microsoft shop, you had a choice of ETL products, but the product stack remained roughly the same, and the cost, complexity, and latency associated with doing ETL remained largely constant. Vendors like Talend and Informatica strove to provide connectors to a wide variety of sources. All of these products essentially run batch jobs and are not suitable for real-time ETL. They were born in an era when distributed computing was not the norm, so they were not built from the ground up to be horizontally scalable. The expectation was that these were batch processes; the focus was on getting the job done, and specialized knowledge was considered a prerequisite for using these tools.
The diagram below shows how the ETL process, as executed today, is inherently a batch process: complex, high latency, and involving multiple user-kernel crossings.
There is clearly a need to reduce the complexity of operating and monitoring ETL processes; ultimately, getting the data to its analytics destination in a timely manner, without errors or data loss, is paramount.
With the advent of big data streaming engines like Spark, Samza, and Flink, there has been renewed interest in doing ETL, this time without the complexity or latency, and in real time. Customers asked us to craft a solution that was less messy, used standard interfaces, cut down latency, and allowed real-time transformation of data, enabling quicker insights on the business closer to where the events occurred.
Introducing SnappyData Agile ETL:
Doing real-time streaming ETL from operational stores requires a system with three important capabilities:
- The ability to read data from relational and non-relational data sources without impacting the throughput of those systems, typically using change data capture (CDC) mechanisms. Given the myriad data sources from which data can be read, the ability to capture changes at the source is critical.
- The ability to stream the data into a streaming engine and apply rich transformations on it so as to make it consumable for analytics in the target system.
- The ability to write transformed data out to target systems using standard connectors which can do bulk writes. Messaging providers can also be used to route the data to its final destination.
In addition, the system needs a metadata repository that is updateable using standard interfaces like SQL (or Spark).
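To make the metadata-repository idea concrete, here is an illustrative sketch of what "updateable via plain SQL" looks like. SQLite stands in for the repository, and the `transform_metadata` table and its columns are hypothetical names invented for this example, not any product's actual schema.

```python
import sqlite3

# SQLite stands in for a SQL-accessible metadata repository.
repo = sqlite3.connect(":memory:")
repo.execute("""
    CREATE TABLE transform_metadata (
        source_table  TEXT,
        source_column TEXT,
        target_column TEXT,
        target_type   TEXT
    )""")

# Register a column mapping that a streaming transform looks up at runtime.
repo.execute("INSERT INTO transform_metadata VALUES (?, ?, ?, ?)",
             ("orders", "ord_ts", "order_time", "TIMESTAMP"))

# The stream application queries the repository with ordinary SQL.
mapping = dict(repo.execute(
    "SELECT source_column, target_column FROM transform_metadata "
    "WHERE source_table = 'orders'").fetchall())
print(mapping)  # {'ord_ts': 'order_time'}
```

Because the repository speaks SQL, operators can inspect and update transform rules with the tools they already know, rather than editing job-specific configuration files.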
SnappyData offers a robust in-memory database that is accessible via ODBC, JDBC, and REST interfaces. Data stored in SnappyData can be persisted in the cluster or on S3, Azure Storage, etc. SnappyData offers in-database streaming, which allows streaming applications to read data from sources like SQL Server, Oracle, and MySQL in a highly parallel manner. Data sources can also be HDFS or anything that writes to Kafka, since SnappyData treats Kafka as a streaming source.

The streaming application performs rich transformations on the stream to convert the data to the format in which the target system can consume it, using the SnappyData repository to store the metadata needed to complete these transforms. It then uses Spark connectors to write the data out to any target that has a Spark connector defined. These writes can be partitioned across the SnappyData cluster, leading to high throughput and low end-to-end latency for the ETL.

Most importantly, the system is scalable end to end: new sources can be added by configuring new streaming applications to run in the cluster (source scalability), the writes to the target system are horizontally scalable (sink scalability), and the cluster metadata itself can grow without bounds without increasing the cost of doing stream transformations.
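The read-transform-bulk-write flow can be modeled in miniature. To be clear, none of the names below are SnappyData or Spark APIs; this is a self-contained sketch in plain Python, where a generator stands in for the CDC/Kafka source and a list stands in for a bulk write through a connector.

```python
from itertools import islice

def cdc_events():
    # Stand-in for a CDC/Kafka source: a stream of change rows.
    for i in range(7):
        yield {"op": "insert", "id": i, "amount_cents": i * 100}

def transform(event):
    # Convert the change row to the target system's shape.
    return {"id": event["id"], "amount_usd": event["amount_cents"] / 100}

def microbatches(stream, size):
    # Group the transformed stream into fixed-size microbatches.
    it = iter(stream)
    while batch := list(islice(it, size)):
        yield batch

sink = []  # stand-in for the target system
for batch in microbatches(map(transform, cdc_events()), size=3):
    sink.extend(batch)  # one bulk write per microbatch

print(len(sink), sink[0])
```

The point of the microbatch stage is that each flush is one bulk write rather than seven row-by-row writes, which is what keeps end-to-end latency low without sacrificing sink throughput.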
With SnappyData, real-time streaming ETL becomes a robust, easy-to-use system that leverages all the work that has been done in Spark.
Subscribe to our Live Analytics mailing list to find out when SnappyData agile ETL launches.
Advantages of SnappyData Agile ETL over other approaches
SnappyData Agile ETL offers significant advantages over traditional ETL tools: it uses real-time streaming to do the ETL, scales horizontally, and reduces both latency and complexity. SnappyData also offers advantages over other real-time streaming ETL products in the realm of complexity, cost, and performance. Details below.
No Source-Side Processes:
In most enterprises, DBAs do not want to install any process on the database machines beyond turning on CDC or enabling supplemental logging, which allows transaction-log-reader approaches to CDC. Alternative approaches that require CDC updates to be written out to a message bus are more intrusive and require adding more infrastructure onto mission-critical database machines. As an example, a leading ETL product that works with SAP Oracle databases requires log reader and listener processes on the database machine to read the database logs, which puts additional requirements on monitoring those components and consumes CPU, memory, and disk (all precious resources) on the database machine. Our approach primarily reads CDC data off the machine and directly into the stream, leading to fewer moving parts overall in the ETL process.
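As one concrete illustration of reading changes off the machine: SQL Server exposes captured changes through generated table-valued functions of the form `cdc.fn_cdc_get_all_changes_<capture_instance>`, which a remote client can query over an ordinary connection once CDC is enabled. The capture-instance name `dbo_orders` below is hypothetical, and the helper is just a sketch of how such a poll query might be built.

```python
def cdc_poll_query(capture_instance: str) -> str:
    # Reads all changes between two log sequence numbers (LSNs),
    # supplied as T-SQL variables that the caller binds.
    return (
        f"SELECT * FROM cdc.fn_cdc_get_all_changes_{capture_instance}"
        "(@from_lsn, @to_lsn, N'all')"
    )

print(cdc_poll_query("dbo_orders"))
```

Nothing here runs on the database host itself; the only prerequisite is that CDC is switched on, which is exactly the low-intrusion posture most DBAs want.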
Leverages Existing Database Security:
For most data sources, SnappyData integrates with the existing security mechanisms in the database. This reuses existing security measures and does not open up a new access point on the database.
No Sink Connector Processes:
The stream application transforms the data and microbatches the changes out to the target system directly using Spark connectors. The sink system does not need to pull in the changes and have additional machinery installed on it, which leads to lower operational complexity and better performance. The transformed payload can be emitted to messaging providers as well for consumption by multiple endpoints.
Scalable Metadata Storage:
The SnappyData ETL engine is built into an in-memory datastore which offers a scalable metadata repository which is persisted and loaded upon startup. SnappyData offers robust monitoring capabilities that allow for streaming applications to be managed easily.
Fewer Processes, Better Performance:
Eliminating additional processes eliminates unnecessary user-kernel crossings, which cuts CPU costs, improves throughput, and improves system availability. The use of partitions for processing streams allows for a massive degree of parallelism, which improves performance significantly.
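Partitioned stream processing can be sketched as follows. The partition count, the key choice, and the use of threads are all illustrative assumptions, not SnappyData defaults; the point is that a stable hash keeps each key on one partition (preserving per-key order) while the partitions themselves are processed in parallel.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4

def partition_of(key: str) -> int:
    # Stable hash partitioning: a given key always lands on the
    # same partition, so per-key ordering is preserved.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

events = [{"key": f"cust{i % 6}", "seq": i} for i in range(12)]

partitions = defaultdict(list)
for e in events:
    partitions[partition_of(e["key"])].append(e)

def process(batch):
    # Stand-in for the per-partition transform + bulk write.
    return len(batch)

with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    counts = list(pool.map(process, partitions.values()))

print(sum(counts))  # 12: every event processed exactly once
```

Adding partitions (and workers) scales throughput roughly linearly, which is the property that makes the transform stage horizontally scalable.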
Studio Driven Configuration (Coming Soon):
SnappyData offers the Snappy Agile Studio which allows end users to configure sources and targets and provides them with the ability to store metadata needed to complete the transformations.
This article introduced the rationale for real-time streaming ETL and the advantages of the SnappyData approach to it. In future articles, we will explore the nuts and bolts of how we autoscale the solution, how we make the “Transform” layer pluggable, and how checkpointing and resumability are built into the solution. If you would like to be notified when SnappyData Real Time Streaming ETL goes GA, please subscribe to our “Live Analytics” mailing list.