Most enterprises use the public cloud in some form or another, but most have been slow to migrate big data and data warehouses to it. While security and governance are the first concerns that come to mind, there are other equally important challenges. I walk through two of them below.
The data transfer challenge
How do you move a data lake into a compute cloud, so analytical workloads can run efficiently? Is this really feasible? Perhaps if your lake is more of a puddle, one could evaporate it. ‘It is quite challenging’ would be an understatement.
Let’s try and break down the problem. The first question is: how long does an upload take? Then, how often do you upload, and what is the best approach? The answers may be more surprising than you think. Even with a sustained 100 Mbps (megabits per second) uplink - which is optimistic for most ISP connections to begin with - uploading 1 TB of data takes nearly an entire day. Uploading 10 TB takes a mind-boggling 222 hours, more than nine days. And this assumes nothing goes wrong with your connection.
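The arithmetic behind these numbers is simple line-rate math (decimal units, zero protocol overhead, both optimistic assumptions):

```python
# Back-of-the-envelope upload times at a sustained 100 Mbps uplink.
# Assumes decimal units (1 TB = 10^12 bytes) and no protocol overhead.

def upload_hours(terabytes, mbps=100):
    bits = terabytes * 1e12 * 8          # total bits to move
    seconds = bits / (mbps * 1e6)        # sustained line rate
    return seconds / 3600

print(round(upload_hours(1), 1))    # 22.2 hours for 1 TB
print(round(upload_hours(10), 1))   # 222.2 hours for 10 TB
```

Real-world numbers would be worse: TCP overhead, shared links and retries all push the totals higher.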
With delays of this magnitude, any insight will be quite stale by the time it arrives. As you can see, the conventional wisdom of using your ISP connection to move data quickly breaks down with even a few TBs of data.
A more practical option is to physically ship the data on tape (or SSDs or hard disks). This turns out to be a much faster alternative once your data exceeds a few TBs and you don’t have access to very high-speed dedicated networks. The chart below shows the stark difference in time taken. If you are curious about this analysis, you will find the details published in this ACM Queue article.
Even after picking a viable approach for data transfer, enterprises have other challenges to overcome. One common-sense approach is to bulk transfer once and then periodically push the “delta” (data updates). Unfortunately, this works only if the underlying data sources support some form of CDC (change data capture). And often you have to resort to maintaining many copies of the data - the data silo problem permeates most large enterprises.
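To make the bulk-plus-delta pattern concrete, here is a minimal in-memory sketch (the event format and names are illustrative; real CDC would tail a database transaction log):

```python
# Minimal sketch of "bulk transfer once, then push deltas".
# A list of (op, key, row) change events stands in for a real CDC feed.

def apply_changes(replica, changes):
    """Apply CDC-style upserts/deletes to a key -> row replica."""
    for op, key, row in changes:
        if op == "upsert":
            replica[key] = row
        elif op == "delete":
            replica.pop(key, None)
    return replica

source = {1: "a", 2: "b"}
replica = dict(source)                      # one-time bulk transfer
delta = [("upsert", 2, "b2"), ("delete", 1, None), ("upsert", 3, "c")]
apply_changes(replica, delta)               # periodic, cheap delta push
print(replica)                              # {2: 'b2', 3: 'c'}
```

The point of the pattern is that only `delta` crosses the wire after the initial load - but without CDC at the source, there is no reliable way to produce that delta.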
Analytics is the most expensive workload in the cloud. Period
Analytic workloads are computationally very intensive. Most often, analytics over big data comes down to a few expensive primitive operations - scanning, filtering, joining, sorting and aggregations. If all your data can reside in memory, your analytic workload will be CPU bound.
Take a simple aggregate query (shown below) on about 100 million records (~10 GB) using Spark 1.6, with all data cached in DRAM. This query will consume all 4 cores of a modern Intel i7 CPU and take about 6 seconds to execute. Now say our production data set is much larger - about 1 TB. We need a cluster of cloud machines to run our query. For simplicity, let’s ignore the overhead introduced by distributed data processing, assume all data remains in memory, and assume linear scaling with CPUs. Then, to achieve the same response time, we will need 100x the compute resources - i.e. 400 cores to run a single, relatively simple analytical query for a single user.
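The original benchmark query is not reproduced in this text; a representative aggregate of that shape (hypothetical table and column names) appears in the comment below, followed by the scaling arithmetic from the paragraph above:

```python
# Representative aggregate of the kind discussed (table/column names are
# illustrative, not the original benchmark query):
#   SELECT city, AVG(amount) FROM transactions GROUP BY city
#
# Scaling arithmetic: 10 GB pegs 4 cores for ~6 s; at the same response
# time, 1 TB (100x the data) needs ~100x the compute under linear scaling.

cores_10gb = 4
scale = 1_000 / 10           # 1 TB / 10 GB = 100x more data
cores_1tb = int(cores_10gb * scale)
print(cores_1tb)             # 400 cores for one query at the same latency
```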
Increasingly, we find data scientists and explorers work as collaborative teams and ask insightful questions on such massive data sets in an iterative manner - they could be training a prediction model, uncovering patterns in data, etc. In such collaborative environments, questions are ad-hoc and the analytic system must scale to support multiple concurrent users.
Now, we already know that a single query would peg all 400 cores. This implies we will need many more cores to support concurrent users at an acceptable response time. Say we want to run 10 equivalent queries concurrently (supporting many more users); we would need 4,000 cores for a TB of data.
If such a workload were deployed on AWS and run on compute-optimized c4.8xlarge machines (assume each has a capacity of 36 cores, even though with hyperthreading you may in reality get the equivalent of about 18), it would cost about $186 per hour (4000/36 * $1.675) to run the cluster. If, on average, we run for about 40 hours per week, that is about $30K per month and more than $350K per year. See the EC2 pricing chart below (as of Oct 2016) for compute-optimized instances.
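The cost arithmetic, spelled out (using the Oct 2016 on-demand price quoted above):

```python
# AWS cost arithmetic for the 10-concurrent-query scenario.
cores_needed = 4000
cores_per_instance = 36          # c4.8xlarge vCPUs
price_per_hour = 1.675           # USD on-demand, Oct 2016

instances = cores_needed / cores_per_instance   # ~111 machines
hourly = instances * price_per_hour

weekly_hours = 40
yearly = hourly * weekly_hours * 52

print(round(hourly))             # ~$186 per hour
print(round(yearly / 1000))      # ~$387K per year
```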
And none of this analysis takes into account the resources needed to stand up your applications, or the licenses required to run commercial cloud services (e.g. a Spark + Hadoop AWS service). You are easily looking at a seven-figure sum per year to support a rather low-end real-time analytics infrastructure.
Now, here is the obvious observation: the source of much of this pain is one thing - the size of the data. The larger it is, the more CPU horsepower, memory and software licensing you need.
So, is there a way to dramatically reduce the size of the data set and still answer the questions? Yes, there is - summarize the data (e.g. as multi-dimensional cubes, pre-aggregated data sets, etc.) and cache it in memory. This happens to be the primary approach used by scores of analytical tools today. Unfortunately, there are two big problems: you are likely analyzing stale data (computing these cubes is too expensive and done in batch), and it does not work for ad-hoc analytical queries.
Use statistics and probabilistic data structures
Data scientists usually circumvent the cumbersome nature of big data analytics for everyday tasks by working with uniform random samples of the data set and applying statistical algorithms to compute estimates.
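Here is a minimal illustration of that workflow with Python's standard library - the population is synthetic, and the confidence interval uses the usual normal approximation:

```python
# Estimate an aggregate from a uniform random sample, with a standard
# normal-approximation confidence interval. Population is synthetic.
import random
import statistics

random.seed(42)
population = [random.gauss(100, 15) for _ in range(100_000)]

sample = random.sample(population, 5_000)           # 5% uniform sample
est = statistics.mean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5  # standard error
ci_lo, ci_hi = est - 1.96 * se, est + 1.96 * se     # ~95% interval

true_mean = statistics.mean(population)
print(f"{est:.1f} +/- {1.96 * se:.1f}")             # estimate with error bar
```

Scanning 5% of the rows yields an answer within a fraction of a percent of the true mean - and, crucially, the error bar itself is computable from the sample alone.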
What if we could do something similar - incorporate intelligent sampling and statistical algorithms into a general-purpose in-memory columnar database to deliver fast responses that are nearly perfect? As we explored in previous blogs here and here, there is an abundance of use cases where you can make perfect decisions using nearly perfect answers.
Our system aims to do three things -
- Intelligently sample the data set on frequently accessed dimensions so we have good representation across the entire data set (stratified sampling). Queries can execute on samples and return answers instantly.
- Compute an error estimate for any ad-hoc query from the sample(s) with high confidence. Irrespective of the query, we can always compute the accuracy of the answer.
- Provide simple knobs for users to trade off speed for accuracy, i.e. simple SQL extensions so the user can specify an error tolerance for queries. When the query error exceeds the tolerance, the system automatically delegates the query to the source.
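The three capabilities above can be sketched together in a few lines. This is a hypothetical Python sketch of the decision logic, not the engine's actual SQL syntax or error model; the relative-error formula is the standard normal approximation:

```python
# Hypothetical sketch of the "error tolerance knob": answer from the
# sample when the estimated error is within tolerance, otherwise
# delegate to the full data set. Names and error model are illustrative.
import statistics

def answer(sample, full_data, tolerance):
    est = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    rel_err = 1.96 * se / abs(est)        # ~95% relative error bound
    if rel_err <= tolerance:
        return est, "sample"              # instant, approximate
    return statistics.mean(full_data), "source"   # exact, slower

full = list(range(1, 100_001))
sample = full[::100]                      # every 100th row, for the sketch
print(answer(sample, full, tolerance=0.05)[1])    # prints "sample"
```

With a tight tolerance (say 0.01), the same call falls through to the source and returns the exact answer - the user only ever turns the one knob.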
We refer to this capability as the Synopses Data Engine (internally it uses sampling and other probabilistic data structures).
Let’s go back to our two aforementioned problems - data movement to the cloud, and controlling the cost.
We propose that instead of migrating the full data into our cloud environment, we manage only “data synopses” in the cloud - acquired continuously as new data arrives. The query engine computes the expected error rate from these samples using statistical algorithms. It answers directly from the samples if the error is within the acceptable tolerance (configured by the user), or else transparently routes the query to run on the source (full data set).
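One classic way to acquire a synopsis continuously is reservoir sampling, which maintains a fixed-size uniform sample over an unbounded stream. This is an illustrative sketch of that one technique; the actual engine also maintains stratified samples and other probabilistic structures:

```python
# Maintain a fixed-size uniform sample ("synopsis") as new data arrives,
# via classic reservoir sampling: each seen row survives with prob k/n.
import random

def reservoir(stream, k, rng):
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # replace with prob k/(i+1)
            if j < k:
                sample[j] = row
    return sample

rng = random.Random(7)
synopsis = reservoir(range(1_000_000), 1_000, rng)
print(len(synopsis))                      # fixed footprint: 1000 rows
```

A million-row stream collapses into a thousand-row synopsis with constant memory - which is exactly why far less data needs to cross the wire into the cloud.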
We have far less data to migrate, need fewer CPUs, get faster response times and of course, have to spend less.
Is this a panacea? Not yet, for complex workloads. Sampling and approximate query processing are still maturing, but they can pay huge dividends when used as an adjunct layer on existing infrastructure. This is exactly how we position our offering: users don’t have to think about where the data comes from; they simply turn an “error tolerance” knob for faster responses.
The figure below depicts this architecture.
If you are intrigued by this new capability and eager to give it a try then...
We are launching a FREE cloud service called iSight-Cloud so anyone can try our engine and give us feedback. You can try our simple demos in a visual environment or even bring your own data sets. To avoid AWS hardware costs, you can use our shared instances.
SnappyData iSight - Instant visualization using Data Synopses
iSight combines a visualization front end based on Apache Zeppelin with our data engine. With iSight, users get almost-perfect answers to analytical queries within a couple of seconds, while the full answer can optionally be computed in the background. Depending on the immediate answer, users can cancel the full execution early if they are satisfied with the almost-perfect initial answer and no longer interested in the final result. This can lead to dramatically higher productivity and significantly lower resource consumption in multi-tenant, concurrent workloads on shared clusters.
Check out the screencast of our demo on iSight:
In summary, while in-memory analytics can be fast, it is still expensive, and large clusters are cumbersome to provision. Instead, iSight lets you retain your data in existing databases and disparate sources, caching only a fraction of it using stratified sampling and other techniques. In many cases, data explorers can run high-speed interactive analytics over billions of records from their laptops. Unlike existing optimization techniques based on OLAP cubes or in-memory extracts, which consume lots of resources and only work for presumptive queries, the SnappyData synopses data structures are designed to work for any ad-hoc query.