Approximate is the New Precise
Do we always need to know the exact answer to every question, especially if that answer is slow to compute? Or is it far more valuable to have a nearly correct answer, based on a clear understanding of the past, delivered much more quickly? If you are measuring how many units of a product sold per year or month to find the most popular, do you really care about the precise number? If you are analyzing click-through rates on web pages for ad placement, what you care about is the trend and how it is changing, or will change, over time. Data analysts and scientists know this well and routinely sample large data sets to characterize the data and build models.
Sampling is used pervasively in data mining for efficiency. What is required is more intelligence about the common dimensions used by queries, along with statistical guarantees that rare sub-groups of data (groups with low representation) are also sampled. Stratified sampling is one answer: this method ensures that each stratum has a proportional chance of being sampled. There are many variations of these sampling algorithms, all well covered in the literature. As statisticians point out, the trick is to ensure that the sample is an unbiased representation of the population, so that computations performed on it yield results very close to the exact answer, with variance that is statistically accurate and entirely predictable.
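To make the idea concrete, here is a minimal sketch of stratified sampling in Python. The data, the `stratified_sample` helper, and the 1% fraction are illustrative assumptions for this post, not part of any particular engine:

```python
import random
from collections import defaultdict

def stratified_sample(rows, strata_key, fraction, seed=0):
    """Sample `fraction` of the rows from each stratum, so that rare
    groups are represented in proportion to their size rather than
    being missed entirely by a uniform sample."""
    random.seed(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[strata_key(row)].append(row)
    sample = []
    for members in by_stratum.values():
        # every stratum contributes at least one row
        k = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, k))
    return sample

# 10,000 "common" events and only 100 "rare" ones
rows = [("common", i) for i in range(10_000)] + [("rare", i) for i in range(100)]
sample = stratified_sample(rows, strata_key=lambda r: r[0], fraction=0.01)
# the rare stratum is guaranteed to appear in the 1% sample
```

A plain uniform 1% sample of the same data would miss the rare group entirely about a third of the time; stratifying on the group key removes that risk.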
(Big data, when visualized, is effectively sampled: you cannot tell the difference between a 100% accurate answer and one that is only 90% accurate. The graphic from Gizmodo above makes the point.)
Sampling as a cost-reduction technique has intuitive semantics. The techniques used to compute aggregations (say, a count, sum, or average) over a sample are typically straightforward and well researched. In practice, however, data is often skewed, and the algorithm may sample disproportionately to capture outliers. In that case, you have to attach weights to the data elements so that computations like "sum" return appropriate results. What becomes obvious is the need to compute an error bound for every answer.
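A hedged sketch of how weights and error bounds fit together: the first function shows the standard inverse-probability (Horvitz-Thompson) weighting for a skewed sample, and the second estimates a sum from a uniform sample with a normal-approximation confidence half-width. The function names and the 95% z-value are illustrative choices, not any product's API:

```python
import math
import random

def weighted_sum_estimate(sample, inclusion_probs):
    """Weight each sampled value by the inverse of its inclusion
    probability, so a deliberately skewed sample still yields an
    unbiased estimate of the population sum."""
    return sum(v / p for v, p in zip(sample, inclusion_probs))

def approx_sum_with_bound(population, fraction, z=1.96, seed=0):
    """Estimate sum(population) from a uniform sample, returning the
    estimate plus a ~95% confidence half-width from the sample variance."""
    random.seed(seed)
    n = max(2, int(len(population) * fraction))
    sample = random.sample(population, n)
    weight = len(population) / n  # each sampled row stands in for this many rows
    estimate = sum(sample) * weight
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = z * len(population) * math.sqrt(var / n)
    return estimate, half_width

population = list(range(100_000))
est, bound = approx_sum_with_bound(population, fraction=0.01)
exact = sum(population)
# `est` lands within `bound` of `exact` roughly 95% of the time
```

The point of returning `(estimate, bound)` as a pair is exactly the observation above: an approximate answer without an error bound is not actionable.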
It is time for these synopsis techniques to be fully incorporated into current big data stores. Big data solutions must build broad statistics on the exact data volumes and on frequently executed queries: their filter conditions and the common dimensions used for analytics.
In the SnappyData platform, developers specify commonly used query dimensions and the proportion of the exact data that should be retained in-memory. SnappyData creates synopsis tables based on this input that concurrent users can query, expecting responses that are orders of magnitude faster than executing the query over the complete data set. These queries can carry user-specified error or latency bounds, and the synopsis tables can be continuously updated by incoming streaming data. Consider the class of Internet of Things use cases in which data is constantly streaming in from sensors. A conventional deployment to support this case might be a 10TB, 50-node cluster. Using SnappyData's synopsis tables, and assuming a sample ratio of 1-2% of the original data, the system can be shrunk down to only a few nodes and 100GB of memory while continuing to ingest data and support the same volume of concurrent access.
Stratified sampling with well-defined error bounds represents a core part of our thinking, and we augment it with probabilistic data structures like histograms and sketches to provide even quicker answers to a smaller subset of problems.
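To illustrate the sketch idea, here is a toy Count-Min sketch. The width, depth, and hash construction are arbitrary choices for the example; a production sketch would size them to a target error and space budget:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: approximate per-item counts in
    sub-linear space. Estimates never undercount; hash collisions
    can only inflate them."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # one independent-ish hash per row, via a per-row salt
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(),
                                salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(h[:8], "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # the minimum across rows is the least-inflated counter
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("page_a")
cms.add("page_b", 7)
# estimate("page_a") is at least 1000, and at most 1000 plus collision noise
```

Unlike a sample, the sketch sees every event, so it answers frequency questions ("how often was page_a hit?") with a one-sided error guarantee, at the cost of answering only that narrow class of questions.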