The Numerical and Insightful Blog: What is different about Big Data?

Tuesday, June 2, 2015

What is different about Big Data?

What is different about Big Data? Other than that it is, er, big?

This is a question that I am frequently asked, most often by people who are plainly skeptical about Big Data simply because it is hugely oversold these days. At the recent Annual Signal and Image Sciences Workshop (more about this later in this blog), one of those skeptics remarked to me that there is a certain sameness to all the articles on Big Data found on the Internet. The first, and often the only, takeaway you get out of those articles is that Big Data is big. Also, the authors of those articles readily give in to the temptation to create clever phrases using the word “big.” The Big Data & Analytics Special Interest Group was inaugurated with a great presentation titled Big Lies and Small Truths. In many articles, the apparently big benefits are touted, mixed in with cautionary mention of some other things Big: Infrastructure, Expense, Dangers, Fatigue and Brother.

So, what is really different about Big Data? To me, the most qualitatively distinguishing aspect of the data in Big Data is this: It has quietly come into existence, with no one in particular having intentionally created it. Where in the past, data creation was a purposeful and protocol driven activity, today vast amounts of data have just come into existence by virtue of the fact that all of us choose to live and conduct business digitally. We do not stop to wipe off fingerprints, we let the soda dribble, we do not sweep away breadcrumbs. It is this nature of its creation, the largely unintentional, evolving, uncurated, uncontrolled and imperfect nature of Big Data that distinguishes it from the more familiar data. The bigness is not the essential feature, even if bigness is inevitable. This type of data quietly accumulated for years till, quite suddenly, we woke up to the Big Possibilities.

However, Big Problems arise due to the fact that the data comprising Big Data does not originate from projects specifically designed to facilitate statistical inference and prediction. This brings us back to the event of May 13, the Annual Signal and Image Sciences Workshop hosted and sponsored by Lawrence Livermore National Laboratories. Unlike most forums in which we hear about problems and innovative solutions that arise due to the bigness of the data, this workshop featured several talks highlighting its more fundamental aspects.

The difficulties that must be resolved before truly significant insights can be obtained from Big Data are many. These require an understanding of the fundamental limits to information extraction and the mathematical trade-offs involved in creating algorithms. The workshop offered a chance to hear about the relevance of kernel PCA, spectral clustering, Johnson-Lindenstrauss lemma, and the Chinese restaurant process. The last has nothing to do with General Tso’s Chicken, which delicacy GrubHub’s elementary Big Data analysis tells us is the most popular Chinese dish in the USA.

Is there anything else that makes Big Data fundamentally different? The answer is yes. Compared to data sets studied traditionally, the data sets in most Big Data scenarios are of high dimension. And also high, of course, is the number of high dimensional observations available in the data set. Dimension, as in 2D and 3D, is a familiar concept. But with a little mental stretching, we can understand the dimension of a data set as the number of variables that are measured for each observation.

High dimensionality is troublesome even in scenarios with moderate sized data sets. When there are too many directions, which are what dimensions are, we don’t know which way to look. For instance, take a look at the Communities and Crime Data Set available in the Machine Learning Repository maintained by the University of California at Irvine. This is a 122-dimensional data set of size 1994. And this doesn’t even qualify as Big Data.

If you need any convincing that data sets can look meaningless except when looked at in just the right way, take a look at the 3D data set I specially prepared for this blog. Each data point in the set is a 3D point fixed in a cubical volume of space. One possible way to present such a data set on a 2D screen is this: imagine placing the collection of points on a slowly spinning turntable and simply watch till the points align in a way that makes sense.

The complexity in real-life high dimensional data hides the meaning really deep within the data. You (or more correctly, your algorithm) would not only need to figure out exactly from what viewpoint to look, but would also need to design funny mirrors to undo nonlinear distortions in the data.

Funny things happen in high dimensional spaces. For example, it is quite rare for two random lines drawn on a page (2D) to turn out to be perpendicular to each other. However, in the high dimensional space of many Big Data sets it is practically a given that any two random vectors will be almost perpendicular to each other. In dealing with data sets of high dimensionality, the data scientist is forced (actually the good data scientist is delighted) to think about issues like noise accumulation, spurious correlations, incidental homogeneity and algorithmic instability which are rare considerations in working with traditional data. These are important issues because at the end of the day the data scientist is expected to help arrive at decisions from the data, or at very least help decision makers better understand high dimensional data and make informed decisions.

“Solving a problem simply means representing it so that the solution is obvious,” Nobel laureate Herbert Simon said it best. Large-scale data visualization is a growing research topic because high dimensional data is growing while our computer screens are not. Ideally, effective visualization algorithms highlight patterns and structure, and remove distracting clutter. Since human perception is limited to three-dimensional space, the challenge of visualizing high dimensional data is to optimally reduce the data and present it in a manner comprehensible to the user while preserving as much relevant information as possible.

Unfortunately, having so many dimensions (and the proliferation of tools being created by big.data.dot.coms) to play with opens up almost limitless ways to look at data. Anyone who wants to find something will find something. On the other hand, it is rarely lack of data that has resulted in bad consequences or policies. Domain expertise remains valuable despite all the data readily available. Astute practitioners place great importance on beginning with posing good questions rather than diving into the data headfirst. The questions shift the focus back to the purpose of utilizing big data – or what should be the purpose – gaining tangible awareness and understanding of real-world behaviors or conditions that lead to an enrichment of society.

[This post first appeared on the website of the IEEE Consultants Network of Silicon Valley.]