A Big Dud on Big Data

The NYTimes asks the question "Is Big Data an Economic Big Dud?" and gets the entire discussion wrong in three critical ways.

Terrible Definition for Big Data

Not everything that happens online or in the cloud is related to Big Data. In particular, the article seems to equate YouTube videos with Big Data. The word "data" has essentially become a meaningless epithet applied to all things digital. This is a horrible practice perpetrated by people who lack a tech background. So here's a handy guide for all the journalists who shunned techie courses in favor of the fun times spent taking humanities courses.

The way to decide if something constitutes "Big Data" is to ask if it's data, and if so, how big it is. Just because something is digital, or because it is the input to a computer program, does not make it "data" in the "Big Data" sense. The question is: What is the information content here, and how big is it?

Quantifying information content is actually a complex and deep subject, the domain of an intriguing field called Information Theory, but I'll skip the complexities of Info Theory, not mention Shannon, avoid defining entropy, and describe a simpler technique that anyone can use: simply summarize in English what information is encoded in the data.

For instance, imagine the universe, so rich and full of information that it defies a summary. The universe, or even a smallish fraction of it whose inner workings we cannot currently summarize with our knowledge base, would be the subject of Big Data. Imagine the data collected by the Square Kilometre Array, petabytes collected per day through thousands of radio receivers sprawled over an enormous chunk of land; that's Big Data. The information collected by these radio telescopes might even contain traces of other civilizations. Imagine all the biomolecular information collected at various laboratories around the world; again, Big Data, holding secrets to countless drugs. Imagine the information encoded in all books published since the 1600s; Big Data. Imagine the information embodied in the movements of humans, and how it encodes all sorts of complex phenomena; Big Data.

Now imagine a dumb cat video on YouTube. It can be summarized in 14 bytes as a "dumb cat video." It's not Big Data, no matter how many times it is downloaded.
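For the curious, here is a rough sketch of the gap between size and information content. It counts the bytes in the English summary, then uses zlib compression as a crude stand-in for information content. That stand-in is my assumption for illustration only: compression measures statistical redundancy, not the semantic summary described above, and the two 1 MB payloads are hypothetical placeholders rather than real video or telescope data.

```python
import os
import zlib


def summary_size_bytes(summary: str) -> int:
    """Size of an English summary when encoded as UTF-8."""
    return len(summary.encode("utf-8"))


def compressed_size_bytes(payload: bytes) -> int:
    """Compressed size in bytes; a crude proxy for information content."""
    return len(zlib.compress(payload, 9))


# The summary from the text.
cat_summary_bytes = summary_size_bytes("dumb cat video")

# Hypothetical stand-ins: 1 MB of pure repetition vs. 1 MB of random bytes.
repetitive = b"meow" * 250_000
random_like = os.urandom(1_000_000)

repetitive_compressed = compressed_size_bytes(repetitive)
random_compressed = compressed_size_bytes(random_like)

print("summary bytes:", cat_summary_bytes)
print("repetitive 1 MB compresses to:", repetitive_compressed)
print("random 1 MB compresses to:", random_compressed)
```

Both payloads are the same size on disk, yet the repetitive one shrinks to a tiny fraction while the random one barely shrinks at all. Raw size says little about how much there is to learn.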

Terrible Metric for Impact

The metric that people use for economic impact is often GDP. Indeed, the article measures Big Data's impact by how much it grows the economy. Yet even economists will admit freely that the GDP is at best a misleading metric for progress.

The textbook illustration of the failings of GDP as a progress metric is the "broken windows" example. If someone were to go around breaking perfectly fine windows, the GDP of a country would increase as everyone has to buy replacement glass, but almost no one, save for a few glass vendors, would be any happier or better off.

The GDP is an especially misleading metric for Big Data, as Big Data is often used to improve business efficiency, which is more likely to shrink the economy than to grow it in the short term. Imagine that Target studies the buying patterns of its customers so well that it only ships precisely as many items as will be sold, precisely on time -- the net effect will be a reduction in GDP!

Terrible Product Placement

There is a "submarine" below the fold, where there is a reference to a small player in the Big Data space. Even though the company is actually all about improving efficiency, the article doesn't pick up on this fact, and goes on to talk about GDP growth, leaving the reader wondering why this reference was dropped in the first place.

There are countless companies in the Big Data space. Surely, the NYT can afford a few phone calls.

Terrible Framework

Overall, the discussion falls far short of the mark. Big Data, properly defined, is clearly not a dud -- it has the potential to improve our lives in immeasurable ways, through drug discovery, individualized medicine, more efficient business practices, and many other applications across all aspects of science, every branch of engineering, and even, with the resurgence of quantitative methods in the humanities, the liberal arts. But if we let the word get misappropriated, if we blithely apply it to everything digital such that the term loses its meaning, kind of like IBM's erstwhile "autonomic computing," then it is guaranteed to become a meaningless bandwagon that denotes whatever the author feels like that day, and it is guaranteed to fall short of expectations.

