Big Data: Here to Stay, but with Caveats
Wednesday, July 30, 2014
Criticism of big data is due to three paradoxes. For starters, it's ubiquitous but hard to define.
At first glance it’s strange that the idea of big data — often imagined simply as the use of software to explore and analyze unprecedented amounts of information — should be controversial at all. Haven’t we depended on it for years in our everyday lives? Haven’t credit card companies mined their vast records to prevent fraud by identifying suspicious purchase patterns? Don’t politicians of both parties analyze voting records down to the precinct level to identify opportunities for tipping elections? And don’t nearly all of us rely on Google and other search engines to exploit complex algorithms that identify the webpages that will be most relevant to our queries?
Isn’t big data the result of Moore’s Law, the exponential reduction of the cost of storing, processing, and retrieving information over decades? It shouldn’t be so surprising that smartphone software like the iPhone’s Siri should be able to answer many real-time questions, since current smartphones are more powerful than the original supercomputers. So why the attacks on big data even in some pro-technology strongholds like the Financial Times?
First, what is big data?
Big data breaks from the tradition of rigorous definition of mathematical techniques. That’s not a slander by the movement’s critics; it’s an acknowledgement by two of the foremost enthusiasts, Viktor Mayer-Schönberger and Kenneth Cukier, in their modestly titled Big Data: A Revolution That Will Transform How We Live, Work, and Think: “There is no rigorous definition of big data.” Big data is not, as I once believed, simply the next stage of the 200-year-old discipline of decision making known as scientific statistics. It is not simply the use of vast data sets, powerful processors, and sophisticated software to continue the work of pioneers like the late-18th-century Scottish entrepreneur William Playfair, who invented the bar graph and pie chart.
Even in the late 20th century, there was no bright line between computerized statistics and the new big data movement. Consider the example that may be most famous. In the 1990s, the general manager of the Oakland Athletics, Billy Beane, continuing the work of his mentor Sandy Alderson, built a winning team on a small budget. They used the statistical analysis of Bill James, called sabermetrics, as reported in Michael Lewis’s best-selling Moneyball. James and Beane applied complex formulas to replace simple, conventional measurements of players’ ability like runs batted in. But the underlying data set of baseball was the same. And while the formulas seem daunting, they can be defended intuitively in a way that some of today’s big data conclusions, based on vast financial and geographic information banks, no longer can. In fact, James introduced sabermetrics in 1980, when electronic data storage capacity and processing power were small indeed by today’s standards.
“Data analytics,” or “data mining,” as big data techniques are often called, is becoming a new information-science profession with a glamor that challenges traditional statistical education. In July 2013, then-president of the American Statistical Association, Marie Davidian, noted with concern in a membership publication that at a leadership summit of a new National Consortium for Data Science (NCDS) on data analysis and health, there were only two known statisticians among 80 participants. (Most of the others were computer scientists and programmers, geneticists, and other biomedical researchers and administrators, according to the published list.)
One computer scientist who helped pioneer the field from the 1980s, Gregory Piatetsky-Shapiro, described the relationship thus: “Statistics is at the core of data mining — helping to distinguish between random noise and significant findings, and providing a theory for estimating probabilities of predictions, etc.” But, he continues, data mining “covers the entire process of data analysis, including data cleaning and preparation and visualization of the results, and how to produce predictions in real-time.” Data mining goes beyond statistics to include techniques like pattern recognition and machine learning.
We can compare the rise of the data scientists to the emergence of the quants, mathematicians and physicists who helped transform Wall Street in the 1990s, when data analysis was emerging. Traditional economists and financial analysts were also quantitative thinkers, of course, but the newcomers brought fresh skills, developed in different contexts, for profiting from the latest technology with new predictive models.
So why is big data the subject of criticism? Two New York University professors even listed “Eight (No, Nine!)” problems with it in a New York Times op-ed. It may fit the pattern that the Gartner Group information technology consultancy calls the hype cycle of innovation: utopianism and backlash leading to a “plateau of productivity.” The reason is in three paradoxes of big data — hyperpragmatism, mutual neutralization, and advocacy.
Hyperpragmatism can be defined as putting what works ahead of the search for fundamental understanding or principles. Every serious beginning chess player, for example, studies theoretical ideas originally developed by the first world champion, Wilhelm Steinitz (1836-1900). But computerized records of master games, plus analysis by powerful chess software, show that some openings violating these principles contain traps that can be used by experts against unwary opponents.
Mayer-Schönberger and Cukier defend big data against the charge of aiming for “the end of theory,” as Wired magazine’s Chris Anderson asserted in a widely discussed essay. It has lots of theory behind it, they argue, and so it may. But they also acknowledge that Anderson is correct that “causality ... is being knocked off its pedestal as the primary fountain of meaning.” He meant that the idea of building and testing models, widely considered the essence of scientific enquiry, can now be replaced by computer algorithms that are not claimed to explain anything. Early computer translation programs, based on rules of grammar plus bilingual dictionaries, failed despite large investments. Google’s translation service uses the power of statistical prediction of correspondence based on a trillion words in all kinds of translations on the web, including its own book scan project. It doesn’t apply principles of syntax and semantics; it calculates the probability of a match, as IBM’s Watson supercomputer does in answering Jeopardy! questions. (David Ferrucci, who oversaw the Watson project at IBM and is now working at a hedge fund, came to believe that big-data correlation should be complemented by rule-based models for the most powerful simulation and prediction.)
Credit card company decisions illustrate how data-driven pragmatism can be useful in identifying correlations, even if it does not always offer an explanation for its findings. Big-data analysis by a Canadian retailer found that people who buy chrome skull accessories are especially likely to miss payments; purchasers of anti-scuff furniture pads are punctilious. Of course there might be a causal model at work here; these purchases could be proxies for conscientiousness. The trouble is that in many life-or-death questions big data still has limits. Seismologists, for example, have massive data on earth movements and tremors through an impressive network of borehole sensor stations, but we remain unable to predict when earthquakes will actually occur.
The second paradox is that big-data techniques that work, even when kept confidential, eventually are leaked or reverse engineered, leading to reduced effectiveness. Much of big data is used competitively, and in an economy dominated by large corporations, data analysts on all sides tend to cancel each other out in the long run. Big dogs pick up the techniques pioneered by underdogs like Beane, especially those who open up to best-selling business journalists like Lewis. In chess, massive computerized databases put the entire history of the game and grandmaster-level analysis in the hands of every serious player, but that has meant only that more experts from more nations are competing for the same limited prize money. Hedge funds are some of the most sophisticated users of big data, yet their returns have lagged behind the Standard & Poor’s 500 for the last five years. Only two mutual funds out of over 2,800 have been able to beat market indexes consistently over the past five years.
Normally, brilliant people can find profitable anomalies in markets for a year or two; then others reverse engineer their strategies. Only a handful of wizards, like the mathematician James Simons, founder of Renaissance Technologies, can constantly renew the talent in their organizations to replace rapidly obsolescent big-data strategies with new ones. And now that the secret is out about credit card companies’ and merchants’ analysis of spending patterns, it’s only a matter of time before some customers learn industry methods and manage their cash and credit card spending accordingly.
A number of data experts even believe that the spread of skilled analysis actually increases the role of chance. The New York Times science columnist John Tierney recently paraphrased the research of the Columbia University investment strategist Michael Mauboussin regarding business and sports alike: “As the overall level of skill rises and becomes more uniform, luck becomes more important.”
The ultimate paradox of big data lies in the form that both enthusiasts and skeptics use to think about it. So far at least, there is no reliable quantitative information on the implementation and success of big data projects, partly because (as we have seen) it’s hard to say where traditional statistics ends and the new data analysis begins, and partly because most big data projects are proprietary. It might be possible to get around the obvious reporting bias in the future, but it will be challenging; even in academic science there are serious questions about the omission of negative results, the so-called file drawer effect. But it’s a mistake to suppress failures. Analyzing the errors of big data, like those of the Google flu prediction project, is the best way to combat what some sympathetic researchers call “big data hubris” and make data analysis more rigorous. Subjective human expertise has also been introduced to big data algorithms through the back door; for years Google has been using trained evaluators to help improve search results.
Ironically, the great strength of the big data movement now is the kind of anecdotal evidence and advocacy rhetoric that it seeks to replace. There are no big data about big data.
Edward Tenner is an author and a visiting researcher in the Rutgers Department of History and the Princeton Center for Arts and Cultural Policy Studies.
FURTHER READING: Tenner also writes "Could Computers Get Too Smart?" Blake Hurst adds "Big Farms Are About to Get Bigger." Mark P. Mills discusses "The Next Great Growth Cycle."
Image by Dianna Ingram / Bergman Group