To understand how something will end, we have to understand how it began.
Around 2003, someone very, very high up in the intelligence community visited Cornell. You might think you know people high up in the intelligence community; this person was a click higher. Naturally, I signed up to chat with him. I remember that he walked into my office with the aura of Brigadier Gen. Jack D. Ripper having sent the entire wing in on Wing Attack Plan R. Underneath the bravado, though, there was a strong dash of Gen. Buck Turgidson, and in any case, he had no military background and would have been too plump to make PFC had he enlisted. The protruding belly betrayed years spent managing physicists, whose research seems to run on beer as much as it does on taxpayer money. That money is spent on gadgets that promise to "unleash the secrets of the cosmos and the inner workings of God," with the latter added purely to win Republican support in Congress, but that end up refining a constant by an inconsequential amount.
Once he decisively took a seat in my office, he leaned back, and I could imagine him taking a puff from a non-existent cigar, Ripper-style. This person knew how to pause. He then said "So. You're a systems guy. You must know about systems." I braced for a question from left field, probably involving fluoridation and the purity of our precious bodily fluids. "What if I had a graph. A really, really large graph. Billions of nodes. Trillions of edges. Let's say every node in the graph was a person. Edges between people described phone calls, interactions, stuff like that." He paused for dramatic effect, as he mentally took another puff from his non-existent cigar. "How would you find bin Laden?"
The man was literally looking for bin Laden, in my office.
And he was using Big Data to do so, long before it was a trendy buzzword.
It has been no secret whatsoever that the intelligence community in the US has been collecting and collating data at a massive scale. And any techie will tell you that you can't operate at "Google-scale" while remaining compliant with all those pesky laws and regulations that restrict the government from performing domestic surveillance. This data is inherently tainted and inseparably intertwined. Anyone who pretends otherwise has a Total Information Awareness program to sell under a disguise.
So, I knew it, you knew it, and everyone else knew it long before Snowden revealed awfully designed PowerPoint slides that confirmed what we all knew. And thank God for that, so we can now discuss where we are and how to proceed from here.
But to have a rational discussion, let's first dispense with the false indignation and cheap righteousness that seems to run freely in DC these days.
And we failed to take a stand. We consumed reports about "terrorist chatter" without asking how those reports were compiled. We knew that Atta and friends were not hanging out on IRC in the #terror channel, chattering away. In fact, they were using messages left in Gmail Drafts folders to evade detection, and to get at this information, you either use human intelligence (HUMINT) or signals intelligence (SIGINT). The news reports clearly implied that we were hearing the distilled result of massive SIGINT, and the nation ate this stuff up. So if any Congressperson is going to get indignant now and get upset about what was happening, let me remind them:
We knew it because we were warned. There was no shortage of people who warned, while the PATRIOT Act was being passed, that the ensuing culture of surveillance would erode values we hold near and dear. Many people urged caution, and noted that we needed to define that which we hold sacred, so we can uphold those principles even as we fight terror. Kind of the way Norway did when tackling the threat Breivik posed to their open society.
But we threw that all away. It was the result of a decades-long process of wussification, enabled by lack of leadership. John Wayne was long dead and gone, the Marlboro Man had died of emphysema, Kurt Cobain had offed himself, and Superman was on a ventilator. There was no one who seemed to possess a pulse and a sense of American principles; in fact, one of the top people in charge lacked both. I cannot overemphasize the damage this overreach did to our identity as a nation. Even from where I occasionally sit, on the admissions committee at an Ivy League university, its impact is pretty clear: the number and quality of our overseas applicants dropped as we squandered our moral high ground.
We knew it because we told them to do it. Massive data collection is something that the intelligence community is explicitly tasked to do. They would be remiss in their duties if they did not actually attempt to collect everything they can collect. This community consists of hard-working, reasonable individuals who have been handed an impossible, ill-defined mandate: namely, protect US interests here and abroad against all possible threats. And the mandate comes from a public that has grown exceedingly soft, one that is all too ready to compromise its core principles. It is no surprise that they will encounter gray areas, and in fact, it is not even a surprise that they will overzealously venture into not-so-gray areas.
And we actually need the NSA to perform signals intelligence. Shortly after the Snowden leaks, there were calls for ceasing all SIGINT or defunding the NSA. Anyone who advocates such an extreme position is either not living here or hopelessly naive.
But the boundaries of such data collection have to comply with our values and have to be arrived at through a collective debate. "The Johnsons do it" is no reason to compromise that which defines us, even, or especially if, the Johnsons are Chinese or French. The same establishment that is currently using the French as justification for surveillance was calling them all kinds of derogatory surrendering simian names just a short while ago.
Tigers have stripes and massive data collection of all kinds is what this community does with tax money. This is why checks and balances are necessary.
We knew it because it's in their DNA. Massive data collection is nothing new. It was only a few years ago that the Stasi archives were opened up to reveal "Geruchsproben," carefully catalogued jars containing the smell of their citizens. The Stasi had field agents collect smell samples, sometimes by having the citizens sit on special chairs, or sometimes by breaking into people's homes and literally stealing their underwear.
The obvious reaction is to get outraged and demand to know "under what conditions would the smell of an individual be of any use to an intelligence officer?" As with most obvious questions, this question, its answer ("to sic the hounds"), and the ensuing debate are all a waste of time. One can already imagine the headlines. Wired will discuss, at length, the mechanics of tracking people by scent, with a special highlight on a nose designed out of commercial-off-the-shelf components by the MIT Media Lab that outperforms a hound in carefully controlled laboratory experiments, as long as that laboratory is squarely inside the MIT Media Lab. DailyKos will ring out with indignation, while the National Review will ask why anyone would use perfume if they didn't have something to hide. When the topic is thoroughly exhausted, when it is so universally accepted that the act of speaking up carries absolutely no political risk, mainstream writers like Thomas Friedman will jump into the fray. The only sane voice in all of this will be a short letter by a NYT reader from Kalamazoo, Michigan, pointing out to no one in particular on page C6 that "to sic a hound on a person's scent trail, you need a starting point for the scent, for the hound cannot perform investigative analysis to locate someone, and if you've got a starting point, why do you need the scent sample?"
By the way, speaking of DNA: You can be sure that there are massive DNA databases in cold storage, and it'll take a Snowden Junior for that other discussion to come to light. We know that the intelligence community collected DNA data at the risk of forever tainting vaccination efforts worldwide. How surprised would anyone be if busboys in trendy DC bars near Embassy Row earn an income on the side, swiping saliva from the used utensils of foreign emissaries, or if worldwide bone-marrow registries have been compromised to compile aggregate DNA data for different ethnic groups?
Once again, this industry collects every kind of information when it's left unchecked. All the soul searching and righteous indignation from inside the beltway is cheap, after-the-fact posturing.
For the right question is: what will happen now that the Snowden leaks are public? Having re-directed the emotional angst that this topic attracts, we can now dispassionately analyze how we will find the balance point between national security interests and privacy concerns. But to make any prediction, we need a framework.
There has been a lot of discussion on how the Snowden saga will end. I am referring of course to the actual part of this story that has societal consequences, not to the human interest story around Snowden that the media wants to play up. While the questions that relate to Snowden-the-man are interesting, e.g. "what motivated him to give up his life in Hawaii?", "did he time his leak to undermine the Sino-US talks?", and most importantly, "what is it like being stuck in an airport for life?", they are essentially of very little consequence to the rest of us.
Instead, I want to offer a simple framework for analyzing Internet policy issues, unabashedly cribbed from the Internet visionary David Clark, but amended with a quantitative model that can tell us what to expect. So, let's do some soft science on a topic that involves technology, policy and society.
All online policy emerges as a result of three competing forces. Think of them as vectors whose net sum determines what happens online at the end of the day. Once we sort out the strength and direction of each force, it becomes really trivial to figure out what will happen next.
Here are the forces, their direction, and their magnitude:
The first force vector, the military/political (M/P) one, consists of the military, intelligence and political establishment, whose aims are to keep online social movements in check. Its ostensible goal is to control all online interactions. The force vector points in a direction much like mainstream media today: left unchecked, this force will reduce the number of voices, limit how they can be expressed, and eliminate all discourse except for a handful of corporate-approved messages. This is the right arm of Big Brother, with swole fingers from the 5am beltway jog followed by crossfit that makes typing out nuanced computer policy difficult.
It's this force vector that tried to ban encrypted communications for decades and tried to make it illegal to export three-liner Perl scripts for encryption. It's the same force that advocated, and still continues to advocate, a "driver's license for the Internet." Everyone who has been on an online forum and engaged in a meaningless online fight has felt like hunting down whoever is hiding behind that pseudonym. When I ran an online forum where political discussions took place, the main request I received from the older folks was that I "demand users to fax me their identity cards." While most young people have deeply understood that this is neither possible nor desirable, the people who are in a position to make Internet policy proposals have yet to learn this lesson.
So it stands to reason that some of the more aspiring members of this community will want to collect everything everyone does online, keep it in a big datacenter in the middle of the country, and run queries against it to see who's up to what.
And these folks have an enormous budget. Unopposed, they're unstoppable, for what kind of a president can overrule this community, and still count on them to feed him useful analysis later on? The only thing tempering the destruction that this force could unleash is that it historically has not been technically savvy, but this is, evidently, changing.
The second force vector, commerce, consists simply of the collective economic interests of companies that fund elections. And it points in the direction of making the Internet a "pay-for-play" environment, of maximizing revenue extraction from Internet users. It has no ethics or higher goals; it is axiomatic, in our current times, that companies are motivated solely by greed and answerable solely to their own shareholders, regardless of the fact that they rely on countless resources from their surrounding society. This single-minded drive for profit maximization actually makes it easier to analyze this force.
The commerce vector typically operates synergistically with the military/political vector, where one herds the online populace to a few corporate-owned and operated choke-holds, and the other extracts profits. But the commerce force does break ranks with the M/P force, to create more communication channels instead of fewer venues, to create more interactions among users instead of suppressing them, and so forth. And as powerful as the military/political force vector is, and as many dollars as they command, the commerce force vector commands two to three orders of magnitude more. So they handily beat the Military/Political forces every time they point in different directions.
What happened with encryption makes the relative magnitudes of the forces very clear. For decades, it was US policy to prohibit the export of cryptographic algorithms. And the cool geeky t-shirts with those Perl three-liners did absolutely nothing to change policy -- the geeks were too few in number, and "freedom to share cryptographic algorithms" was not exactly a cause the lay public could rally behind. But the moment the US computer industry decided that, for its own competitiveness, it needed strong encryption on the Internet, the politicians suddenly discovered that the First Amendment applies to crypto just as surely as it applies to everything else. The transformation happened overnight, and it brought us good things, like SSL and online commerce and a host of other developments that make the world a much more interesting, and better, place.
The third force vector, the public, consists simply of the collective human interests of the people who use the network. It is by far the most powerful force, but has a number of shortcomings: it is slow to awaken, not technically sophisticated, and easy to derail and divide into factions over trivial concerns. But once the giant is awake, absolutely nothing can stand in its path.
What makes the public stand up and take a stance? No one knows. The Arab Spring was precipitated by a street salesman who, after the police took away his cart, got so depressed that he set himself on fire, and before we knew it, dictators across many continents were spinning up their chopper blades. The Turkish uprising was precipitated by a couple of trees in a park. The second wave of Brazilian uprisings was over a 10 cent hike. This makes this force terrifying, because when the giant shows signs of awakening, when his eyelids flutter and he's asking questions trying to get his bearings, it's too late.
I propose a simple technique to decide which of these forces will reign supreme based on simple high-school math with a dash of historical analysis.
On issues of Internet governance, I propose the use of dollars as a common, universal unit of strength for measuring societal forces.
The strength of the military/political force is the amount of money it spends to achieve its goals. There are slightly different multipliers for "new funding to be appropriated" versus "already allocated funding," as laying people off meets more resistance than new pork-barrel spending, but a unit multiplier is sufficient for first-cut analysis. Add to this the cost of catastrophes that the policies could prevent, multiplied by their likelihood.
The strength of the commerce force is the money that companies stand to gain or lose under different policy regimes.
Doing all this quantitatively is difficult, but engineering is all about back of the envelope calculations.
In this case, the first number is simply the sum that the intelligence community spends on eavesdropping at large scale, combined with very tiny likelihoods for events that have modest costs. Contrary to the $30M claim in the Snowden slide deck, the cost of the datacenters and the analysis personnel will likely be in the single to very low double digit billions.
The second sum is the dollar amount that the cloud providers would lose due to direct loss of revenue from antsy customers, especially foreign ones, as well as the indirect costs of losing dominance in their field.
Finally, the last sum is the value people place on activities that cannot take place in a surveillance society. It is almost impossible to estimate this, for we cannot know, say, the dollar figure a dissident would place on being able to blog unfettered, or perhaps, the dollar value Anthony Weiner would place on keeping Carlos Danger's emails private. But a good proxy for this metric is simply the amount of money people plaintively spend on underground activities, a small percentage of which would be curtailed in a surveillance state. Given that our underground economy is roughly 1-2 trillion dollars, even tiny percentages have impact.
So my rough guesstimate is that the forces are aligned in the ratio 1:1:3, with an alliance of the public and commercial interests that overpowers the M/P establishment in favor of transparency and online privacy guarantees.
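For readers who like to see the envelope itself, here is a minimal sketch of that arithmetic in Python. It is an illustration under loud assumptions, not a real model: the `Force` class, the `expected_strength` and `net_force` helpers, and the collapse of each vector onto a single privacy-versus-surveillance axis are all hypothetical simplifications of mine. The only numbers taken from the text are the 1:1:3 magnitudes.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Force:
    name: str
    magnitude: float  # relative strength, in units of the 1:1:3 guesstimate
    direction: int    # assumed 1-D axis: +1 = toward privacy, -1 = toward surveillance


def expected_strength(spending: float,
                      catastrophes: Sequence[Tuple[float, float]]) -> float:
    """Spending plus the likelihood-weighted cost of preventable catastrophes.

    Each catastrophe is a (cost, likelihood) pair; a unit multiplier is used
    for spending, as in the first-cut analysis.
    """
    return spending + sum(cost * likelihood for cost, likelihood in catastrophes)


def net_force(forces: Iterable[Force]) -> float:
    """Signed sum of all forces along the single assumed policy axis."""
    return sum(f.magnitude * f.direction for f in forces)


# Magnitudes come from the 1:1:3 ratio; the directions encode the claim that
# commerce and the public line up against the M/P establishment on this issue.
forces = [
    Force("military/political", 1.0, -1),
    Force("commerce", 1.0, +1),
    Force("public", 3.0, +1),
]

net = net_force(forces)
print("net force:", net, "(positive favors privacy)")
```

With these inputs the net comes out positive, which is all the sketch is meant to show; swapping in actual dollar estimates via `expected_strength` changes the magnitudes but not the method.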
Let's sanity check: measuring by the highly scientific metric of column inches in newspapers that I have personally seen (yes, I'm fully aware of the bias and write this with a tongue firmly in cheek. I would love it if someone would help do a proper quantitative analysis), I see an approximate 2-to-1 ratio. For every unabashedly condescending, non-self-reflecting, pro-surveillance gung-ho article in WaPo, the beltway insider rag that once attributed the invention of email to a self-proclaimed child-prodigy from MIT who made misleading claims through Wikipedia, there are at least two articles critical of the alleged activities. The cloud companies have already started to feel the sting of lost profits and have initiated a push for transparency. And in line with our estimates, the reaction from the public has been much stronger than the tepid call for transparency from industry.
The numbers suggest that the US will emerge out of the Snowden debacle with a set of processes that prohibit the kind of domestic surveillance that Snowden exposed. But the forces are fairly close, and the victory will be a highly qualified one. We'll get the minimal set of changes such that a figurehead can say "we do not perform domestic surveillance" with a straight face, for a specific definition of every word in that sentence.
For instance, we may be left with loopholes big enough to drive trucks through, say, trucks containing UK data from the US to the UK, and lorries containing US data from the UK to us. Cloud computing makes it trivial to escape pesky jurisdictional obstacles by sending queries that used to go to a datacenter in Utah to, say, Ireland instead. Closing this loophole will prove to be difficult, because information-sharing between different nations is actually desirable, and the legal system is better at binary decisions than those of degree.
Or we may end up with only token changes to the way court orders are retroactively issued. The idea behind the current scheme is that the surveillance engine can collect data now, under hot pursuit, and justify it later. While there is merit to the hot pursuit argument, its unrestricted use fosters an environment where results come first, principles are an afterthought, and there are no effective checks against overreach.
Or we may never get what is most needed, which is strict limits on how the data, once collected, is used. We now know that a contractor in Hawaii has access to the entire crown jewels of the NSA, namely, "metadata" about the kind of information they can collect and analyze. How many contractors have access to the lower-value phone call "metadata"? There are now naive proposals to avoid a second Snowden mishap by doubling up every analyst and having someone look over their designated buddy's shoulder. Besides doubling employment in the DC metro area, this ill-conceived attempt only makes it more likely for data to fall into the wrong hands. What we need are trustworthy processes, backed perhaps by trustworthy operating systems that can provide assurance that no one, not even system administrators, can violate a policy associated with a piece of data. Anything less opens us up to a "deep state," where certain elements within government misappropriate the surveillance data for their own ends.
In big battles, it's instructive to pay special attention to certain smaller engagements that serve as litmus tests. In this case, I expect the "abandoned email loophole" to serve this purpose. The 1986 Electronic Communications Privacy Act classifies old emails left on a server for more than 180 days as "abandoned" and gives the government authority to read these emails without a warrant. This is clearly an outdated loophole, one that would undermine discerning users' and companies' willingness to store emails in the cloud. If the cloud providers really feel the sting of user apprehension about surveillance, and if they really put their weight into fighting on the same side as the public, this loophole would quickly be shut down. Whether or not it is indeed firmly closed will be an indicator of genuine change of surveillance policies.
Overall, it's time to be somewhat optimistic: the fundamentals point to a ground shift, where the commerce and public forces are now exerting an influence on previously unchecked elements in government, and the net vector points towards a freer, better Internet. But there are reasons for concern: the battle will be drawn out, victories will be partial, and the extent to which loopholes get left behind will determine how much privacy we have online.
Back in 2003, the rest of my conversation with my esteemed visitor was a lot of fun. I told him what I'd do to handle such a large graph. And I gently told him that I wasn't building such a graph database. I wasn't then, but I am now. We have a system called Weaver in the works, inspired by the revolutionary HyperDex database. I realize that its utility is much lower now that bin Laden has been located and rightfully dispatched, but there will undoubtedly be other massive data sources, and we'll need systems that can handle them. For we cannot afford a graph database gap, any more than we can afford a mineshaft gap.
But in addition to such a database, we need tightened definitions for what kinds of surveillance data can be collected, as well as technical and legal measures to keep that data used solely in accordance with appropriate policies. Interestingly, there are technologies that can restrict what users, including Snowden-like "super users", can do with data. Once we re-establish our principles, we have the technical means to enact them. But first, the current era of covert, boundless data collection must come to an end.