The NSA story is still breaking, so almost anything I write may soon be overtaken by new information, but I thought I’d address one aspect that involves an area in which I have some expertise: Although whistleblower Edward Snowden claims that major tech corporations such as Microsoft, Google, Yahoo, and Facebook have given the NSA direct access to information on their servers, spokespersons for the corporations are denying it. They are saying that they comply with legal government requests for information, but that they do not give the government “direct access” to their servers.
If we make a few technical assumptions about some missing details, it’s possible both sides are telling the truth.
When we use sites like Facebook, Google, and Yahoo, we are for the most part viewing and modifying only one thing at a time: One user account, one fan page, one news page. The pages might have some details — postings or timeline items — but they are organized around a single conceptual entity. Furthermore, our need to modify this entity imposes an architectural limitation: every entity must be stored in one and only one place, i.e. on one server or in one database. Otherwise you could have different versions of the entity in the system, and you'd see different pages depending on which server or database you happened to reach on each visit.
On the other hand, since users only work with one entity at a time, it is relatively easy to scatter all the entities in the system across many servers. A site with 100 million users could have 4000 servers, each storing information for 25,000 users. When any single user is accessing the site, he or she would mostly be interacting with whatever server that user's profile happens to live on. This sort of partitioning is a common way of scaling up an application to support very large numbers of users.
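To make the partitioning concrete, here is a minimal sketch of how a site might route each user to a single server, assuming a simple hash of the user ID picks the shard. The server count and the function name are purely illustrative, not anyone's actual implementation.

```python
import hashlib

NUM_SERVERS = 4000  # the illustrative server count from the example above

def server_for_user(user_id: str) -> int:
    """Map a user ID to exactly one interactive server via a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SERVERS

# Every request touching this user's profile goes to the same server,
# so there is only one authoritative copy of the entity.
print(server_for_user("alice@example.com"))  # always the same server for this user
```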
Now suppose that you are the company running this site and you want to find out some aggregate information about your users. Perhaps you want to find all users living in Chicago. With the data architecture I described above, you would have to submit a query for a list of all users in Chicago to all 4000 servers, collect the answers, and combine them into a single list.
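Here is a hedged sketch of what that scatter-gather query might look like, with `query_one_server` standing in for whatever RPC or SQL call each interactive server actually exposes (the name and signature are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SERVERS = 4000  # same illustrative figure as above

def query_one_server(server_id: int, city: str) -> list[str]:
    """Stand-in for asking one interactive server for its matching users."""
    return []  # a real implementation would issue an RPC or SQL query here

def users_in_city(city: str) -> list[str]:
    """Fan the same question out to every server and merge the answers."""
    matches: list[str] = []
    with ThreadPoolExecutor(max_workers=64) as pool:
        for partial in pool.map(lambda sid: query_one_server(sid, city),
                                range(NUM_SERVERS)):
            matches.extend(partial)
    return matches

chicago_users = users_in_city("Chicago")
```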
One problem with that approach is that the interactive servers are optimized to respond quickly and efficiently to requests for single users, so a query that could return hundreds of users will push the server outside its high-performance envelope, which could cause glitches that users would notice. The problem would be even worse for a more complex query, such as an advertising department query to find all users who live within 15 miles of any of a clothing retailer’s 75 store locations. Implementing the ability to handle such queries efficiently would likely require the creation of additional data structures and indexes, which would impose their own performance burdens and equipment requirements.
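To see why the store-radius query is painful without dedicated indexes, consider this naive sketch: absent a geospatial index, each server would have to compare every one of its users against all 75 store locations. The haversine formula is standard; the field names are made up.

```python
from math import radians, sin, cos, asin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))  # 3959 = Earth's mean radius in miles

def users_near_stores(users, stores, radius_miles=15):
    """Brute-force scan: every user checked against every store location."""
    return [
        user for user in users
        if any(
            miles_between(user["lat"], user["lon"], store["lat"], store["lon"]) <= radius_miles
            for store in stores
        )
    ]
```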
A second problem is that querying 4000 servers is a complex and time-consuming operation, and some of the data could be missed if any of the servers happen to be down when the query comes in. The problem becomes even worse for queries with results that span multiple servers. For example, the NSA might have a target user and want to know all users who exchange messages with that user or with any user who exchanges messages with that user, requiring a series of queries spreading across dozens or hundreds of servers. Or maybe not all user data is stored on one server, e.g. for performance reasons it might make sense to store user messages on separate servers from user profiles. Performing queries across multiple servers can get very messy and hard to implement.
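As a rough illustration of how the contact-graph query fans out, here is a two-hop sketch. `contacts_of` is a hypothetical stand-in for "who exchanged messages with this user," and each call could land on a different server (or set of servers) than the last:

```python
def contacts_of(user_id: str) -> set[str]:
    """Hypothetical stand-in: who has exchanged messages with this user?
    Each call may hit whichever server(s) store that user's messages."""
    return set()

def contacts_within_two_hops(target: str) -> set[str]:
    """The target's direct contacts plus the contacts of those contacts."""
    first_hop = contacts_of(target)
    everyone = set(first_hop)
    for contact in first_hop:
        everyone |= contacts_of(contact)  # another round of cross-server queries
    everyone.discard(target)
    return everyone
```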
The root cause of all these problems is that a server and database architecture that is optimized for thousands or millions of interactive users is unlikely to be efficient or effective at broad ad-hoc queries across the entire user base.
Information technology companies have been dealing with these kinds of problems since long before the modern internet, and there is a well-known general architectural pattern that has proven effective: Build a separate database that is specifically designed to efficiently handle large ad-hoc queries and populate it with data pulled from the interactive servers on a regular basis. Depending on the application, this could be done on a nightly schedule, or whenever certain events happen, such as every time a user updates his profile or sends a message. In large or complex data environments, there is often a whole separate set of intermediary servers to perform the distribution process.
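Here is a minimal sketch of that distribution step, assuming a nightly batch that copies changed records from each interactive server into a single reporting database. SQLite stands in for whatever the reporting store would really be, and `extract_from_server` is invented for illustration.

```python
import sqlite3
from datetime import date

NUM_SERVERS = 4000  # same illustrative figure as above

def extract_from_server(server_id: int) -> list[tuple]:
    """Stand-in for pulling the day's changed profiles off one interactive server."""
    return []  # rows of (user_id, city, last_updated)

def refresh_reporting_copy(db_path: str = "reporting.db") -> None:
    """Nightly batch: consolidate data from every interactive server into one
    query-oriented database built to absorb large ad-hoc queries."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS user_profiles "
        "(user_id TEXT PRIMARY KEY, city TEXT, last_updated TEXT)"
    )
    for server_id in range(NUM_SERVERS):
        con.executemany(
            "INSERT OR REPLACE INTO user_profiles VALUES (?, ?, ?)",
            extract_from_server(server_id),
        )
    con.commit()
    con.close()
    print(f"Reporting copy refreshed on {date.today()}")
```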
(By the way, you likely encounter variations of this architecture all the time. If you use Google Analytics, it’s probably why you can’t get statistics more recent than the previous day — user clicks from all over the world are collected on Google servers scattered all over the world, and it takes time to consolidate them in a single place. This is also why ATM and credit card transactions don’t show up immediately in the detail view of your account on your bank’s website — the servers that provide fast responses to the ATM network and point-of-sale terminals haven’t yet transferred the data to the reporting-oriented servers that provide the data on the website.)
The concept of separate systems for ad hoc queries and reporting has been around a long time and has many variations and names — data warehouse, reporting server, analytics server, data mart. The last of these usually refers to a tailored subset of the full reporting data offered to a particular type of end user — one subset for company executives, another for quality monitoring, and a whole slew of subsets for various advertisers looking to target lucrative segments of the user base.
Given the prevalence of this kind of architecture, it seems likely that if the NSA approached companies like Microsoft, Facebook, Yahoo, and Google with requests to access user data, the companies' technical response would be to set up a data mart to meet the NSA's needs.
There are several advantages to this approach. First, it’s a familiar process that the companies can easily do. It’s certainly easier than giving the NSA access to the live user servers, especially since data mart maintenance software already exists for all the major databases.
Second, compliance analysis is simpler. If there are varying rules governing what the NSA can see — e.g. full message content for some users, only message metadata for others, depending on laws, procedures, and court rulings — there’s no need to have the query server software analyze every NSA query and filter the results. Instead, the filtration rules are implemented in the distribution mechanism that populates the NSA data mart, which then only contains the approved data. The filtration rules are likely created with a software tool that is designed to make them easy to create and check for correctness.
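A hedged sketch of what such a filtration rule might look like inside the distribution mechanism follows; the authorization levels and field names are entirely invented, and any real rules would presumably be far more elaborate.

```python
# Invented authorization levels: which fields of a message record may be
# copied into the data mart. The real rules would come from laws, procedures,
# and court rulings, not a hard-coded dictionary.
ALLOWED_FIELDS = {
    "full_content": ("user_id", "timestamp", "recipient", "subject", "body"),
    "metadata_only": ("user_id", "timestamp", "recipient"),
    "no_access": (),
}

def filter_for_data_mart(record: dict, authorization: str) -> dict | None:
    """Apply the filtration rule as the record is copied toward the data mart."""
    allowed = ALLOWED_FIELDS.get(authorization, ())
    if not allowed:
        return None  # the record never reaches the data mart at all
    return {field: record[field] for field in allowed if field in record}

# Only the approved fields survive the copy:
filter_for_data_mart(
    {"user_id": "u123", "timestamp": "2013-06-07T12:00:00Z",
     "recipient": "u456", "subject": "hi", "body": "..."},
    "metadata_only",
)
```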
Third, this process clearly delineates what the NSA has access to and what it does not. If it ever comes to a court case or a Congressional investigation, the companies can point to the contents of the data mart to show exactly what they turned over.
From the point of view of the IT staff running the websites for those companies, neither the NSA nor anyone else can perform queries on the thousands of servers holding user data — they're simply not designed for that. The employees would know about the data being swept into the data warehouse every night, but the NSA data mart is only a small portion of that, and the data being transferred is presumably only what is required by law (at least as far as they know). Given the requirements of national security, most of the IT staff would not know the details.
To the NSA employees querying this data, however, the details of the data mart implementation would be unimportant, and as far as they’re concerned they appear to have access to the servers at Facebook, Yahoo, and Google — much as it appears to you that you are accessing your bank’s computers when you view your account transaction details on your bank’s website, even though you’re probably just accessing a delayed copy of the data. To NSA analysts poring over the data for signs of terrorist activity, the distinction between the different types of servers would be immaterial, and would be unlikely to be included in the training materials that have come to light recently.
I should note that this is only speculation on my part. However, if these data marts are the mechanism by which the companies comply with FISA warrants, it would explain some of the confusion.
Addendum: Mark Jaquith has posted a similar theory of what's going on, and he points out that it corresponds well with this story in the New York Times. Also, the reports that PRISM has an annual budget of only $20 million make a lot more sense if PRISM is just the NSA's program for aggregating data pulled in from the corporate FISA compliance data marts.