Context Clues: Making Your Surveillance Policies Work Smarter


President, package, exposure, raid, government.

What do these words mean to you?

Well, according to the NSA, they mean that you might be a terrorist. In 2011, as part of its Analyst Desktop Binder, the NSA released a list of 377 keywords that it uses to monitor for potential threats. If you go through the list, you’ll notice words like “recovery”, “response”, “Cloud”, and many others that are similarly generic.

Now, given the context, those words start to make more sense, right?

When it comes to surveillance of any kind, whether for national security or financial regulatory compliance, the goal is to narrow your full set of data, by flagging items that match certain criteria, into a more relevant and manageable set that can be manually reviewed. Reviewing every single message, post, text, or call with an equal level of scrutiny would let you capture every possible violation, but it’s simply not practical.

One of the most common ways of flagging items for review, if not the most common, is searching for keywords and key phrases associated with the activity in question. On its own, that practice does little to produce a useful set for review.

The main problem with flagging items based solely on keywords is that the keywords are typically nonspecific, resulting in an abundance of false positives. Looking at that list of words from before, we can infer that, by looking for “president”, they intend to intercept any potential threats to the POTUS, but we only know that because we were given the context used by the NSA.

The same goes for the system you use to search your data. In order to flag relevant items, and produce a manageable set to review, you need to provide your system with additional context around your keywords.

If we scale back the ambitious data set used by the NSA to just the communications of your company, we get a better idea of the kind of results we would see if we created a policy to capture just the keyword “president”.

The first thing we’d probably notice is the number of events that fired only because a signature contained “President” or “Vice President of X”. We’re also likely to see just as many news articles mentioning President Obama, the presidential race, or company presidents. On top of that, random things like discussions of popular TV shows or movies about a president, or someone being president of a committee, would make a manual review of the results virtually impossible. The sheer volume of data would make spotting the Loch Ness Monster in Minnesota more likely than finding an actual violation.
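To make that concrete, here is a minimal Python sketch of a keyword-only policy. Everything in it is an assumption made for illustration: the function name flag_keyword and the sample messages are hypothetical, and a real system would search far more data than five lines of text.

```python
import re

def flag_keyword(messages, keyword):
    """Return every message that contains the keyword as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [message for message in messages if pattern.search(message)]

# Hypothetical sample messages, invented for this illustration.
SAMPLE_MESSAGES = [
    "Jane Doe, Vice President of Sales",
    "Did you watch that show about the president last night?",
    "I was elected president of the social committee.",
    "Don't tell anyone that the President is leaving the company",
    "Lunch at noon?",
]

flagged = flag_keyword(SAMPLE_MESSAGES, "president")

for message in flagged:
    print("FLAGGED:", message)
```

In this toy sample, four of the five messages are flagged, and only one of them is something a reviewer would actually care about.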

So what’s the solution?

At the start of this article, I listed five seemingly random words that didn’t mean anything to you until you read further. But what if I started it like this?

“(Attempt OR Threaten OR Take Down OR Eliminate) [0 to 5 words] President”

“(Anthrax OR Bomb OR Explosive) [1 to 3 words] Package”

“(radiation OR hazard OR disease) [0 to 3 words] exposure”

“(drug OR military) raid”

“(Cripple OR Destroy OR Take Down OR Eliminate) [1 to 5 words] government”

Had I done that instead, you would have had a better understanding of what those words meant. Rather than drawing your understanding from my explanation that “this is what the NSA looks for”, you’d have gone from questioning what the words were to questioning why they’re there. That is to say, mentioning the NSA would have explained why I was talking about terrorist activities, rather than what the words themselves meant.

Let’s go back to the policy we created at your company. Suppose we put our keyword into context and searched with strings that look more like this:

“(rumor OR secret OR don’t tell anyone OR will step down as) [1 to 3 words] President”

“President [3 to 8 words] (bankrupt OR will announce OR resigning OR merger)”

We can tell, before we even look at them, that our results are going to be more relevant. This new version will trigger on items such as “The president is going to say the company is going bankrupt”, “The President of the company will announce something big tomorrow”, or “Don’t tell anyone that the President is leaving the company”.
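For readers who want to see the mechanics, here is a minimal Python sketch of how the two strings above could be evaluated. It is an illustration under our own assumptions, not a description of how any particular surveillance product works: the function names (tokenize, phrase_positions, proximity_match, fires), the rule for splitting text into words, and the sample messages are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text, normalize curly apostrophes, and split it into words."""
    text = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z0-9']+", text)

def phrase_positions(tokens, phrase):
    """Return (start, end) token indexes for every occurrence of a phrase."""
    words = tokenize(phrase)
    size = len(words)
    return [
        (i, i + size)
        for i in range(len(tokens) - size + 1)
        if tokens[i:i + size] == words
    ]

def proximity_match(text, first_terms, second_terms, min_gap, max_gap):
    """True if any first term is followed by any second term with
    between min_gap and max_gap words in between."""
    tokens = tokenize(text)
    for first in first_terms:
        for _, first_end in phrase_positions(tokens, first):
            for second in second_terms:
                for second_start, _ in phrase_positions(tokens, second):
                    if min_gap <= second_start - first_end <= max_gap:
                        return True
    return False

# The two example strings from the article, written as
# (first terms, second terms, minimum gap, maximum gap).
POLICIES = [
    (["rumor", "secret", "don't tell anyone", "will step down as"],
     ["president"], 1, 3),
    (["president"],
     ["bankrupt", "will announce", "resigning", "merger"], 3, 8),
]

def fires(text):
    """True if the text triggers at least one of the policies."""
    return any(proximity_match(text, *policy) for policy in POLICIES)

# Hypothetical messages: two worth reviewing, two everyday false positives.
for message in [
    "The president is going to say the company is going bankrupt",
    "Don\u2019t tell anyone that the President is leaving the company",
    "Jane Doe, Vice President of Sales",
    "I was elected president of the social committee.",
]:
    print(fires(message), "-", message)
```

In this sketch the first two messages trigger a policy, while the signature line and the committee message, which a bare “president” keyword would have flagged, do not. Counting the gap in whole words, rather than characters, is a design choice that keeps the policy readable in the same terms as the strings above.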

This also answers anyone who may have been wondering why you would want to capture someone mentioning “president” at your firm. In addition to giving context to the keywords it captures, this type of policy makes its own purpose clear. A context-based policy not only improves the quality and relevance of the events you review but also establishes a clear intent for what the policy is meant to achieve. In the example, we’ve established the intent to capture rumors, secrets, or tips that could be used for illegal front-running and insider trading.

The examples provided here are by no means intended to be comprehensive policies, and they aren’t fit for any real-world application, but they do serve as a good starting point and demonstrate the concept behind the types of policies we create.

The reality is that, while you can never fully eliminate false positives, implementing policies that go beyond flagging keywords can help you focus on catching the real potential threats to your company.

Rather than just identifying words, we strive to ensure that our approach to designing and building surveillance policies identifies the core concepts and language of the violations themselves. So if that sounds in any way appealing, let us know how we can help you.

And if the NSA should ever happen to find this post in its review queue, we’d love to help you out too.