Data Loss Prevention: What a Whitecode and a VCR Have in Common


Let’s get a couple of obvious things out of the way. First, we all know that data leaks are bad. Second, we also know that they happen all the time. According to the Identity Theft Resource Center breach report, the number of US data breaches in 2015 totaled 781.


What you may not know is that outsider hacking attacks accounted for less than 40 percent of all breaches.


Figure 1: Data Breach Incidents – By Type; Source: Identity Theft Resource Center 2015 Breach Reports

If we take a look at this chart from the ITRC’s report, we can see that the other 60-plus percent is due to employee error, accidental e-mail, insider theft, physical theft, subcontractor/3rd party, and data on the move (e.g. losing a hard drive with sensitive information).

Various regulations, such as the Sarbanes-Oxley Act, the Gramm-Leach-Bliley Act, and the Fair Credit Reporting Act, govern how companies monitor, maintain, and handle the sensitive information that they obtain from their clients and customers in the course of doing business with them.

The Payment Card Industry Data Security Standard (PCI DSS), for example, outlines requirements for encrypting credit card numbers, while the more finance-oriented Securities and Exchange Commission Rule 30 of Regulation S-P requires that financial institutions adopt written policies and procedures that address the protection of customer information and records.

So how can you protect yourself and your data?

Here to help are some guidelines for a DLP solution, which can greatly reduce the risk of these kinds of breaches, protect your sensitive data, and fulfill your regulatory obligations.

Whitecodes are not your friend.


It’s fairly common knowledge that, at the most basic level, DLP and communications monitoring software work by scanning content, be it the participants, subject, body, or headers of an email or the body, properties, and even the alternate data streams of documents or files. When the software scans the content, it typically strips it down to raw text, largely ignoring factors such as font, style, text color, and in some cases formatting.

Whitecodes are an early method of data loss prevention that takes advantage of the fact that these programs view the content in its raw text format.

All one has to do is to type out a unique keyword, string, or code in one point font, change the color of the text to match the background, effectively rendering it invisible to the human eye, and place it somewhere in a document such as a credit report request, tax forms, or anything that could contain sensitive information. Then you just tell your system to look for that code and voila! Now you can see whenever someone distributes the document, problem solved!
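To make the mechanism concrete, here is a minimal sketch of why invisible text is still visible to a scanner. It assumes the document is HTML and uses Python's standard html.parser module; the marker WHITECODE0001 and the helper names TextExtractor and raw_text are made up for illustration and are not taken from any particular DLP product.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text content, discarding tags and styling."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def raw_text(html: str) -> str:
    """Strip an HTML document down to its raw text, as a scanner might."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())


# Hypothetical document: the marker is 1pt white text, invisible on screen.
document = (
    '<p>Credit report request for John Doe</p>'
    '<span style="font-size:1pt;color:#ffffff">WHITECODE0001</span>'
)

text = raw_text(document)
print(text)                     # Credit report request for John Doe WHITECODE0001
print("WHITECODE0001" in text)  # True
```

The font size and color never reach the scanner, so the hidden marker is as easy to find as any other word in the document.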

Now, before you run off and start marking up your important documents with these, it’s important to note that, while this might seem like a reasonable solution, there’s one major problem with this approach: it isn’t actually looking for anything.


And by that I mean, it’s not looking for the sensitive data itself.

Let’s say we have a credit card number on a document, and hidden somewhere in that document is the whitecode “WHITECODESTRING0000001234\Secret”. What would happen if someone decided just to copy that card number out of the document and paste it into an email, an IM, or into a different unmarked document?

This is the inherent issue with using whitecodes. They don’t search for the actual information that we want to identify, which makes getting around them relatively easy. We shouldn’t care about a single document that might have a credit card number on it; we should be focusing on capturing the credit card numbers themselves.
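The copy-and-paste problem can be sketched in a few lines. The whitecode string is the one from the example above; the card number, the sample texts, and the function name contains_whitecode are invented for illustration.

```python
# The hidden marker from the example above.
WHITECODE = "WHITECODESTRING0000001234\\Secret"


def contains_whitecode(raw_text: str) -> bool:
    """Return True if the hidden marker appears anywhere in the scanned text."""
    return WHITECODE in raw_text


# The marked document trips the policy because the marker travels with it.
marked_document = (
    "Credit report request\n"
    "Card number: 4427890001505321\n"
    + WHITECODE
)

# The same card number pasted into an email carries no marker at all.
pasted_email = "Here is the card number you asked for: 4427890001505321"

print(contains_whitecode(marked_document))  # True
print(contains_whitecode(pasted_email))     # False
```

The second message contains exactly the data we wanted to protect, yet a whitecode policy has nothing to match on.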

Pattern Matching using Regular Expressions.


A regular expression (regex) is a way of defining and identifying patterns, characters, and structures in text. Even when you tell your DLP solution to look for a whitecode, there’s a pretty good chance that you do so by entering one of these regular expressions.

Used for its powerful searching capabilities, regex can easily identify things like Social Security numbers by looking for nine digits, either in a row or separated by dashes in the standard SSN format (xxx-xx-xxxx), where the first three digits form an area number that the Social Security Administration could actually have issued. Historically, the area number reflected the state in which the number was issued rather than the state in which the individual was born, and some ranges have never been assigned at all.

Using this method, regex can capture numbers like 055-74-1022 but not 987-85-1234, because area numbers from 900 to 999 are never assigned.
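As a rough sketch of such a pattern, the Python below accepts nine digits with or without dashes and rejects area numbers the Social Security Administration never issues (000, 666, and 900 to 999), along with an all-zero group or serial number. The names SSN_PATTERN and find_ssns are illustrative, and a production policy would likely need finer rules than these.

```python
import re

SSN_PATTERN = re.compile(
    r"\b(?!000|666|9\d\d)\d{3}"  # area number: never 000, 666, or 900-999
    r"(-?)"                      # optional dash, remembered as group 1
    r"(?!00)\d{2}"               # group number: never 00
    r"\1"                        # same separator again (dash or nothing)
    r"(?!0000)\d{4}\b"           # serial number: never 0000
)


def find_ssns(text: str) -> list:
    """Return every substring that looks like a plausible SSN."""
    return [match.group(0) for match in SSN_PATTERN.finditer(text)]


print(find_ssns("SSN on file: 055-74-1022"))  # ['055-74-1022']
print(find_ssns("SSN on file: 055741022"))    # ['055741022']
print(find_ssns("SSN on file: 987-85-1234"))  # []
```

The backreference keeps the separators consistent, so a half-dashed string such as 055-741022 is not reported.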

This method is better in that, instead of looking for generic communications which may or may not contain sensitive information, we’re looking for the sensitive information itself. Regex is still limited, however, in its ability to account for the various security features built into certain number strings.

Let’s look at credit card numbers again. They have a six-digit Issuer Identification Number, an individual account identifier of up to 12 digits, and a single check digit at the end for validation, which results in a single number of 12 to 19 digits.

Think about that for a minute.

That’s up to 13,882,719,914,324 possible sequential number strings regex could be looking for. Even a solution that validates the check digit doesn’t change that number much: for any given run of digits, exactly one of the ten possible check digits will pass, so validation only cuts the candidates by a factor of ten. Also, the above isn’t taking into account the number string being formatted (xxxx xxxx xxxx xxxx).
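Check-digit validation for card numbers is normally done with the Luhn algorithm. The sketch below pairs a loose regex for 12 to 19 digits (allowing spaces or dashes between them) with a Luhn check. The names CANDIDATE, luhn_valid, and find_card_candidates are illustrative, and the sample number is made up.

```python
import re

# 12 to 19 digits, each optionally preceded by a single space or dash.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){11,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:   # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_candidates(text: str) -> list:
    """Return digit strings that match the pattern and pass the Luhn check."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))  # drop the formatting
        if luhn_valid(digits):
            hits.append(digits)
    return hits


print(find_card_candidates("card 4427 8900 0150 5321 thanks"))
# ['4427890001505321']

# For a fixed 15-digit prefix, exactly one of the ten check digits passes.
print(sum(luhn_valid("442789000150532" + str(d)) for d in range(10)))  # 1
```

The last line shows why validation alone is not much of a filter: one random digit string in ten still gets through.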

If one attempts to capture every message or file that could potentially contain a valid number string, the potential for false positives is extraordinarily high. Assuming that someone is reviewing whatever is captured, the volume would make finding any actual violations unlikely.

We need to filter out the noise.

Identifying violations in context.


Whichever solution you choose should provide you with the ability to fully customize your risk policies. Prepackaged or out-of-the-box policies might seem to be the answer, but they are often just templates which need to be customized to your environment.

Let’s take a look at the Credit Card regex from before. What are some ways in which we can capture violations, and reduce the amount of noise?

Well, what if we searched for indications in the communication or file that would tell us it’s a credit card number? We could say “look for this number around the mention of an expiration date, a three-digit CVV, or the mention of an issuer brand such as ‘Visa’ or ‘Mastercard’”.

If you received an email containing only “4427890001505321”, it wouldn’t make much sense to you. It might be a 16-digit number, which can be validated with a Luhn check, but there’s no context, so there’s no way to determine if it’s a violation or not.

In order for you to know that it’s a credit card number, it would have to also contain information that would let you know what it is. For example, if the message contained “4427890001505321 exp 10/19 John Doe CVV 523”, we can tell that they’re talking about a credit card number.

Knowing this, we would customize our policy to look for a valid number, a date with or without the word “expiration”, and a three-digit number in close proximity to the term “cvv”. With that policy in place, this message would be captured, and the appropriate actions could be taken.
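A policy like this can be sketched as a handful of patterns plus a proximity rule. Everything named below (CARD, EXPIRY, CVV, WINDOW, is_card_in_context) is hypothetical, and the 40-character window is an arbitrary assumption. A real DLP product would expose this through its own policy language rather than hand-written code.

```python
import re

CARD = re.compile(r"\b\d(?:[ -]?\d){11,18}\b")                     # 12-19 digits
EXPIRY = re.compile(r"\b(?:0[1-9]|1[0-2])/\d{2}(?:\d{2})?\b")      # MM/YY or MM/YYYY
CVV = re.compile(r"\bcvv\W{0,3}\d{3}\b", re.IGNORECASE)            # "CVV 523"
WINDOW = 40  # assumed: how many characters around the number count as "nearby"


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def is_card_in_context(text: str) -> bool:
    """Flag text only if a valid number has an expiry date and CVV nearby."""
    for match in CARD.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))
        if not luhn_valid(digits):
            continue
        start = max(0, match.start() - WINDOW)
        nearby = text[start:match.end() + WINDOW]
        if EXPIRY.search(nearby) and CVV.search(nearby):
            return True
    return False


print(is_card_in_context("4427890001505321 exp 10/19 John Doe CVV 523"))  # True
print(is_card_in_context("4427890001505321"))                             # False
```

The bare number is ignored while the message with an expiry date and CVV is captured, which is the noise reduction the policy is after.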

This is just one example, but there are numerous pieces of sensitive information, such as Social Security numbers, account numbers, or medical records, which can be caught more effectively using this method of capturing violations in context.

It’s important to mention that there is no foolproof technique or magic formula for ensuring that your data is safe, but that’s not to say that some methods aren’t better than others. It all just depends on what your needs are. Just keep the following things in mind when it comes time for you to implement a solution to protect your data:

There’s no such thing as a perfect one-size-fits-all policy that works for everyone, so the DLP solution you choose should allow you to customize and tailor its risk policies to your environment.

When you do customize your policies, make sure you’re looking for the actual information as opposed to a whitecode on a document that might contain sensitive information.

Reduce your false positives, and make your review of the events more manageable, by looking not just for the information itself, but for the surrounding context which makes it a violation.

Cover all your bases. The solution you choose should be able to protect your data across a variety of channels, such as data at rest, data in motion, web, email, and various applications.

Lastly, make sure that you update your solution and policies regularly. In addition to hotfixes, patches, and software updates, which ensure that your system is functioning the way it should, being proactive in updating your policies to capture the latest lingo will make or break the effectiveness of your solution. In roughly a thousand years, our dynamic language has gone from “HWÆT, WE GAR-DEna in geardagum, þeodcyninga þrym gefrunon” to “omg lulz”. That kind of evolution makes the constant effort of keeping the language of your policies and your DLP methods up to date vitally important.

After all, how embarrassing would it be to see the thing “protecting” your company’s most important and sensitive data being compared to a video player from the ’90s?