The Science of Privacy
As my job title (Data Scientist) may suggest, I wear my geekiness proudly. Among my favorite things to think about these days are “constrained logistic regression” and “bias-variance tradeoffs”. It comes with the territory. After all, building predictive models for an ad tech firm that uses consumer data to make ads more relevant requires a certain level of curiosity.
With data at the crux of most ad tech firms, concerns over its use and consumer privacy have naturally been widely discussed. While I certainly can’t claim the final word on this ongoing debate, I can bring some interesting insight to the topic — the science of privacy (yes, even privacy can’t be spared my geekiness). Understanding this science should be part of the debate, because the right data processing and storage techniques can preserve user privacy without squandering valuable signal. I’ll elaborate with examples.
The first trick is less about privacy preservation per se and more about good data science: knowing how to quantify the value of data. Bigger isn’t always better, because not all data is created equal. Determining the importance of different pieces and types of data makes decisions about what to collect and keep much simpler. At m6d, we optimize our models using cookies that don’t store PII (Personally Identifiable Information). And guess what? It works just fine. A good data policy is to avoid storing data that isn’t useful, and luckily PII isn’t especially useful for many predictive modeling tasks.
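As a toy illustration of quantifying a feature’s value (the data and the simple “lift” heuristic below are hypothetical, not m6d’s actual method): compare the outcome rate among users who have a given feature to the overall base rate. Features with lift near 1 carry little signal and are candidates for not being stored at all.

```python
# Back-of-the-envelope feature valuation, assuming binary features
# and a binary outcome ("converted"). Hypothetical names and data.
def feature_lift(users, feature):
    """Outcome rate among users with `feature`, divided by the base rate."""
    base = sum(u["converted"] for u in users) / len(users)
    with_f = [u for u in users if feature in u["features"]]
    if not with_f or base == 0:
        return 0.0
    rate = sum(u["converted"] for u in with_f) / len(with_f)
    return rate / base

# A tiny toy audience: which browsing features predict conversion?
users = [
    {"features": {"siteA", "siteB"}, "converted": 1},
    {"features": {"siteA"}, "converted": 1},
    {"features": {"siteC"}, "converted": 0},
    {"features": {"siteB", "siteC"}, "converted": 0},
]
# Here siteA carries real signal (lift 2.0) while siteC carries none (0.0).
```

Real systems use sturdier measures (information gain, held-out model performance), but the principle is the same: measure the signal before deciding what data is worth keeping.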
Another trick is to hash or encrypt the data as it streams into your system. Encryption is standard practice for firms that must store sensitive information (credit card companies, for example), and it can also be used in advertising without reducing targeting effectiveness.
Let’s say your data stream includes the purchase history of a user and the websites that user has visited. A single record might look like this:

1 www.togetherforever.com:1 www.iloveher.com:1 …
A good predictive model can learn that people who visit www.togetherforever.com and www.iloveher.com are likely shopping for an engagement ring. With that knowledge, you can certainly select some well-targeted ads. But of course those targets might not want anyone to know about this. An alternative representation of the same data could be:
1 931:1 6076:1 551:1 64:1 974:1 1206:1 14638:1 2395:1 638:1 14839:1 2716:1 12827:1
This representation is an example of hashed data. To the general public (myself included), the text as written is worthless. If constructed correctly, the algorithms that generate predictive models don’t care that the data is hashed, so predictive performance is never sacrificed. With respect to privacy, there is only upside.
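One common way to produce a representation like the one above is the “hashing trick”: each raw feature string is mapped deterministically to an integer bucket index, and the model only ever sees the indices. A minimal sketch in Python (the MD5 hash and bucket count are illustrative assumptions, not a description of m6d’s actual pipeline):

```python
import hashlib

NUM_BUCKETS = 2 ** 15  # size of the hashed feature space (an assumption)

def hash_feature(feature: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a raw feature string to a stable integer bucket index."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical raw browsing features for one user.
raw_features = ["www.togetherforever.com", "www.iloveher.com"]

# Sparse "label index:value" representation, like the hashed line above:
# the label survives, the readable URLs do not.
hashed_record = "1 " + " ".join(f"{hash_feature(f)}:1" for f in raw_features)
```

Collisions between features are possible, but with a large enough bucket count they rarely hurt predictive performance in practice.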
A third trick is to aggregate data so that you can group users with similar attributes or compute summary statistics for entities of interest. The technical literature formalizes this with a concept called k-anonymity: a dataset is k-anonymous if every record is indistinguishable from at least k − 1 others on its identifying attributes, which sharply reduces the probability of tying the data back to one specific user.
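A toy sketch of the idea, assuming two quasi-identifiers (age and ZIP code) and an illustrative k of 3: generalize the identifying attributes into coarser buckets, then suppress any record whose generalized group is smaller than k.

```python
from collections import Counter

K = 3  # minimum group size for k-anonymity (illustrative choice)

def generalize(record):
    """Coarsen quasi-identifiers: bucket age by decade, truncate ZIP code."""
    age_bucket = f"{(record['age'] // 10) * 10}s"
    zip_prefix = record["zip"][:3] + "**"
    return (age_bucket, zip_prefix)

def k_anonymize(records, k=K):
    """Return generalized records, dropping groups smaller than k."""
    groups = Counter(generalize(r) for r in records)
    return [generalize(r) for r in records if groups[generalize(r)] >= k]

# Hypothetical input: three similar users survive, the outlier is suppressed.
records = [
    {"age": 24, "zip": "10012"},
    {"age": 27, "zip": "10013"},
    {"age": 29, "zip": "10014"},
    {"age": 61, "zip": "94110"},  # unique even after generalization
]
anonymized = k_anonymize(records)  # three copies of ('20s', '100**')
```

Real k-anonymization chooses the generalization hierarchy more carefully, but the trade-off is the same: coarser attributes mean stronger anonymity and somewhat less signal.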
Although the discussion and examples above concern web browser activity, the same rules apply to data collected from mobile users. Geo-location data is likely to be both very predictive and controversial with regard to acceptable use. Census data is a good example of location-attribute data that has been aggregated to preserve privacy. Geo-location data can be encrypted like any other data, and employing this trick lets businesses use it without risking the exposure of sensitive information.
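One way this can work in practice (the grid size and hashing scheme below are assumptions for illustration, not a specific product’s method): snap each coordinate pair to a coarse grid, then store only a hash of the grid cell, so raw coordinates never persist.

```python
import hashlib

GRID = 0.1  # grid cell size in degrees (~11 km north-south); a design choice

def geo_bucket(lat: float, lon: float, grid: float = GRID) -> str:
    """Snap coordinates to a coarse grid cell, then hash the cell id."""
    cell_lat = round(lat / grid) * grid
    cell_lon = round(lon / grid) * grid
    cell_id = f"{cell_lat:.1f},{cell_lon:.1f}"
    return hashlib.sha256(cell_id.encode("utf-8")).hexdigest()[:12]

# Two nearby (hypothetical) pings land in the same anonymous cell...
a = geo_bucket(40.7411, -73.9897)
b = geo_bucket(40.7420, -73.9901)
# ...while a distant one does not.
c = geo_bucket(51.5000, -0.1000)
```

The coarser the grid, the stronger the privacy and the weaker the location signal; picking that trade-off is itself a modeling decision.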
We fully acknowledge that the use of these techniques won’t guarantee that the privacy debate is over. For every popular method that transforms data to establish anonymity, there are hackers and comp sci pros who — legitimately and otherwise — spend time and energy trying to break the code. Data abstinence is perhaps the only foolproof solution to preserving privacy. Most people realize that using data to provide better information and better services is ultimately a positive endeavor, so abstinence isn’t the likely outcome. As we do use data to create value, it’s important to remember that a lot can be predicted about a person’s actions without knowing anything personal about them. Recognizing this paradox is an important step toward gaining public trust and confidence, and helping the industry solve other important issues (don’t get me started on attribution!).
Brian joined m6d as head of the data science team in September 2008. He has led the development of m6d’s patent-pending machine learning technology as the company has gone from pre-revenue to supporting hundreds of concurrent client campaigns. His current research interests include building autonomous machine learning systems on big-data architectures, causal inference, and influence attribution. He recently served as co-chair of the 2012 KDD Cup competition. Prior to joining m6d, he was a Senior Research Analyst at Meetup.com and a credit risk modeler for American Express. He holds an MBA with a concentration in Statistics from NYU and a BS in Mathematics and French Literature from Rutgers.