Wednesday, 4 August 2010

Behavioral tracking / de-anonymisation reality - WSJ article

Privacy nightmare? Lots of personal info can be deduced from a single click, as the Wall Street Journal article On the Web's Cutting Edge, Anonymity in Name Only (a must-read by Emily Steel & Julia Angwin) has just reported -

"From a single click on a web site, [New York ad company] [x+1] correctly identified Carrie Isaac as a young Colorado Springs parent who lives on about $50,000 a year, shops at Wal-Mart and rents kids' videos… deduced that Paul Boulifard, a Nashville architect, is childless, likes to travel and buys used cars… determined that Thomas Burney, a Colorado building contractor, is a skier with a college degree and looks like he has good credit.

The company didn't get every detail correct. But its ability to make snap assessments of individuals is accurate enough that Capital One Financial Corp. uses [x+1]'s calculations to instantly decide which credit cards to show first-time visitors to its website…

…firms like [x+1] tap into vast databases of people's online behavior—mainly gathered surreptitiously by tracking technologies that have become ubiquitous on websites across the Internet. They don't have people's names, but cross-reference that data with records of home ownership, family income, marital status and favorite restaurants, among other things. Then, using statistical analysis, they start to make assumptions about the proclivities of individual Web surfers.

…A Wall Street Journal investigation into online privacy has found that the analytical skill of data handlers like [x+1] is transforming the Internet into a place where people are becoming anonymous in name only."

Kudos to the WSJ for their investigation. The article explained the technology further:

"A visitor lands on Capital One's credit-card page, and [x+1] instantly scans the information passed between the person's computer and the web page, which can be thousands of lines of code containing details on the user's computer. [x+1] also uses a new service from Digital Envoy Inc. that can determine the ZIP code where that computer is physically located. For some clients (but not Capital One), [x+1] also taps additional databases of web-browsing history.

Armed with its data, [x+1] taps consumer researcher Nielsen Co. to assign the visitor to one of 66 demographic groups.

In a fifth of a second, [x+1] says it can access and analyze thousands of pieces of information about a single user. It quickly scans for similar types of Capital One customers to make an educated guess about which credit cards to show the visitor."

See the WSJ's detailed predictions for different testers (including some of the code sent, what personal data [x+1] got right and what they guessed wrong), and also the WSJ's What They Know page generally.

I'd like to know though exactly what they meant by "containing details on the user's computer". It rather looks as if the code is generated through scanning the cookies (and Flash cookies etc etc?) saved by different advertisers on the user's computer from their prior visits to other websites.
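
For a rough sense of what kind of "details on the user's computer" can travel with an ordinary page request, here is a minimal sketch. It is purely illustrative and not a description of [x+1]'s actual method: the function name `summarize_request`, the header values and the cookie names are all made up for the example.

```python
from http.cookies import SimpleCookie


def summarize_request(headers):
    """Collect the parts of an HTTP request that describe the visitor's setup.

    `headers` is a plain dict of request header names to values.
    (Hypothetical helper, for illustration only.)
    """
    cookies = SimpleCookie()
    cookies.load(headers.get("Cookie", ""))
    return {
        "browser_and_os": headers.get("User-Agent"),
        "language": headers.get("Accept-Language"),
        "came_from": headers.get("Referer"),
        "cookies": {name: morsel.value for name, morsel in cookies.items()},
    }


# Invented example values, not taken from the WSJ article.
example_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) ExampleBrowser/1.0",
    "Accept-Language": "en-US,en;q=0.8",
    "Referer": "http://example.com/search?q=credit+cards",
    "Cookie": "visitor_id=abc123; segment=42",
}

profile = summarize_request(example_headers)
for field, value in profile.items():
    print(field, "->", value)
```

Note that a browser only sends a site the cookies set for that site's own domain, so any cross-site picture would have to come from trackers embedded on many different pages, each receiving its own cookie.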

The article also pointed out that the algorithms' evaluation of one tester came "extremely close" to identifying him personally. EFF staff scientist Peter Eckersley worked out that the tester's location (a small town) and his Nielsen demographic segment together gave 26.5 bits of info about him, meaning they'd know that he had to be one of only about 64 possible people in the whole world. Add just one more piece of info, like his age, and they could probably totally de-anonymise him, i.e. identify him precisely.

(I'll blog "bits of entropy" properly another time; meanwhile see privacy researcher Arvind Narayanan's short explanation on why he calls his blog, another must-read, "33 Bits of Entropy". And this blog. There are 6.6 billion people in the world and log2 6.6 billion is about 33, if you must know!)
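
The arithmetic behind those figures can be sketched in a few lines. This is only a back-of-the-envelope check, assuming the 6.6 billion population figure and the 26.5 bits quoted above; the function names are made up for the example.

```python
import math

WORLD_POPULATION = 6_600_000_000  # population figure used in this post
KNOWN_BITS = 26.5                 # location + Nielsen segment, per Eckersley


def bits_to_identify(population):
    """Bits of information needed to single out one person in `population`."""
    return math.log2(population)


def remaining_candidates(population, known_bits):
    """People still matching once `known_bits` of information are known."""
    return population / 2 ** known_bits


total_bits = bits_to_identify(WORLD_POPULATION)
candidates = remaining_candidates(WORLD_POPULATION, KNOWN_BITS)
one_more_bit = remaining_candidates(WORLD_POPULATION, KNOWN_BITS + 1)

print(f"Bits needed to identify one person: {total_bits:.1f}")
print(f"Candidates left after {KNOWN_BITS} bits: {candidates:.0f}")
print(f"Candidates left after one further bit: {one_more_bit:.0f}")
```

With these inputs the total comes out at just under 33 bits and the pool of candidates at roughly 70, the same ballpark as the 64 or so quoted in the article (the exact number depends on the population figure and rounding used). Each additional full bit halves the pool, and a fact like age carries several bits, which is why so little extra information is needed.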

I shall just leave you with the unforgettable image conjured up by this screenshot of the article, dear readers (increasing my collection of funny typos) -



I, too, can make predictions about people based on whether they lick a website. But I'd rather not.

©WH. This work is licensed under a Creative Commons Attribution Non-Commercial Share-Alike England 2.0 Licence. Please attribute to WH, Tech and Law, and link to the original blog post page. Moral rights asserted.