Saturday, 12 June 2010

Search engines, data protection & Article 29 Working Party - table summary - data retention, anonymization & the last octet, and hashing

Search data can be very personal. The search engines can get to know an awful lot of info about a person and their doings from their search queries.

I've created a table summarising the current positions of the EU Article 29 Working Party and the search engines on search query data. It's based on the public Article 29 Working Party letters about the various exchanges between the Working Party and the 3 main search engines: Google, Microsoft Bing and Yahoo! (the latest being letters sent by the Working Party a few weeks ago, in May 2010).

You'll see from the table that Microsoft seems to have engaged most with the Working Party (which comprises the EU data protection regulators), and Yahoo has gone the furthest. But in all cases the Working Party wants personal data to be deleted ASAP, whereas the search engines want to retain it for as long as they can. They also want to "anonymize" the data rather than delete it completely, usually by substituting another identifier for an IP address or cookie.

So it's not just a question of how long the search engines keep personal data, but also, if they don't delete the data, how well they scrub the data of identifying information, i.e. the quality of the anonymisation techniques used, not to mention transparency about the search engines' hashing or anonymisation techniques. This will clearly be a big issue for the future.

And, as the WP have said, even if you get rid of all other data to do with a query, the search terms themselves can be identifying, e.g. of someone who does an egosearch. If you can link the search queries made in different sessions to the same person, whether through the searcher having the same IP address, the same cookie or an artificial substitute ID number (an "anonymous ID"), you can still put them together to identify the individual concerned. In the WP's words, the capacity to link individual searches may reveal enough personal data to identify an individual data subject.

I've taken the liberty of framing some of the following in the form of "WP said, Google said" for ease of reading - obviously they're informal paraphrases. And I've not included everything from the WP letters, just some key points. (I've outlined IP addresses, last octet, cookies and hashing at the end of this blog, for anyone not familiar with them.)

Article 29 Working Party WP148 opinion 1/2008, April 2008, to search engines

Personal data related to search queries is very sensitive, and search history should be treated as confidential personal data.

    Note - see the well known Google video, above, giving away lots of info about a (presumably fictional) person's life - his story, literally - just from his search queries.

The retention period shouldn't be longer than necessary for the specific purpose, and then the data should be deleted.

Even if IP address or cookie is replaced by a unique identifier, the individual can still be identified by correlating stored queries.

Responses from search engines (notably at hearings of the search engines with the WP in February 2009)

Google

Microsoft

Yahoo

IP addresses - "anonymized" after 9 months by deleting the last octet.

    Note - deleting the last octet is like deleting the building number from your street address (building, street, city, country) where 255 other people live on the same street.

Cookies - kept for 18 months.

Search history is immediately de-identified by storing search logs separately from registration data (name, address etc), and search logs are effectively anonymised.

We're willing to reduce retention periods for cookies and IP addresses to 6 months - if the other search engines do too.

We were going to reduce our retention period to 13 months - but now we'll reduce that to 90 days (with limited exceptions for fraud detection, security, legal obligations.)

IP addresses - last octet is deleted, but for fraud detection a 1 way secret hash is applied to the last octet.

Cookies - a 1-way secret hash is applied to cookies of unregistered users (and to registration identifiers of registered users; then we delete (truncate) 50% of the hashed registration identifiers).

Article 29 Working Party to search engines, October 2009 (see the letters for more details) -

"Anonymisation" isn't proper anonymisation unless it's fully effective and irreversible.

IP addresses - it should be 6 months, and deleting the last octet isn't good enough.

Cookies - retention of cookies allows correlation of individual search queries, and seems to allow easy retrieval of IP addresses for new queries made in those 18 months.

Caches - it seems your caches are updated nowhere near quickly enough; removal tool needs improvement.

Well done on the search history anonymisation, leadership medal to you!

But really, you should reduce your retention period to 6 months whatever your competitors do.

Hashing still doesn't prevent linking / associating different searches.

    Note - if you hash my user ID (e.g. smith) to change it to e.g. h5s, you'll still know my searches are by the same person, i.e. h5s, and you can still correlate them.

50% deletion may not be good enough anonymisation.

How much and for how long are search data kept for litigation or legal obligation purposes?

Search engine response to Article 29 Working Party - these represent the current position of the search engines as at the date of writing

Silenzio, it seems.

So, still the same as before - deletion of last octet of IP address after 9 months, and deletion (or is it just "anonymisation"?) of cookies after 18 months.

OK.

Immediately after a search - we'll de-identify cookies by applying a 1-way hash.

IP addresses - will be deleted after 6 months.

Cookies - will be deleted, together with other remaining cross session identifiers, after 18 months.

    Note - it's said the de-identification procedure and hash are applied to cookies after 6 months (for registered users, presumably if logged in at the time) or 18 months (for unregistered users), but that seems to contradict de-identification "immediately after a search"?

Fine.

IP addresses - will now be deleted in full, after 90 days - not just the last octet.

Article 29 Working Party to search engines, letters 26 May 2010

Google dominates the EU search market, with 95% market share in some countries.

Fair and lawful personal data processing by search engines is increasingly crucial given audiovisual data and geolocation.

Google's "apparent lack of focus on privacy in this area is concerning".

Using an "anonymous ID" still seems to allow cross-matching of search queries for a long time.

Hashing techniques - you haven't given us sufficient info to assess the technical quality of your anonymisation policy.

Deletion of only part of the data isn't true anonymisation.

And you've not given sufficient details on the hashing, especially of user identifiers and cookies.

You're all still not compliant with data protection law.

You should review your anonymisation claims and make the anonymization process verifiable, preferably by developing a credible audit process involving an external and independent auditing entity.

Microsoft & Yahoo haven't given enough info about their techniques for hashing user identifiers and cookies in order for us to assess how effective their anonymisation policy is. (Google don't seem to hash at all.)

The actual anonymisation techniques used deserve open debate and public scrutiny in light of now well known anonymisation failures [i.e. where supposedly anonymised data was successfully linked back to identified individuals - see more on anonymization techniques].

We're asking the US FTC to examine your behaviour under section 5 Federal Trade Commission Act (unfair or deceptive practices). And we're copying this to the European Commission Vice-President in charge of Justice, Fundamental Rights and Citizenship.

I imagine most people who read this blog will be familiar with IP addresses, cookies and hashing, but for those who aren't, I'm planning to write detailed blogs in due course. Meanwhile -

IP address

The internet address of your computer, or strictly your router (the box provided by your ISP to connect you to the phone line or cable). It's automatically recorded by websites like search engines when you visit them. Often your IP address is assigned dynamically by your ISP, e.g. BT or Virgin, so it could (but needn't) change each time you dial up or disconnect and reconnect. If you have a permanent fixed IP address, which you usually have to pay your ISP more for, then by definition it shouldn't change.

So a website could identify you from your IP address. With the twist that obviously different people might use the same computer, or use different computers connected to the same router, which means that different searches made from the same IP address could in fact be made by different people from the same household or business, rather than by the same person. Alternatively, of course, they could still be by the same person…

An IP (version 4) address consists of a series of 32 binary digits (bits), i.e. 32 ones and zeroes in a row - 4 groups of 8 bits each, hence the references above to "octet". Example - 11010001.01010101.11100011.10010011

A slightly more human-friendly version of an IP address breaks it up into 4 decimal numbers in a row, separated by dots, in what's called dot decimal notation. Many people will have seen an IP address in this form, e.g. 192.168.1.254. Each of the 4 dot-separated numbers can't exceed 255 (as the biggest octet possible, 11111111, equals 255 in decimal).

With an IP address like 209.85.227.147 (which is 11010001.01010101.11100011.10010011 in binary), deleting the last octet would involve deleting the 10010011, i.e. the 147.
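For anyone who'd like to check the arithmetic, here's a minimal Python sketch of the conversion and of "deleting" the last octet. The function names are my own inventions for illustration, and zeroing the last octet is just one way of truncating; it isn't a description of how any search engine actually does it. Each octet is printed with all 8 bits, leading zeroes included.

```python
import ipaddress


def to_binary_octets(ip: str) -> str:
    """Render a dotted-decimal IPv4 address as four dot-separated 8-bit groups."""
    return ".".join(f"{int(part):08b}" for part in ip.split("."))


def drop_last_octet(ip: str) -> str:
    """Zero the final octet, keeping only the first 24 bits of the address."""
    # strict=False lets us pass a host address and have the host bits masked off.
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


address = "209.85.227.147"
print(to_binary_octets(address))   # 11010001.01010101.11100011.10010011
print(drop_last_octet(address))    # 209.85.227.0

# How many addresses share the same first three octets (0 to 255 in the last one)
print(ipaddress.ip_network("209.85.227.0/24").num_addresses)  # 256
```

So after truncation the stored value only narrows things down to a group of 256 possible addresses, which is where the "255 other people on the same street" comparison comes from.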

Cookie

A file that can be stored on your computer via your web browser when you visit a website. The site decides what info to store, and it can if it wishes allow third party sites (typically advertisers) to store cookies on your computer too. Cookies can be persistent and last across different visits to the site, even on different days. (Some sites set the expiry date to a ludicrous one like 99 years away!)

When you visit the same site again later it can read the info contained in the cookie it stored, so, quite independently of IP addresses, cookies can enable the site to recognise that the same web browser is visiting it again and, depending on the info you gave the site previously and the info stored in the cookie, can even identify you personally. I.e., cookies can be personal data.
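To make that a bit more concrete, here's a rough sketch using Python's standard http.cookies module. The cookie name visitor_id, its value and the one-year lifetime are all made up for illustration; real sites choose their own names, values and expiry dates.

```python
from http.cookies import SimpleCookie

# First visit: the site asks the browser to store a persistent cookie.
outgoing = SimpleCookie()
outgoing["visitor_id"] = "abc123"                       # hypothetical identifier
outgoing["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # keep for one year, in seconds
outgoing["visitor_id"]["path"] = "/"
set_cookie_header = outgoing.output(header="Set-Cookie:")
print(set_cookie_header)

# Later visit: the browser sends the stored value back in its Cookie header,
# and the site parses it to recognise the returning browser.
incoming = SimpleCookie()
incoming.load("visitor_id=abc123")
print(incoming["visitor_id"].value)
```

Nothing in this exchange involves the IP address, which is why a cookie can tie visits together even when the IP address has changed.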

An advertiser can retrieve its cookie if you browse to another site that has the same advertiser's ads, even if the cookie was stored by the advertiser on your computer while you were visiting a different site. So advertisers can track your visits across different websites.

I'll leave flash cookies and the like to another blog.

Hashing

Applying a particular mathematical process or procedure to something, e.g. a name or a file, to produce something else, usually a shorter series of letters and numbers that is, ideally, unique to the original.

Here's a highly unrealistic example. Say my extremely original and exciting hash for a name involves counting the number of letters in the name, taking the 1st letter and last letter, and then combining them in the order: last letter, number of letters, and 1st letter. Applying this never seen before (nor likely to be seen again) "T&L hash" to the name "smith", I'd get "h5s".

Of course in practice it's much, much more complicated and cryptic (and effective!) than that, but you get the drift.
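For the curious, the toy "T&L hash" described above fits in a few lines of Python. The function name tl_hash is just my label for it, and this is strictly the joke version, not anything a search engine would use.

```python
def tl_hash(name: str) -> str:
    """Toy 'T&L hash': last letter, then number of letters, then first letter."""
    if not name:
        raise ValueError("name must not be empty")
    return f"{name[-1]}{len(name)}{name[0]}"


print(tl_hash("smith"))  # h5s
print(tl_hash("smyth"))  # h5s as well: two different names, same result
```

The second line shows one reason the toy version is so unrealistic: different names can easily end up with the same hash, which a properly designed hash tries very hard to avoid.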

A "secret" hash simply means, well, it's a secret isn't it. The process used is kept secret.

And "1-way" means the procedure is irreversible, you can only go 1 way, you can't work out from the result what the original name (or whatever) was. From "h5s" you might, if you knew the process I used, figure out that the original name was "s---h", but that won't give you the middle 3 letters. (Though you could guess them based on your independent background knowledge if you knew it was an English surname, of course.)
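Here's a hedged sketch of what a more realistic "1-way secret hash" could look like, and of why it doesn't stop searches being linked. I've assumed a keyed hash (HMAC with SHA-256) from Python's standard library; the search engines haven't published their actual techniques, so this is a stand-in, and the key, cookie IDs and queries are all invented for illustration.

```python
import hashlib
import hmac
from collections import defaultdict

# Assumption: the "secret" is a key known only to the search engine.
SECRET_KEY = b"hypothetical-secret-key"


def pseudonymise(identifier: str) -> str:
    """Replace an identifier with a keyed one-way hash (shortened for display)."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# Made-up search log entries: (cookie ID, search query)
search_log = [
    ("cookie-A", "flights to lisbon"),
    ("cookie-B", "cheap laptops"),
    ("cookie-A", "hotels near my office"),
]

# Group the queries by the hashed ID instead of the original cookie ID.
by_pseudonym = defaultdict(list)
for cookie_id, query in search_log:
    by_pseudonym[pseudonymise(cookie_id)].append(query)

for pseudonym, queries in by_pseudonym.items():
    print(pseudonym, queries)
```

The same cookie always hashes to the same value, so both of cookie-A's searches still end up grouped under one pseudonym. You can't get back from the hash to the cookie, but the searches remain correlated, which is exactly the WP's point about hashing not preventing linking.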

©WH. This work is licensed under a Creative Commons Attribution Non-Commercial Share-Alike England 2.0 Licence. Please attribute to WH, Tech and Law, and link to the original blog post page. Moral rights asserted.