Friday, 8 January 2010

The Data Dozen - Identity Management for Privacy

Here's a suggested list, the "Data Dozen", of 12 minimum essential requirements for proper privacy-protecting identity management systems. These are just some thoughts - any views would be very welcome.

The Data Dozen of Identity Management for Privacy

  1. Correct base information - verification of source info.
  2. Good identifiers.
  3. What data is disclosed - data minimisation and granularity of user control.
  4. True, informed consent.
  5. Secure transmission and storage of the data.
  6. Proper constraints on access to disclosed data and secondary usage of data.
  7. Retention and destruction of data when "expired" or on revocation of consent etc.
  8. Accountability and transparency.
  9. Better anonymisation techniques, suitable for the individual purpose.
  10. Checks and balances (both legal and technological) in relation to data aggregation, data mining and profiling and similar techniques.
  11. Improved usability, awareness and education of users and designers & developers.
  12. Secure and safe delegation of a user's rights.

All of these need to be supported by suitable laws and business processes/practices as well as technical standards and technology - bearing in mind the policy decisions to be made and the constraints (on which see e.g. the OECD Working Party on Information Security and Privacy's paper The Role of Digital Identity Management in the Internet Economy: a Primer for Policy Makers 2009, and At a Crossroads: 'personhood' and Digital Identity in the Information Society 2007).

But as mentioned previously in relation to privacy enhancing technologies (PETs) generally, I think that truly privacy-preserving identity management systems are a pipe dream at the moment - unless and until people are prepared to pay for them, or their use is compelled by laws and regulations which are supported by adequate penalties for breach and whose compliance is properly monitored, policed and enforced.

And now in more detail.

1. Correct base information - verification of source info

Authorities need to properly verify the correctness of the information on which the identity credentials they issue will be based. This should go without saying.

It's no good having official credentials which verify the identity of the wrong person or bear the wrong personal details. In this regard it's not just tech, it's people and processes too.

While nothing can be 100% perfect, it's a case of garbage in, garbage out, and things could and should be improved in this regard - especially if certain forms of identification, whether online or offline, are to be made effectively compulsory, such as identity cards.

A very readable paper pointing out some of the issues and problems in this context (and others) is James A Lewis's Authentication 2.0 - New Opportunities for Online Identification, aimed mainly at policy makers.

2. Good identifiers

Identity data generated by authorities or other organisations - e.g. identifiers intended to be used, linkably, as a means of uniquely identifying individuals (or other entities or resources), such as ID card numbers or social security numbers - needs to be secure and generated in a secure way.

Unlike the US social security number, which was recently found to be guessable from date and place of birth! (See the widely reported 2009 paper "Predicting Social Security Numbers from Public Data" by Alessandro Acquisti and Ralph Gross.)
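By way of illustration only, here's a minimal Python sketch contrasting a structured identifier assembled from guessable facts (the pattern that made SSNs predictable) with one drawn from a cryptographically secure random source - the field names and format are invented for the example, not taken from any real scheme:

```python
import secrets

def weak_identifier(birth_date: str, area_code: str, sequence: int) -> str:
    """A structured identifier built from guessable facts (date and place of
    birth plus a roughly sequential counter) - easy to predict."""
    return f"{area_code}-{birth_date}-{sequence:04d}"

def strong_identifier() -> str:
    """An identifier drawn from a cryptographically secure random source:
    128 bits of entropy and no embedded personal information."""
    return secrets.token_hex(16)

print(weak_identifier("19800101", "042", 17))  # e.g. 042-19800101-0017
print(strong_identifier())                     # e.g. 'f3a9c2...' - unguessable
```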

Use of identifiers is another issue, discussed at 9 below.

3. What data is disclosed - data minimisation and granularity of user control

Minimal "need to know" disclosure and fine grained user control over what is disclosed (rather than the current "all or nothing" approach) need to be mandated, via both law and technology - in the context of the purpose of the requester in seeking the information, and whether it's truly needed for the transaction in hand - with support for partial identities and pseudonymity, and with negotiation of disclosure & privacy policies.

4. True, informed consent

There need to be mechanisms for proper informed consent - opt in rather than opt out - which ensure it's true consent, not engineered consent, and which support revocation of consent (see the EnCoRe work, including their publications).
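A minimal sketch of what such a consent record might look like - opt-in by default (no consent until granted), tied to a specific purpose, and with revocation that actually takes effect. The structure is my own illustration, not the EnCoRe design:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    subject_id: str
    purpose: str
    granted_at: Optional[datetime] = None   # opt-in: nothing assumed by default
    revoked_at: Optional[datetime] = None

    def grant(self) -> None:
        self.granted_at = datetime.now(timezone.utc)
        self.revoked_at = None

    def revoke(self) -> None:
        self.revoked_at = datetime.now(timezone.utc)

    @property
    def is_active(self) -> bool:
        return self.granted_at is not None and self.revoked_at is None

consent = ConsentRecord("user-42", "marketing_email")
assert not consent.is_active   # no pre-ticked boxes
consent.grant()
consent.revoke()
assert not consent.is_active   # revocation must actually bite
```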

In relation to engineered consent it's well worth reading Soft Surveillance, Hard Consent: The Law and Psychology of Engineering Consent from the excellent (free to download) Identity Trail book.

What's "engineered consent"? If you have to give more information than is strictly necessary to access a web site or service or to buy a product online (e.g. your home address, which really isn't needed if you're buying software via download), then it is likely that you will provide the information (and consent to whatever broad uses of your personal information that the site requires - the box will probably be pre-ticked of course!), simply in order to achieve your immediate goal of accessing the site or getting the goods or services you want.

And then very likely you won't get around to going back to the site later to revoke or withdraw your consent, and/or you will later decide that actually you were happy to give your consent in the first place after all, for reasons to do with human psychology (decision theory's discounted subjective utility, prospect theory and cognitive dissonance - see Soft Surveillance, Hard Consent: The Law and Psychology of Engineering Consent which explains those concepts).

So in many senses your consent - whether in relation to the initial disclosure of private data, its ongoing use or the withdrawal of consent - has been "engineered", and is not true, free consent.

5. Secure transmission and storage of the data

A given. And ideally allowing anonymous or pseudonymous communications.
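Encrypting stored personal data is straightforward with standard libraries - this sketch uses the Python cryptography package's Fernet recipe, with key management (the genuinely hard part) glossed over:

```python
# pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice held in a proper key management system
box = Fernet(key)

ciphertext = box.encrypt(b"date_of_birth=1980-01-01")
assert box.decrypt(ciphertext) == b"date_of_birth=1980-01-01"
```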

6. Proper constraints on access to disclosed data and secondary usage of data

i.e. access control and the like - including considering properly how the constraints are determined, communicated and enforced within an organisation - and with properly enforced data handling and data release policies that are "sticky", i.e. that travel with the data.
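A toy sketch of the "sticky policy" idea - machine-readable constraints that travel with the data and are checked on every access (all names and values here are illustrative):

```python
from datetime import datetime, timezone

record = {
    "value": "blood_group=O+",
    "policy": {                       # the policy travels with the data itself
        "allowed_recipients": {"hospital_a"},
        "allowed_purposes": {"treatment"},
        "expires": datetime(2011, 1, 1, tzinfo=timezone.utc),
    },
}

def access(record: dict, recipient: str, purpose: str) -> str:
    policy = record["policy"]
    if recipient not in policy["allowed_recipients"]:
        raise PermissionError("recipient not permitted by the data's policy")
    if purpose not in policy["allowed_purposes"]:
        raise PermissionError("secondary use not covered by consent")
    if datetime.now(timezone.utc) >= policy["expires"]:
        raise PermissionError("policy has expired")
    return record["value"]

# access(record, "hospital_a", "treatment") returns the value while in force;
# access(record, "insurer_b", "underwriting") raises PermissionError.
```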

7. Retention and destruction of data when "expired" or on revocation of consent etc

Provision for self-destructing data (along the lines of Vanish) when no longer needed for the purpose consented to.
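In more mundane settings than Vanish, even a simple retention sweep helps - a sketch in which anything past its retention period, or whose consent has been revoked, is deleted rather than kept "just in case" (the names and retention period are illustrative):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)   # illustrative retention period

def purge(store: dict, revoked_subjects: set) -> None:
    """Delete records that have expired or whose consent was revoked."""
    now = datetime.now(timezone.utc)
    for subject_id, record in list(store.items()):
        expired = now - record["collected_at"] > RETENTION
        if expired or subject_id in revoked_subjects:
            del store[subject_id]   # a real system would use secure deletion
```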

Safeguards and mechanisms for securing or destroying the data are also needed in the event of the change of ownership, sale of business/assets or insolvency of the data-holding entity (consent to one entity doesn't imply consent to another entity, but that is currently assumed). Recall the problems after Verified Identity Pass, which ran the travel security service CLEAR in the USA and held biometric etc information on many US citizens, became insolvent.

8. Accountability and transparency

- through logging and, ideally, the ability for users to check what's held about them and how, by whom and for what purposes it has been accessed - i.e. systems oriented toward information accountability and appropriate use.

See e.g. the Galway report.
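At its simplest, information accountability starts with an access log recording who looked at whose data and why, which the data subject or a regulator can later audit - a minimal sketch:

```python
import json
from datetime import datetime, timezone

def log_access(logfile, accessor: str, subject_id: str, purpose: str) -> None:
    """Append one audit entry per access; ideally the log is tamper-evident."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": accessor,
        "whose_data": subject_id,
        "purpose": purpose,
    }
    logfile.write(json.dumps(entry) + "\n")

with open("access.log", "a") as f:
    log_access(f, accessor="dr_smith", subject_id="patient-17", purpose="treatment")
```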

9. Better anonymisation techniques, suitable for the individual purpose

- for medical records etc disclosed or accessed for legitimate scientific research, to ensure that anonymisation is done as far as possible without compromising confidentiality or privacy by allowing identification through re-linking.

Re-identification has become increasingly topical. In a well publicised paper of August 2009 on the risks of re-identification ("Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization"), US engineer turned lawyer Paul Ohm detailed some well known situations where re-identification had been relatively easily achieved. (See also An Overview of Techniques for De-Identifying Personal Health Information.)

One famous example involved online movie rental company Netflix publicly releasing massive datasets about their users for a competition to improve their movie recommendation algorithm. From that data, researchers Arvind Narayanan and Vitaly Shmatikov were able to re-identify individual users, including information about their politics and sexuality.

Despite all this, Netflix are planning yet another similar contest, and are being sued by a lesbian mother for disclosing inadequately anonymized data which could enable her to be outed.

Indeed next time round Netflix want to release even more info about their users including ZIP codes, ages and gender - even though some time ago Latanya Sweeney showed that zip code, birth date and gender combined were enough to uniquely identify 87% of the US population (demonstrated by her pinpointing Massachusetts Governor William Weld's health data from a supposedly anonymised set of state employee patient records)!

It's not just health data that needs adequate anonymisation. Individuals have also been re-identified from GPS data, even though geo-location data is some of the most private there is from a personal security and privacy viewpoint (there's a reason why "We know where you live" is one of the most potent threats). Incidentally, location data isn't considered "sensitive data" requiring more careful treatment under the EU Data Protection Directive, but it ought to be.

Equally important and potentially troubling, users have also been re-identified from "anonymised" social networking data (by Narayanan and other researchers; see the Broken Promises paper for references).

On identification generally (not just re-identification of people from released "anonymised" data), Ben Laurie puts it very well - linkability is the key:

"If you view life as a sequence of transactions, then privacy can be defined as
the linkability of the transactions. Absolute privacy occurs when none of the
transactions are linkable, and zero privacy occurs when they are all publicly
available and completely linkable."

So one general question is, do unique identifiers really always have to be used for the same people in different contexts (e.g. Social Security numbers)? To what extent can the constant use of the same unique ID be avoided? Thought needs to be given to this because it's obviously key to linkability. (It's possible to have ID cards which support data minimisation, pseudonymity & unlinkability - see Dave Birch's brilliant paper Psychic ID, a great read especially if you're a Doctor Who fan! No don't run away, really.)
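One standard way to avoid reusing the same identifier everywhere is to derive a different pseudonym for each context from a secret that only the user (or issuer) holds - a minimal sketch using a keyed hash:

```python
import hashlib
import hmac

def context_pseudonym(master_secret: bytes, context: str) -> str:
    """Stable within a context, but different contexts see different
    identifiers that cannot be linked without knowing the secret."""
    return hmac.new(master_secret, context.encode(), hashlib.sha256).hexdigest()[:16]

secret = b"held-by-the-user-or-issuer"
print(context_pseudonym(secret, "tax_office"))      # one identifier here...
print(context_pseudonym(secret, "health_service"))  # ...a different one here
```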

Probably the toughest issue to address is, as Paul Ohm has pointed out, how to release data that's anonymised enough to protect privacy while still being useful enough to medical and other legitimate researchers. The same dataset may well have a different "balance point", and may need to be anonymised in different ways and to different extents, when released for different research purposes.
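The k-anonymity family of techniques illustrates that balance point: quasi-identifiers such as ZIP code, age and gender are generalised until every combination is shared by at least k people - the more you generalise, the safer and the less useful the released data becomes. A rough sketch:

```python
from collections import Counter

def generalise(record: dict) -> tuple:
    """Coarsen the quasi-identifiers: truncate the ZIP code, band the age."""
    return (record["zip"][:3] + "**",
            f"{(record['age'] // 10) * 10}s",
            record["gender"])

def is_k_anonymous(records: list, k: int) -> bool:
    """True if every generalised combination covers at least k individuals."""
    groups = Counter(generalise(r) for r in records)
    return all(count >= k for count in groups.values())
```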

While on the subject of anonymity and pseudonymity, and why people should need to be anonymous or pseudonymous, I like Ben Laurie's blog post - being able to create multiple personas in the digital environment would simply allow us to do online what we naturally do (and need to do) in real life:

"It seems to me that this is merely reflecting what people do in meatspace: your colleagues at work don’t need to know what you do in your bedroom, or what you had for dinner. Why should this be any different on the ‘net? It shouldn’t – but the way we’ve set it up means it is. A resourceful gatherer of data can correlate everything I do online. This is bad, not just because I’m a privacy nut, but because it actually affects peoples lives, and not in a positive way: studies have shown that if people believe they are being observed, then they tend to alter their behaviour to match what they think the observer wants to see. I want people to be able to do their thing without fear of consequences from bigots or The Man or even “ordinary people”. None of us are ordinary and the world will be a poorer place if we were made to be."

10. Checks and balances (both legal and technological) in relation to data aggregation, data mining and profiling and similar techniques

Unique IDs, aggregation and linkability are again relevant here. New techniques for processing data sets to glean and link personally identifiable information will undoubtedly arise over time, and need to be monitored and if necessary regulated.

In the case of Internet search engine usage tracking, it's technically possible to use cryptographic tools to distort users' profiles when they use search engines, preserving their privacy and minimising aggregation of their data (see the 2009 paper Preserving user’s privacy in web search engines by Jordi Castellà-Roca, Alexandre Viejo and Jordi Herrera-Joancomartí). But are users willing to pay for better privacy, and would such technologies be adopted absent a legal requirement for them?
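This is not the cryptographic protocol from that paper, but the simpler "cover query" idea gives a flavour of profile distortion - real queries are submitted mixed in with plausible dummies, blurring the profile the search engine can build:

```python
import random

DUMMY_POOL = ["weather forecast", "train times", "recipe ideas", "news today"]

def obfuscated_batch(real_query: str, noise: int = 3) -> list:
    """Mix the real query with randomly chosen dummy queries."""
    batch = [real_query] + random.sample(DUMMY_POOL, k=noise)
    random.shuffle(batch)
    return batch

print(obfuscated_batch("symptoms of condition X"))
```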

On online behavioural tracking specifically, which is another topical issue, a useful paper is Online Behavioral Tracking and Targeting: Legislative Primer 2009, detailing concerns and solutions from the perspective of the US Center for Digital Democracy and certain other organisations.

11. Improved usability, awareness and education of users and designers & developers

- that takes into account the various privacy paradoxes, in order to empower users to disclose their data selectively in an informed way in accordance with their individual needs and preferences.

The privacy paradox: people say that they value their privacy and online anonymity, but then disclose more personal info online than seems consistent with their statements. But there are also more paradoxes:

The control paradox:

"People desire full control on their personal data, but avoid the hassle to keep it up to date. People know that there are technology tools to protect them and think they may be efficient, but they do not use them."

The responsibility paradox:

"While most people believe that it is either their own responsibility, they seem to admit that many users do not have the knowledge to do this effectively."

- and the awareness paradox: greater knowledge of data protection rights does not seem to "influence the behavioural intention to adopt digital services based on personal data disclosure".

What explains these paradoxes? It seems to be mainly a question of organisations exploiting elements of human psychology, combined with insufficient privacy (and privacy tools) literacy amongst citizens and consumers.

In this area much attention has recently focused on privacy salience ("Reassuring people about privacy makes them more, not less, concerned"). It seems that organisations have designed websites around this concept so as not to remind people that there are privacy concerns, given that websites usually want people to be as free as possible with their personal information - their business and profitability depend on getting as much personal data out of people as possible.

In other words, if you have been alerted to privacy issues e.g. by a prominent privacy warning, then you are more likely to be careful about privacy and to disclose less information than if the issue had not been brought to mind - which is why social networking sites tend to avoid drawing attention to their privacy notices or policies especially when new users are signing up.

But equally relevant is the concept of engineered consent (discussed at 4 above). Indeed, designing a website so as to make privacy less salient to users (e.g. by tucking away the link to the site's privacy policy somewhere not at all obvious) is one of the most commonly used ways to "engineer" users' consent to the disclosure and use of their personal data.

Many of the above issues fall into the "we can so we will" category. E.g. if sites were banned from asking for "too much" information, on pain of penalties that hit them where it hurts, then they would be more restrained; but absent legal or regulatory requirements which are adequately formulated, policed and enforced, it simply makes commercial sense for enterprises to gather as much information about consumers as they can (and similarly for governments, in terms of crime and national security). So there is a case for legislative intervention in relation to "consent".

Also relevant, of course, are usability and the technical inability of most individuals to take the steps necessary to secure privacy or anonymity with the appropriate tools or techniques.

I believe all these issues can only be dealt with properly by requiring more privacy-friendly design of systems, sites and interfaces (i.e. privacy-preserving default settings, since people tend to stick with defaults), as well as user education - all of which has to be kept under ongoing review as new potential threats to privacy arise (e.g. facial recognition) through developments in technology and computer processing power and speed, and the discovery or improvement of processing algorithms.
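As a trivial sketch of privacy-preserving defaults - a new account starts in its most protective state, and any sharing is an explicit, informed opt-in (the setting names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class ProfileSettings:
    profile_visible_to_public: bool = False
    location_sharing: bool = False
    searchable_by_email: bool = False
    data_used_for_ads: bool = False

settings = ProfileSettings()       # a new user starts fully private by default
settings.location_sharing = True   # sharing only happens as a deliberate opt-in
```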

12. Secure and safe delegation of a user's rights

As a practical matter, delegation (a user allowing someone else to exercise their rights e.g. to access data) needs to be properly supported and addressed to deal with the possibility of untrustworthy delegates etc. Possibly even pseudonymous delegation.
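A sketch of what constrained delegation might look like - the delegate receives a narrow, expiring grant over specific data rather than the user's own credentials (the scope strings are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DelegationGrant:
    delegator: str
    delegate: str
    scope: frozenset          # e.g. frozenset({"read:medical_summary"})
    expires: datetime

    def permits(self, actor: str, action: str) -> bool:
        return (actor == self.delegate
                and action in self.scope
                and datetime.now(timezone.utc) < self.expires)

grant = DelegationGrant("alice", "carer_bob",
                        frozenset({"read:medical_summary"}),
                        datetime.now(timezone.utc) + timedelta(days=30))
assert grant.permits("carer_bob", "read:medical_summary")   # within scope
assert not grant.permits("carer_bob", "read:full_record")   # out of scope
```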

©WH. This work is licensed under a Creative Commons Attribution Non-Commercial Share-Alike England 2.0 Licence. Please attribute to WH, Tech and Law, and link to the original blog post page. Moral rights asserted.