Friday, 9 April 2010

Privacy & location info - re-identification risks, especially with health data

A new Canadian research study, A Method for Managing Re-identification Risk from Small Geographic Areas in Canada (full text), by Khaled El Emam, Ann Brown, Philip AbdelMalik, Angelica Neisa, Mark Walker, Jim Bottomley and Tyson Roffey, published in the journal BMC Medical Informatics and Decision Making, measures how easily the identity of individuals can be determined from their geographical information (de-anonymization), and suggests a method for reducing the re-identification of individuals from anonymised datasets.

It notes the increasing collection by websites and other services of location information, "such as where we live, where the clinic we visited is located, and where we work", and the resulting privacy concerns when location is coupled with basic demographics and sensitive health information. Individuals living in small areas tend to be more easily identifiable because they are more likely to be unique on their local demographics.

From the press release (NB in Word format!):

"Prof. Khaled El Emam, Canada Research Chair in Electronic Health Information and lead author, explains that they have developed a new method for measuring the privacy risk for Canadians, in particular, those living in small geographic areas. This privacy risk measure can then be used to decide whether it is appropriate to release/share geographic information or not and what demographics to include with this geographic information. The article also presents a set of criteria and checklists for managing the privacy risks when releasing/sharing location information.

“What we have developed is an overall risk management approach to decide how best to protect people’s privacy by taking into account their locations, the sensitivity of the data, and who they are sharing the data with,” explains Dr. El Emam.

This study shows that by protecting only the individuals living in small geographic areas, as defined by the new measures, it is possible to share more information while still being able to manage privacy risks."

From the abstract:

"A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones. A recent study provided a method for deciding when an area is too small based on the uniqueness criterion. The uniqueness criterion stipulates that an area is no longer too small when the proportion of unique individuals on the relevant variables (the quasi-identifiers) approaches zero. However, using a uniqueness value of zero is quite a stringent threshold, and is only suitable when the risks from data disclosure are quite high. Other uniqueness thresholds that have been proposed for health data are 5% and 20%…

We have also included concrete guidance for data custodians in deciding which one of the three uniqueness thresholds to use (0%, 5%, 20%), depending on the mitigating controls that the data recipients have in place, the potential invasion of privacy if the data is disclosed, and the motives and capacity of the data recipient to re-identify the data…

The models we developed can be used to manage the re-identification risk from small geographic areas. Being able to choose among three possible thresholds, a data custodian can adjust the definition of "small geographic area" to the nature of the data and recipient."
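To make the uniqueness criterion concrete, here is a minimal sketch (my own illustration, not the authors' implementation or data) of how one might measure the proportion of individuals in an area who are unique on their quasi-identifiers, and compare it against one of the three thresholds (0%, 5%, 20%):

```python
# Hypothetical sketch of the uniqueness criterion described in the abstract.
# All field names, records and thresholds here are illustrative, not from the paper.
from collections import Counter

def uniqueness_proportion(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records) if records else 0.0

def area_too_small(records, quasi_identifiers, threshold=0.05):
    """True if the area's uniqueness exceeds the chosen threshold."""
    return uniqueness_proportion(records, quasi_identifiers) > threshold

# Illustrative records for one small area:
area = [
    {"age": 34, "sex": "F"},
    {"age": 34, "sex": "F"},
    {"age": 71, "sex": "M"},  # unique combination
    {"age": 28, "sex": "M"},  # unique combination
]
print(uniqueness_proportion(area, ["age", "sex"]))           # 0.5
print(area_too_small(area, ["age", "sex"], threshold=0.20))  # True
```

Under the stricter 0% threshold this area would fail as soon as any record were unique; under a 20% threshold it fails here because half the records are unique.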

An interesting approach. Of course, the issue of when an "area" is "too small", thus enabling de-anonymization, applies more widely than physical geographic location alone - especially when different types of data can be combined and linked.
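The combine-and-link risk can be illustrated with a classic linkage attack: joining an "anonymised" dataset to a public one on shared quasi-identifiers. This is my own invented sketch with made-up data, not an example from the study:

```python
# Hypothetical linkage attack: an "anonymised" health dataset (names removed)
# joined to a public dataset on the quasi-identifiers both share.
# All names and records are invented for illustration.

health = [
    {"age": 71, "sex": "M", "postcode": "K1A", "diagnosis": "diabetes"},
]
voter_roll = [
    {"name": "J. Doe", "age": 71, "sex": "M", "postcode": "K1A"},
]

def link(anon, public, keys):
    """Re-identify anonymised records that match exactly one public record."""
    matches = []
    for a in anon:
        hits = [p for p in public if all(p[k] == a[k] for k in keys)]
        if len(hits) == 1:  # a unique match is a likely re-identification
            matches.append({**a, "name": hits[0]["name"]})
    return matches

print(link(health, voter_roll, ["age", "sex", "postcode"]))
# [{'age': 71, 'sex': 'M', 'postcode': 'K1A', 'diagnosis': 'diabetes', 'name': 'J. Doe'}]
```

The record is re-identified precisely because it is unique on age, sex and postcode in both datasets - which is why the study's thresholds target uniqueness in small areas.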

©WH. This work is licensed under a Creative Commons Attribution Non-Commercial Share-Alike England 2.0 Licence. Please attribute to WH, Tech and Law, and link to the original blog post page. Moral rights asserted.