Monday, 1 February 2010

Anonymising patient health data for publication - guidelines proposed

Minimum standards for de-identifying or anonymising datasets, intended to preserve patient privacy while still allowing personal data from clinical trials to be shared, have been suggested by Iain Hrynaszkiewicz and colleagues in an article in the British Medical Journal.

Their guidelines may be of general interest, although the proposals were made specifically for sharing data with other medical researchers or publishing biomedical research articles in peer reviewed journals, at least in the context of quantitative research data, i.e. most observational studies and randomised controlled trials. (Many journals, and indeed funders of medical research, require authors to be prepared to share their raw, unprocessed data with other scientists, or to state the availability of unpublished raw data in published articles.)

Given the paucity of information on how raw data should be prepared for publication or sharing and the apparent lack of agreed definitions of what is anonymised patient information, their article Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers (provisional PDF) aims to provide "practical advice on data sharing while minimising risks to patient privacy":

"Basic advice on file preparation is provided along with procedural guidance on prospective and retrospective publication of raw data, with an emphasis on randomised controlled trials."

From the summary:

"Consent for publication of appropriately anonymised raw data should ideally be sought from participants in clinical research.

Direct identifiers such as patients’ names should be removed from datasets; datasets that contain three or more indirect identifiers, such as age or sex, should be reviewed by an independent researcher or ethics committee before being submitted for publication."

There are also recommendations on what statements the article should make about patient consent.


The guidance lists 28 items of personal and clinical information that can make patients identifiable (formulated from information aggregated from policy documents and research guidance from major UK and US funding agencies, governmental health departments and statutes, and three internationally recognised publication ethics resources for editors of biomedical journals).

It recommends removing direct identifiers from datasets before publication, e.g. patients' names and addresses.

It also recommends that, unless patients have consented to the sharing of their data, datasets containing 3 or more indirect identifiers, e.g. age or sex, should be reviewed by an independent researcher or ethics committee to assess any risks to privacy before submitting the data for publication - and if the review finds privacy risks, alternatives to fully open access data sharing must be considered.
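As a rough illustration of how the "three or more indirect identifiers" rule might be checked mechanically, here is a minimal Python sketch. The column names and their mapping to identifier categories are my own invention, not taken from the article, and I have assumed that two columns in the same category (e.g. age and year of birth) count only once.

```python
# Hypothetical mapping from dataset column names to indirect-identifier
# categories from the guidance's table. The column names are invented.
INDIRECT_IDENTIFIER_COLUMNS = {
    "age": "Year of birth or age",
    "year_of_birth": "Year of birth or age",
    "sex": "Sex",
    "ethnicity": "Ethnicity",
    "place_of_birth": "Place of birth",
    "occupation": "Socioeconomic data",
    "treatment_centre": "Place of treatment",
}

REVIEW_THRESHOLD = 3  # "three or more indirect identifiers"


def indirect_identifiers(columns):
    """Return the distinct indirect-identifier categories found in columns."""
    found = {
        INDIRECT_IDENTIFIER_COLUMNS[name.lower()]
        for name in columns
        if name.lower() in INDIRECT_IDENTIFIER_COLUMNS
    }
    return sorted(found)


def needs_independent_review(columns, consent_to_sharing=False):
    """True if the dataset should go to an independent reviewer or ethics
    committee before submission (assumption: consent to sharing waives this)."""
    if consent_to_sharing:
        return False
    return len(indirect_identifiers(columns)) >= REVIEW_THRESHOLD


print(needs_independent_review(["patient_ref", "age", "sex"]))
print(needs_independent_review(["patient_ref", "age", "sex", "ethnicity"]))
```

A check like this only counts columns; it says nothing about how identifying a particular combination actually is, which is what the human review is for.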

"An explicit justification for publication of a dataset with three or more indirect identifiers should be given by the researcher—as an annotation to the dataset and in any accompanying articles. This should include the name of any oversight bodies consulted."

They don't say why they drew the line at three or more indirect identifiers, but perhaps the US research on the identifiability of people from zip code, age and gender was an influence (see item 9 on anonymisation and re-identification in my Data dozen blog).

I'd be interested to know from information scientists and mathematicians whether "only 1 or 2 indirect identifiers" is enough to protect patient privacy - doesn't it depend on which identifiers, and/or the ability to link other data?
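To make the question concrete, here is a small Python sketch that counts how many records share each combination of values for a chosen set of identifiers. This is not from the article; it is the sort of group-size check used in k-anonymity-style analyses, and the records are invented.

```python
from collections import Counter


def group_sizes(records, identifiers):
    """Count records sharing each combination of values for the identifiers."""
    return Counter(tuple(r[i] for i in identifiers) for r in records)


def smallest_group_size(records, identifiers):
    """Size of the smallest group; 1 means at least one record is unique."""
    return min(group_sizes(records, identifiers).values(), default=0)


def unique_records(records, identifiers):
    """Number of records that nobody else in the dataset matches."""
    return sum(1 for n in group_sizes(records, identifiers).values() if n == 1)


# Invented example records, for illustration only.
records = [
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 34, "sex": "M", "occupation": "teacher"},
    {"age": 34, "sex": "M", "occupation": "lighthouse keeper"},
    {"age": 71, "sex": "F", "occupation": "retired"},
]

for identifiers in (["sex"], ["age", "sex"], ["occupation"]):
    print(identifiers, smallest_group_size(records, identifiers),
          unique_records(records, identifiers))
```

In this toy dataset a single identifier (occupation) already singles out two people, while sex alone singles out nobody, which is why a simple count of identifiers seems unlikely to tell the whole story.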

In relation to dates, they suggest the possibility of obfuscation:

"one could add or subtract a small, randomly chosen number of days to all dates, so that the true dates are not published. In cases where it is necessary to include dates, this fact and any supporting information should be disclosed on submission to the journal."

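The date-shifting idea quoted above could be sketched in Python roughly as follows. The article does not say whether one offset should apply to the whole dataset or to each participant; I have assumed one random offset per participant, so that intervals between a participant's own dates are preserved. The field names and the 7-day limit are my own illustrative choices.

```python
import random
from datetime import date, timedelta


def shift_dates(records, date_fields, max_days=7, id_field="participant", rng=None):
    """Return copies of records with each participant's dates shifted by a
    random non-zero number of days in the range [-max_days, max_days]."""
    rng = rng or random.Random()
    candidates = [d for d in range(-max_days, max_days + 1) if d != 0]
    offsets = {}  # participant -> offset in days; must never be published
    shifted = []
    for record in records:
        pid = record[id_field]
        if pid not in offsets:
            offsets[pid] = rng.choice(candidates)
        delta = timedelta(days=offsets[pid])
        new_record = dict(record)  # leave the original record untouched
        for field in date_fields:
            if new_record.get(field) is not None:
                new_record[field] = new_record[field] + delta
        shifted.append(new_record)
    return shifted


# Invented example data.
trial = [
    {"participant": "P1", "enrolled": date(2009, 3, 1), "followup": date(2009, 3, 15)},
    {"participant": "P2", "enrolled": date(2009, 4, 1), "followup": None},
]
for row in shift_dates(trial, ["enrolled", "followup"]):
    print(row)
```

The offsets themselves would of course need to be kept out of the published dataset, otherwise the true dates could simply be recovered.
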
Here is their table listing potential patient identifiers in datasets (the numbers in brackets refer to the information sources cited in the article):

Identifier | Information sources | Comments
Name | 8-15 |
Initials | 13 |
Address, including full or partial postal code | 8-15 |
Telephone or fax numbers or contact information | 8, 10, 12, 15 |
Electronic mail addresses | 8 |
Unique identifying numbers | 8-15 | Generalised HIPAA items 7-10, 18
Vehicle identifiers | 8 |
Medical device identifiers | 8 |
Web or internet protocol addresses | 8 |
Biometric data | 8 |
Facial photograph or comparable image | 8, 10, 11, 13 |
Audiotapes | 11 |
Names of relatives | 10 |
Dates related to an individual (including date of birth) | 8, 9, 11, 15 |
Indirect: may present a risk if present in combination with others in the list
Place of treatment or health professional responsible for care | 10, 15 | Could be inferred from investigator affiliations
Sex | 9 |
Rare disease or treatment | 10 |
Sensitive data, such as illicit drug use or "risky behaviour" | 15 |
Place of birth | 10, 15 |
Socioeconomic data, such as occupation or place of work, income, or education | 9, 10, 12, 15 | MRC requirement is for "rare" occupations only
Household and family composition | 15 |
Anthropometry measures | 15 |
Multiple pregnancies | 15 |
Ethnicity | 9 |
Small denominators (population size of <100) | 14 |
Very small numerators (event counts of <3) | 14 |
Year of birth or age | this article | Age is potentially identifying if the recruitment period is short and is fully described
Verbatim responses or transcripts | 15 |
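
The two numeric thresholds in the table (population size of <100 and event counts of <3) lend themselves to a simple automated check. Here is a minimal Python sketch, assuming the data are summarised as (label, events, population) rows; that layout and the site names are hypothetical, and I have applied the thresholds literally, so a count of zero events is flagged too.

```python
SMALL_DENOMINATOR = 100  # "population size of <100"
SMALL_NUMERATOR = 3      # "event counts of <3"


def flag_small_cells(cells):
    """Return (label, reasons) for each row that falls below either threshold.

    cells is an iterable of (label, events, population) tuples.
    """
    flagged = []
    for label, events, population in cells:
        reasons = []
        if population < SMALL_DENOMINATOR:
            reasons.append("small denominator")
        if events < SMALL_NUMERATOR:
            reasons.append("very small numerator")
        if reasons:
            flagged.append((label, reasons))
    return flagged


# Invented summary rows.
summary = [("Site A", 12, 250), ("Site B", 2, 250), ("Site C", 5, 40)]
for label, reasons in flag_small_cells(summary):
    print(label, reasons)
```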

See also the BioMed Central blog post on this, and the British Medical Journal's editorial policy on data sharing (subscription needed to view the full text). According to the press release:

"the BMJ strongly supports the view that researchers should seek informed consent to data sharing from research participants up front, at the recruitment stage. The journal will also expand its advice to authors about data sharing, and will extend its data sharing statements to include explicit information about consent."

©WH. This work is licensed under a Creative Commons Attribution Non-Commercial Share-Alike England 2.0 Licence. Please attribute to WH, Tech and Law, and link to the original blog post page. Moral rights asserted.