According to the Georgetown Law Technology Review, “Data re-identification occurs when personally identifying information is discoverable in scrubbed or so-called ‘anonymized’ data. When a scrubbed data set is re-identified, either direct or indirect identifiers become known and the individual can be identified. Direct identifiers reveal the real identity of the person involved, while the indirect identifiers will often provide more information about the person’s preferences and habits. Scrubbed data can be re-identified through three methods: insufficient de-identification, pseudonym reversal, or combining datasets. These techniques are not mutually exclusive; all three can be used in tandem to re-identify scrubbed data.”
The Georgetown Law Technology Review further explains, “Insufficient de-identification occurs when a direct or indirect identifier inadvertently remains in a data set that is made available to the public. Both structured and unstructured data are prone to re-identification, as inadvertently leaving direct or indirect identifiers can lead to the discovery of a person’s identity. Structured data are those that organize the information into tables with identified values. Tables 1-3 [in the original article] are structured data, as the columns containing the name, date, zip code, etc. are clearly identified. Unstructured data is basically everything else—it is usually plain text and can be much more variable. Internet searches, doctors’ notes, and voice commands are all unstructured data.”
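To make the distinction concrete, here is a minimal sketch (the records and field names are invented for illustration) contrasting a structured record, where every value sits in a named column, with an unstructured free-text note in which identifiers can hide anywhere.

```python
# Illustrative sketch: structured vs. unstructured data (all values invented).

# Structured data: values organized into clearly identified fields/columns.
structured_record = {
    "name": "Jane Doe",
    "date_of_birth": "1984-03-12",
    "zip_code": "30301",
    "diagnosis_code": "E11.9",
}

# Unstructured data: free text with no fixed schema. Identifiers can appear
# anywhere in the prose, which makes scrubbing much harder.
unstructured_note = (
    "Patient seen today for follow-up. Jane mentioned she recently moved "
    "back to the 30301 area and works night shifts at the regional hospital."
)

print(sorted(structured_record.keys()))  # fields are explicit and enumerable
print("30301" in unstructured_note)      # identifiers may lurk anywhere in the text
```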
Organizations often employ basic de-identification techniques that remove only the most obvious identifiers, such as names and medical record numbers. This approach frequently leaves behind data elements that can serve as quasi-identifiers: for example, rare medical conditions, unique treatment patterns, or specific demographic combinations.
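As a minimal illustration of how this goes wrong (the field names and values below are hypothetical), removing only the obvious direct identifiers can still leave a combination of quasi-identifiers that points to a single person.

```python
# Hypothetical example: naive scrubbing that drops only obvious direct identifiers.
DIRECT_IDENTIFIERS = {"name", "medical_record_number", "ssn"}

record = {
    "name": "Jane Doe",
    "medical_record_number": "MRN-004217",
    "ssn": "123-45-6789",
    "zip_code": "30301",                     # quasi-identifier
    "date_of_birth": "1984-03-12",           # quasi-identifier
    "diagnosis": "Erdheim-Chester disease",  # rare condition -> quasi-identifier
}

scrubbed = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
print(scrubbed)
# The remaining zip code + birth date + rare diagnosis can still single out
# one individual, even though every "obvious" identifier is gone.
```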
Some healthcare systems replace direct identifiers with pseudonyms or codes rather than removing them entirely. While this approach preserves data linkability for research purposes, it creates a vulnerability if the mapping between pseudonyms and real identities is compromised.
“Pseudonyms are only an effective scrubbing mechanism if they cannot be reversed. There are several ways pseudonymization can be defeated. Some pseudonyms are designed to be reversible and a ‘key’ is kept to reverse the process; however, this precludes their security function. Secondly, the longer the same pseudonym is used for a specific individual, the less secure and easier it is to re-identify that individual. Thirdly, if the method used to assign pseudonyms is discovered or becomes known, the data can be re-identified,” explains the Review.
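The following sketch illustrates the first failure mode the Review describes, assuming a keyed HMAC over a hypothetical record identifier and a stored reversal table; anyone who obtains the key or the mapping can undo the pseudonymization.

```python
import hmac
import hashlib

# Hypothetical secret key: if it leaks, pseudonyms can simply be recomputed.
SECRET_KEY = b"rotate-and-protect-me"

def pseudonymize(patient_id: str) -> str:
    """Derive a stable pseudonym from a patient ID using a keyed HMAC."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:12]

# Mapping kept so pseudonyms can be reversed for authorized linkage.
# It is also the single point of failure: whoever obtains it re-identifies everyone.
pseudonym_to_id = {}

for pid in ["MRN-004217", "MRN-009931"]:
    pseudonym_to_id[pseudonymize(pid)] = pid

token = pseudonymize("MRN-004217")
print(token)                   # what appears in the "de-identified" dataset
print(pseudonym_to_id[token])  # trivial reversal once the mapping is exposed
```

Because the same pseudonym is reused for the same patient across releases, it also accumulates linkable context over time, which is the second weakness the Review notes.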
Lastly, the Review explains that “The most powerful tool for re-identifying scrubbed data is combining two datasets that contain the same individual(s) in both sets. When two or more anonymized datasets are linked together, they can then be used to unlock other anonymized datasets. Once one piece of data is linked to a person’s real identity, that data can then be used to destroy the anonymity of any virtual identity with which that data is associated. The ability to link even supposedly innocuous data exposes people to potential harm because of this.”
Social media, public records, data breaches, and commercial databases all provide reference points that can be matched against supposedly anonymous health data.
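A linkage attack of this kind is mechanically simple. The sketch below, using invented records and field names, joins an "anonymized" health dataset to a public reference list on shared quasi-identifiers such as zip code, date of birth, and sex.

```python
# Hypothetical linkage attack: join two datasets on shared quasi-identifiers.
health_data = [  # "anonymized" release: no names, but quasi-identifiers remain
    {"zip": "30301", "dob": "1984-03-12", "sex": "F", "diagnosis": "E11.9"},
    {"zip": "10118", "dob": "1990-07-04", "sex": "M", "diagnosis": "J45.909"},
]

voter_roll = [  # public reference data with real identities
    {"name": "Jane Doe", "zip": "30301", "dob": "1984-03-12", "sex": "F"},
    {"name": "John Roe", "zip": "10118", "dob": "1990-07-04", "sex": "M"},
]

def link(quasi_keys, anonymized, reference):
    """Yield matches whose quasi-identifier values agree in both datasets."""
    index = {tuple(r[k] for k in quasi_keys): r for r in reference}
    for row in anonymized:
        match = index.get(tuple(row[k] for k in quasi_keys))
        if match:
            yield match["name"], row["diagnosis"]

for name, diagnosis in link(("zip", "dob", "sex"), health_data, voter_roll):
    print(name, diagnosis)  # a real identity is now attached to the "anonymous" record
```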
Several factors have intensified re-identification risks, and documented cases show just how real those risks have become.
In their study, “Estimating the success of re-identifications in incomplete datasets using generative models,” Luc Rocher, Julien M. Hendrickx, and Yves-Alexandre de Montjoye document several cases in which supposedly anonymized data was re-identified: “In 2016, journalists re-identified politicians in an anonymized browsing history dataset of 3 million German citizens, uncovering their medical information and their sexual preferences. A few months before, the Australian Department of Health publicly released de-identified medical records for 10% of the population only for researchers to re-identify them 6 weeks later. Before that, studies had shown that de-identified hospital discharge data could be re-identified using basic demographic attributes and that diagnostic codes, year of birth, gender, and ethnicity could uniquely identify patients in genomic studies data. Finally, researchers were able to uniquely identify individuals in anonymized taxi trajectories in NYC.”
The implications of re-identification can be serious for the individuals involved.
The paper “The risk of re-identification versus the need to identify individuals in rare disease research,” available through the NIH, states, “Re-identification may potentially bring harmful consequences for the individual, for example, related to insurance and discrimination.”
It further notes that “[While] searching through 1522 reports of demonstrated re-identifications, El Emam concluded that the overall success rate for all re-identification attacks was approximately 26%, and 34% for health data. However, the confidence interval around these estimates was large, partly because many of the attacks were on small databases. In addition, not all of these examples were using current standards for de-identification, such as the USA Safe Harbor standard or the statistical standard specified in the HIPAA Privacy Rule.”
Healthcare organizations can minimize re-identification risks, but doing so starts with a clear grasp of the key concepts involved:
Direct identifiers directly reveal a person’s identity, while indirect identifiers offer clues about a person’s preferences or habits.
Insufficient de-identification happens when identifiers remain in a dataset, allowing for potential re-identification of individuals.
Quasi-identifiers are pieces of information that, on their own, may not identify an individual but can potentially be used in combination with other data to re-identify a person.
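One practical way to gauge this risk before releasing data, sketched below under assumed field names, is to count how many records share each combination of quasi-identifiers; any combination that appears only once marks a record that could be singled out (this is the idea behind k-anonymity).

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "dob", "sex")  # assumed field names for illustration

records = [
    {"zip": "30301", "dob": "1984-03-12", "sex": "F"},
    {"zip": "10118", "dob": "1990-07-04", "sex": "M"},
    {"zip": "10118", "dob": "1990-07-04", "sex": "M"},
]

# Count how many records share each quasi-identifier combination.
counts = Counter(tuple(r[k] for k in QUASI_IDENTIFIERS) for r in records)

# k-anonymity of the dataset: size of the smallest quasi-identifier group.
k = min(counts.values())
unique_rows = [combo for combo, n in counts.items() if n == 1]

print(f"k = {k}")   # k = 1 means at least one person is uniquely exposed
print(unique_rows)  # the quasi-identifier combinations that single someone out
```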