According to the Georgetown Law Technology Review, “Data re-identification occurs when personally identifying information is discoverable in scrubbed or so-called ‘anonymized’ data. When a scrubbed data set is re-identified, either direct or indirect identifiers become known and the individual can be identified. Direct identifiers reveal the real identity of the person involved, while the indirect identifiers will often provide more information about the person’s preferences and habits. Scrubbed data can be re-identified through three methods: insufficient de-identification, pseudonym reversal, or combining datasets. These techniques are not mutually exclusive; all three can be used in tandem to re-identify scrubbed data.”
The Georgetown Law Technology Review further explains, “Insufficient de-identification occurs when a direct or indirect identifier inadvertently remains in a data set that is made available to the public. Both structured and unstructured data are prone to re-identification, as inadvertently leaving direct or indirect identifiers can lead to the discovery of a person’s identity. Structured data are those that organize the information into tables with identified values. Tables 1-3 [in the original article] are structured data, as the columns containing the name, date, zip code, etc. are clearly identified. Unstructured data is basically everything else—it is usually plain text and can be much more variable. Internet searches, doctors’ notes, and voice commands are all unstructured data.”
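To make the distinction concrete, here is a minimal sketch (the records and field names are invented for illustration) contrasting a structured record, where every value sits in a named column, with an unstructured free-text note in which identifiers can hide anywhere.

```python
# Illustrative sketch: structured vs. unstructured data (all values invented).

# Structured data: values organized into clearly identified fields/columns.
structured_record = {
    "name": "Jane Doe",
    "date_of_birth": "1984-03-12",
    "zip_code": "30301",
    "diagnosis_code": "E11.9",
}

# Unstructured data: free text with no fixed schema. Identifiers can appear
# anywhere in the prose, which makes scrubbing much harder.
unstructured_note = (
    "Patient seen today for follow-up. Jane mentioned she recently moved "
    "back to the 30301 area and works night shifts at the regional hospital."
)

print(sorted(structured_record.keys()))  # fields are explicit and enumerable
print("30301" in unstructured_note)      # identifiers may lurk anywhere in the text
```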
Organizations often employ basic de-identification techniques that remove only the most obvious identifiers, such as names and medical record numbers. This approach frequently leaves behind data elements that can serve as quasi-identifiers: for example, rare medical conditions, unique treatment patterns, or specific demographic combinations.
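As a minimal illustration of how this goes wrong (the field names and values below are hypothetical), removing only the obvious direct identifiers can still leave a combination of quasi-identifiers that points to a single person.

```python
# Hypothetical example: naive scrubbing that drops only obvious direct identifiers.
DIRECT_IDENTIFIERS = {"name", "medical_record_number", "ssn"}

record = {
    "name": "Jane Doe",
    "medical_record_number": "MRN-004217",
    "ssn": "123-45-6789",
    "zip_code": "30301",                     # quasi-identifier
    "date_of_birth": "1984-03-12",           # quasi-identifier
    "diagnosis": "Erdheim-Chester disease",  # rare condition -> quasi-identifier
}

scrubbed = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
print(scrubbed)
# The remaining zip code + birth date + rare diagnosis can still single out
# one individual, even though every "obvious" identifier is gone.
```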
Some healthcare systems replace direct identifiers with pseudonyms or codes rather than removing them entirely. While this approach preserves data linkability for research purposes, it creates a vulnerability if the mapping between pseudonyms and real identities is compromised.
“Pseudonyms are only an effective scrubbing mechanism if they cannot be reversed. There are several ways pseudonymization can be defeated. Some pseudonyms are designed to be reversible and a ‘key’ is kept to reverse the process; however, this precludes their security function. Secondly, the longer the same pseudonym is used for a specific individual, the less secure and easier it is to re-identify that individual. Thirdly, if the method used to assign pseudonyms is discovered or becomes known, the data can be re-identified,” explains the Review.
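The following sketch illustrates the first failure mode the Review describes, assuming a keyed HMAC over a hypothetical record identifier and a stored reversal table; anyone who obtains the key or the mapping can undo the pseudonymization.

```python
import hmac
import hashlib

# Hypothetical secret key: if it leaks, pseudonyms can simply be recomputed.
SECRET_KEY = b"rotate-and-protect-me"

def pseudonymize(patient_id: str) -> str:
    """Derive a stable pseudonym from a patient ID using a keyed HMAC."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:12]

# Mapping kept so pseudonyms can be reversed for authorized linkage.
# It is also the single point of failure: whoever obtains it re-identifies everyone.
pseudonym_to_id = {}

for pid in ["MRN-004217", "MRN-009931"]:
    pseudonym_to_id[pseudonymize(pid)] = pid

token = pseudonymize("MRN-004217")
print(token)                   # what appears in the "de-identified" dataset
print(pseudonym_to_id[token])  # trivial reversal once the mapping is exposed
```

Because the same pseudonym is reused for the same patient across releases, it also accumulates linkable context over time, which is the second weakness the Review notes.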
Lastly, the Review explains that “The most powerful tool for re-identifying scrubbed data is combining two datasets that contain the same individual(s) in both sets. When two or more anonymized datasets are linked together, they can then be used to unlock other anonymized datasets. Once one piece of data is linked to a person’s real identity, that data can then be used to destroy the anonymity of any virtual identity with which that data is associated. The ability to link even supposedly innocuous data exposes people to potential harm because of this.”
Social media, public records, data breaches, and commercial databases all provide reference points that can be matched against supposedly anonymous health data.
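A linkage attack of this kind is mechanically simple. The sketch below, using invented records and field names, joins an "anonymized" health dataset to a public reference list on shared quasi-identifiers such as zip code, date of birth, and sex.

```python
# Hypothetical linkage attack: join two datasets on shared quasi-identifiers.
health_data = [  # "anonymized" release: no names, but quasi-identifiers remain
    {"zip": "30301", "dob": "1984-03-12", "sex": "F", "diagnosis": "E11.9"},
    {"zip": "10118", "dob": "1990-07-04", "sex": "M", "diagnosis": "J45.909"},
]

voter_roll = [  # public reference data with real identities
    {"name": "Jane Doe", "zip": "30301", "dob": "1984-03-12", "sex": "F"},
    {"name": "John Roe", "zip": "10118", "dob": "1990-07-04", "sex": "M"},
]

def link(quasi_keys, anonymized, reference):
    """Yield matches whose quasi-identifier values agree in both datasets."""
    index = {tuple(r[k] for k in quasi_keys): r for r in reference}
    for row in anonymized:
        match = index.get(tuple(row[k] for k in quasi_keys))
        if match:
            yield match["name"], row["diagnosis"]

for name, diagnosis in link(("zip", "dob", "sex"), health_data, voter_roll):
    print(name, diagnosis)  # a real identity is now attached to the "anonymous" record
```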
Several factors have intensified re-identification risks, and documented cases show just how real those risks have become.
In their study, “Estimating the success of re-identifications in incomplete datasets using generative models,” Luc Rocher, Julien M. Hendrickx, and Yves-Alexandre de Montjoye document several cases in which supposedly anonymized data was re-identified: “In 2016, journalists re-identified politicians in an anonymized browsing history dataset of 3 million German citizens, uncovering their medical information and their sexual preferences. A few months before, the Australian Department of Health publicly released de-identified medical records for 10% of the population only for researchers to re-identify them 6 weeks later. Before that, studies had shown that de-identified hospital discharge data could be re-identified using basic demographic attributes and that diagnostic codes, year of birth, gender, and ethnicity could uniquely identify patients in genomic studies data. Finally, researchers were able to uniquely identify individuals in anonymized taxi trajectories in NYC.”
The implications of re-identification can be serious for the individuals involved.
The paper “The risk of re-identification versus the need to identify individuals in rare disease research,” available through the NIH, states, “Re-identification may potentially bring harmful consequences for the individual, for example, related to insurance and discrimination.”
It further notes that “[While] searching through 1522 reports of demonstrated re-identifications, El Emam concluded that the overall success rate for all re-identification attacks was approximately 26%, and 34% for health data. However, the confidence interval around these estimates was large, partly because many of the attacks were on small databases. In addition, not all of these examples were using current standards for de-identification, such as the USA Safe Harbor standard or the statistical standard specified in the HIPAA Privacy Rule.”
Healthcare organizations can minimize re-identification risks, but doing so starts with a clear grasp of the key concepts involved:
Direct identifiers directly reveal a person’s identity, while indirect identifiers offer clues about a person’s preferences or habits.
Insufficient de-identification happens when identifiers remain in a dataset, allowing for potential re-identification of individuals.
Quasi-identifiers are pieces of information that, on their own, may not identify an individual but can potentially be used in combination with other data to re-identify a person.
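One practical way to gauge this risk before releasing data, sketched below under assumed field names, is to count how many records share each combination of quasi-identifiers; any combination that appears only once marks a record that could be singled out (this is the idea behind k-anonymity).

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "dob", "sex")  # assumed field names for illustration

records = [
    {"zip": "30301", "dob": "1984-03-12", "sex": "F"},
    {"zip": "10118", "dob": "1990-07-04", "sex": "M"},
    {"zip": "10118", "dob": "1990-07-04", "sex": "M"},
]

# Count how many records share each quasi-identifier combination.
counts = Counter(tuple(r[k] for k in QUASI_IDENTIFIERS) for r in records)

# k-anonymity of the dataset: size of the smallest quasi-identifier group.
k = min(counts.values())
unique_rows = [combo for combo, n in counts.items() if n == 1]

print(f"k = {k}")   # k = 1 means at least one person is uniquely exposed
print(unique_rows)  # the quasi-identifier combinations that single someone out
```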