How to de-identify protected health information for privacy
Liyanda Tembani July 26, 2024
Protected health information (PHI) holds sensitive details about individuals' medical conditions, treatments, and histories. HIPAA outlines two methods to de-identify PHI for privacy: the Expert Determination Method and the Safe Harbor Method. The Expert Determination Method requires a qualified expert to use statistical and scientific principles to determine that the risk of re-identification is very small. The Safe Harbor Method removes the 18 specific identifiers, such as names, geographic details, dates, and other unique characteristics, ensuring the information cannot be traced back to an individual.
Understanding de-identification
According to the HHS, "The process of de-identification, by which identifiers are removed from the health information, mitigates privacy risks to individuals and thereby supports the secondary use of data for comparative effectiveness studies, policy assessment, life sciences research, and other endeavors." De-identification minimizes the risk of re-identification, so that PHI cannot readily be linked back to specific individuals. The process follows the guidelines set forth by the HIPAA Privacy Rule.
Identifying PHI elements
Before diving into de-identification methods, you must understand the elements that make PHI identifiable. These include names, addresses, dates directly related to an individual (except the year), Social Security numbers, phone numbers, email addresses, and other unique identifying information. These identifiers must be carefully handled during the de-identification process.
Related: What are the 18 PHI identifiers?
De-identification methods
1. Expert determination method
This involves an expert in statistical and scientific methods evaluating data to determine that the risk of re-identification is minimal. This expert assesses various factors and applies rigorous techniques to ensure the de-identified data cannot be linked back to individuals.
- Statistical techniques: Statistical methods such as k-anonymity, differential privacy, and generalization can be employed to reduce the risk of re-identification. K-anonymity ensures that each record in the de-identified dataset is indistinguishable from at least k-1 other records. Differential privacy adds noise to the data to protect individual privacy while preserving the overall statistical properties of the dataset. Generalization involves replacing specific values with broader categories (e.g., replacing exact age with age ranges) to decrease the risk of identification.
- Contextual evaluation: The expert considers the data's context and the intended use of the de-identified information. They evaluate factors such as the size and uniqueness of the dataset, the presence of indirect identifiers, and the available external data sources to assess the re-identification risk accurately.
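The generalization and k-anonymity ideas described above can be sketched in a few lines of Python. This is a minimal illustration, not an expert determination: the sample records, the field names, and the choice of `age` and `state` as quasi-identifiers are all assumptions made for the example.

```python
from collections import Counter


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for every quasi-identifier."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical sample data, invented for illustration only.
records = [
    {"age": 31, "state": "CA", "diagnosis": "asthma"},
    {"age": 37, "state": "CA", "diagnosis": "diabetes"},
    {"age": 42, "state": "NY", "diagnosis": "asthma"},
    {"age": 48, "state": "NY", "diagnosis": "hypertension"},
]

# With exact ages, every record is unique on (age, state).
raw_k = k_anonymity(records, ["age", "state"])

# After generalizing age into 10-year ranges, records fall into larger groups.
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]
generalized_k = k_anonymity(generalized, ["age", "state"])

print(raw_k, generalized_k)
```

In this toy dataset, generalizing age raises k from 1 to 2. A real assessment would also weigh the contextual factors listed above, such as external data sources that could be joined against the quasi-identifiers.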
2. Safe harbor method
The safe harbor method focuses on the removal of specified identifiers from the data. Removing names, addresses, and other identifying elements substantially reduces the risk of re-identification. This method follows a predetermined set of 18 identifiers that must be removed; the resulting de-identified data may still carry some residual privacy risk.
- Specified identifiers: The safe harbor method identifies specific elements that must be removed, such as names, geographic subdivisions smaller than a state, all elements of dates directly related to an individual (except the year), telephone and fax numbers, email addresses, Social Security numbers, medical record numbers, and more. By eliminating these identifiers, the risk of re-identification is reduced.
- No actual knowledge or reasonable basis: In addition to removing specified identifiers, you must ensure that there is no actual knowledge or reasonable basis to believe that the remaining information can identify an individual. Consider the potential for indirect identification, where combinations of remaining data could lead to re-identification.
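As a rough sketch of the removal step, the snippet below drops a handful of identifier fields from a record and reduces dates to the year. The field names are hypothetical, the set covers only a subset of the 18 identifiers, and special cases in the rule are not handled, so treat it as an illustration of the idea rather than a compliant implementation.

```python
from datetime import date

# Hypothetical field names covering only SOME of the 18 identifiers.
IDENTIFIER_FIELDS = {
    "name", "street_address", "city", "zip", "phone", "fax",
    "email", "ssn", "medical_record_number",
}
# Hypothetical date fields, assumed to hold ISO strings (YYYY-MM-DD).
DATE_FIELDS = {"birth_date", "admission_date"}


def safe_harbor_sketch(record: dict) -> dict:
    """Drop listed identifier fields and keep only the year of listed dates."""
    out = {}
    for key, value in record.items():
        if key in IDENTIFIER_FIELDS:
            continue  # remove the identifier entirely
        if key in DATE_FIELDS:
            # Keep the year only, under a renamed key such as 'birth_year'.
            out[key.replace("_date", "_year")] = date.fromisoformat(value).year
            continue
        out[key] = value
    return out


record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "birth_date": "1980-05-17",
    "state": "CA",
    "diagnosis": "asthma",
}
result = safe_harbor_sketch(record)
print(result)
```

Field-level removal does not address free-text notes or combinations of remaining values, which is why the "no actual knowledge" check above still applies after the identifiers are gone.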
Best practices for de-identifying PHI
1. Data anonymization and pseudonymization
Anonymizing or pseudonymizing data involves removing or replacing direct and indirect identifiers to protect privacy while maintaining the utility of the data. These techniques can be applied to minimize re-identification risks:
- Generalization involves replacing specific values with broader categories.
- Suppression removes certain data points entirely.
- Data perturbation adds random noise to the data, making it difficult to identify individuals.
- Tokenization replaces identifying information with unique tokens, preserving data structure while removing direct identifiers.
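Suppression, perturbation, and tokenization from the list above might look like the following in Python. This is a minimal sketch under stated assumptions: the key, field names, and noise scale are hypothetical, the tokens are produced with a keyed HMAC (one common way to get stable tokens, not the only one), and the Gaussian noise shown here is plain perturbation, not a differential privacy mechanism.

```python
import hashlib
import hmac
import random

# Hypothetical key; in practice it would be stored and managed securely.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a deterministic keyed token (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability in this example


def suppress(record: dict, fields: set) -> dict:
    """Remove the named fields from a record entirely."""
    return {k: v for k, v in record.items() if k not in fields}


def perturb(value: float, scale: float, rng: random.Random) -> float:
    """Add zero-mean Gaussian noise with standard deviation `scale`."""
    return value + rng.gauss(0.0, scale)


rng = random.Random(0)  # seeded only so the example is repeatable
record = {"mrn": "A12345", "email": "pat@example.com", "weight_kg": 70.0}

deidentified = suppress(record, {"email"})
deidentified["mrn"] = tokenize(deidentified["mrn"])
deidentified["weight_kg"] = perturb(deidentified["weight_kg"], 0.5, rng)
print(deidentified)
```

Because the token is deterministic for a given key, the same medical record number maps to the same token across records, which preserves the ability to link a patient's rows without exposing the original value.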
2. Data integrity and utility
De-identified data should still maintain its integrity and be useful for the intended purposes. You must balance privacy protection and data utility to ensure the de-identified data remains valuable for research, analysis, and other applications. Organizations should assess the potential impact of de-identification techniques on the data's quality, accuracy, and usefulness and strive to maintain the data's integrity throughout the process.
3. Safeguards and controls
Implementing appropriate safeguards and controls to protect de-identified data includes stringent data access restrictions, encryption of sensitive information, and comprehensive security measures. Access controls should be in place to limit data access to authorized personnel only, and encryption should be applied to protect data both at rest and in transit. Conduct regular security audits and assessments to identify and address any vulnerabilities.
4. Data governance and documentation
Establishing robust data governance practices is crucial for effective de-identification. Organizations should document their de-identification processes, including the methods employed, the rationale behind decisions, and any data transformations applied. Maintaining detailed documentation allows for transparency, accountability, and reproducibility of the de-identification process. It also assists in compliance with regulatory requirements and provides a reference for future data usage.
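One lightweight way to capture that documentation is a structured log entry per de-identification run. The sketch below is an assumption about what such a record could contain; the class name, fields, and example values are hypothetical and not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DeidentificationLogEntry:
    """Hypothetical record of one de-identification run."""
    dataset: str
    method: str
    rationale: str
    transformations: list = field(default_factory=list)
    performed_on: str = field(default_factory=lambda: date.today().isoformat())


entry = DeidentificationLogEntry(
    dataset="clinic_visits_2024",  # invented dataset name
    method="safe_harbor",
    rationale="Data shared with an external research partner.",
    transformations=["dropped name, email, ssn", "reduced dates to year"],
)

# Serialize to one JSON line so entries can be appended to an audit log.
log_line = json.dumps(asdict(entry), sort_keys=True)
print(log_line)
```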
Related: De-identification: its value to businesses and how to do it right
De-identification minimizes risk
De-identifying PHI is a powerful tool to minimize the risk of re-identification and ensure privacy in healthcare data. Additionally, use HIPAA compliant email that encrypts all data by default when sharing PHI. It's a standard best practice and helps maintain HIPAA compliance, even if some information is identifiable.
FAQs
What is pseudonymization, and how is it different from anonymization?
Pseudonymization replaces identifiable information with pseudonyms, which can be re-linked to the original data, whereas anonymization removes all identifiable information permanently.
How does de-identification support data sharing for public health purposes?
De-identification allows health data to be shared for public health research and policy assessments without compromising individual privacy, facilitating valuable insights and advancements.
What is the role of encryption in the de-identification process?
While encryption is not a de-identification method, it protects data during transmission and storage, ensuring that even if data is intercepted, it remains unreadable.