
Last week, we introduced the topic of de-identified data, its exceptions under various comprehensive privacy laws, and two methods for properly de-identifying data. In this week’s edition, we go into more detail on the Safe Harbor method recognized under HIPAA. We focus on this method because it appears to be the only process that can readily be applied to de-identification outside the healthcare sector.
The Safe Harbor Identifiers
45 C.F.R. § 164.514(b)(2) lists the 18 categories of identifiers that, if removed, would amount to de-identification of key data attributes. The categories are as follows: (a) names; (b) all geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census, (1) the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people, and (2) the initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000; (c) all elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, and date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older; (d) telephone numbers; (e) fax numbers; (f) electronic mail addresses; (g) social security numbers; (h) medical record numbers; (i) health plan beneficiary numbers; (j) account numbers; (k) certificate/license numbers; (l) vehicle identifiers and serial numbers, including license plate numbers; (m) device identifiers and serial numbers; (n) Web Universal Resource Locators (URLs); (o) Internet Protocol (IP) address numbers; (p) biometric identifiers, including finger and voice prints; (q) full face photographic images and any comparable images; and (r) any other unique identifying number, characteristic, or code.
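For structured fields, the generalization rules in categories (b) and (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the set of low-population 3-digit prefixes holds placeholder values that would need to be derived from current Census Bureau data before any real use.

```python
from datetime import date

# Placeholder values for illustration only. A real list of 3-digit prefixes
# covering 20,000 or fewer people must come from current Census Bureau data.
LOW_POPULATION_ZIP3 = {"036", "059", "102"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits, or "000" for low-population prefixes."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in LOW_POPULATION_ZIP3 else prefix


def generalize_date(d: date) -> int:
    """Drop month and day, keeping only the year.

    Note: for individuals over 89, the year itself would also have to be
    withheld if it reveals the age; that check is not shown here.
    """
    return d.year


def generalize_age(age: int) -> str:
    """Aggregate all ages over 89 into a single "90+" category."""
    return "90+" if age >= 90 else str(age)


print(generalize_zip("90210"))                # first three digits only
print(generalize_date(date(1980, 5, 17)))     # year only
print(generalize_age(93))                     # aggregated category
```

The remaining categories, such as names or account numbers, have no generalized form under Safe Harbor and would simply be dropped from the record.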
Removal of Identifiers
Removing 18 identifiers from personal data can seem to be an arduous, almost insurmountable task. This is especially true as information or data is displayed in various forms and formats, making it even more difficult to catch any potential identifiers. The U.S. Department of Health and Human Services even goes as far as to say that “[t]he de-identification standard makes no distinction between data entered into standardized fields and information entered as free text (i.e., structured and unstructured text) — an identifier listed in the Safe Harbor standard must be removed regardless of its location in a record if it is recognizable as an identifier.” So how can we be sure to find every identifier in a given set of data?
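To make the free-text problem concrete, here is a minimal sketch of pattern-based scrubbing in Python. The patterns, labels, and the `scrub` function are our own assumptions for illustration; they cover only a handful of the 18 categories and only common U.S. formats. Identifiers such as names cannot be caught this way, which is one reason dedicated de-identification tools exist.

```python
import re

# Hypothetical, deliberately simple patterns. They will miss many real-world
# variants and should not be treated as a complete Safe Harbor filter.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "URL": re.compile(r"https?://\S+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub(text: str) -> str:
    """Replace each pattern match with a bracketed category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Call 555-123-4567 or email jane.doe@example.com; SSN 123-45-6789."
print(scrub(note))
```

Because the labels are substituted in place, the surrounding text stays readable, but anything the patterns do not anticipate passes through untouched.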
One study by the Biomedical Informatics Center at the Medical University of South Carolina addressed similar concerns. The study tested multiple de-identification tools on the market, including AWS Comprehend Medical PHId, Clinacuity’s CliniDeID, and the National Library of Medicine’s (NLM) Scrubber, comparing their speed and accuracy in the de-identification process. The study found that “[n]o single system dominated all the compared metrics. NLM Scrubber was the fastest, while CliniDeID generally had the highest accuracy.” The study further notes that there were some gaps in category coverage when AWS Comprehend was compared with CliniDeID. That said, supplementing AWS Comprehend with a secondary system could extend its category coverage.
Risk of Re-Identification under the Safe Harbor Method
Even when these tools are used, the possibility of re-identification still looms over the data. Several studies have been conducted to address that concern and to determine whether the Safe Harbor method is sufficient to prevent re-identification.
The Vanderbilt Health Information Privacy Lab answered that question in the affirmative. In a state-by-state experiment, the lab concluded that, for data de-identified under the Safe Harbor method in which the only remaining identifying information is year of birth, sex, and 3-digit ZIP code, the risk of unique re-identification was between 0.01% and 0.25%.
Similarly, the Office of the National Coordinator for Health Information Technology (ONC HIT) at the U.S. Department of Health and Human Services tested the HIPAA Safe Harbor method using 15,000 hospital admission records of Hispanic individuals between 2004 and 2009. The researchers relied on U.S. Census data and commercially available data from InfoUSA to match the hospital records using age, sex, and the first three digits of the zip code (as allowed by HIPAA). The study concluded that the re-identification risk was only 0.22%, using conservative estimates.
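A rough way to see what such studies measure is to count how many records in a dataset have a combination of quasi-identifiers that no other record shares. The sketch below is our own simplification, not the methodology of either study, which estimated risk against population-level data rather than within a single sample. The field names `birth_year`, `sex`, and `zip3` are assumptions.

```python
from collections import Counter


def unique_fraction(records, keys=("birth_year", "sex", "zip3")):
    """Return the share of records whose quasi-identifier combination
    appears exactly once in the dataset."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(combos) if combos else 0.0


# Tiny made-up sample: the first two records share a combination,
# the last two are each unique.
sample = [
    {"birth_year": 1980, "sex": "F", "zip3": "902"},
    {"birth_year": 1980, "sex": "F", "zip3": "902"},
    {"birth_year": 1975, "sex": "M", "zip3": "902"},
    {"birth_year": 1980, "sex": "F", "zip3": "100"},
]
print(unique_fraction(sample))
```

A record that is unique within the dataset is not necessarily re-identifiable, since an attacker still needs an outside source that links the same combination to a named person. The measure is best read as an upper-bound style warning sign.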
Irrespective of the de-identification method, the reality is that there will always be a chance of re-identification, albeit a very small one. Based on several independent studies, the Safe Harbor method appears to offer the best balance between the risk of re-identification and the need to retain some utility in the data.
Nevertheless, it is important to note that, beyond the listed identifiers, there are other and sometimes unconventional data points that allow for re-identification, such as medical test results. One study, titled “Reducing patient re-identification risk for laboratory results within research datasets,” used a database of 8.5 million Safe Harbor de-identified lab results sourced from just over 61,000 patients. Researchers found that a sequence of just four consecutive laboratory test results was enough to uniquely distinguish between 34% and 100% of the patients. With sequences of five and six laboratory test results, the likelihood rose to more than 95% of the 61,000 unique patients being re-identified.
In a more practical scenario, another study focused on the re-identifiability of credit card metadata. Using a sample of 1.1 million people from an unnamed country, the researchers found that four distinct points in space and time were sufficient to uniquely identify 90% of the individuals in their sample. However, when the researchers coarsened the geographic detail and simplified transaction values (e.g., a purchase of $14.86 would instead be represented as a range between $10.00 and $19.99), the number of points required increased, making re-identification more difficult.
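The value simplification described above can be sketched as a small binning function. The function name and the fixed $10 bin width are assumptions chosen to match the $14.86 example; the study itself may have used different ranges.

```python
import math


def bin_amount(amount: float, width: float = 10.0) -> str:
    """Replace an exact amount with the fixed-width range containing it."""
    lower = math.floor(amount / width) * width
    upper = lower + width - 0.01
    return f"${lower:.2f}-${upper:.2f}"


print(bin_amount(14.86))  # $10.00-$19.99
```

Wider bins make each transaction less distinctive, at the cost of analytical detail. The same trade-off applies to locations and timestamps, which could be coarsened in a similar way (for example, to a region or a calendar day).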
Considerations Moving Forward
As we’ve seen, the HIPAA Safe Harbor method is the optimal choice to de-identify data due to its straightforward approach and feasibility. But simply following the method may not completely remove the core issue of re-identification. In the event of a data breach, a clever attacker may use other data points to uniquely re-identify individuals. Entities that use de-identified data should account for what sort of data is readily available and could be used to cross-reference their de-identified data points. By accounting for these possibilities, entities can further generalize their de-identified data so that more data points are required to uniquely re-identify an individual, as illustrated by the study on re-identification using credit card metadata. Regardless of this possibility, de-identification is currently the best choice for entities to process data while limiting their exposure to data privacy laws.