This is part 4 in our series on data privacy in passive metering studies. In this installment we discuss how to balance privacy and data utility. Looking to catch up on the series? Check out the previous parts.
Consumers create millions of moments on their devices. The resulting data entries can contain direct identifiers and together form a sensitive portrait of the participant. Once you hold an individual's confidential information collected through passive metering, it is important to keep it confidential.
In an internal study conducted in March 2016, we found that a simple visual inspection of a clickstream of behavioral data made it possible to identify 85% of the users in the dataset (real names, emails, and zip codes).
We therefore recommend two measures to ensure that, in case of a breach, the leaked data won't contain personally identifiable information (PII).
First, the data collected by the meter should be dissociated from the participant's PII. A common practice in the industry is to store a participant's data in two databases: one containing only their PII (e.g. name, address, zip code) and one containing the behavioral data (e.g. navigation history). If re-identification is needed, a unique identifying number (unknown to a potential attacker) can be used to match the two databases. However, PII can still be present in the data collected by the meter. For example, if a participant, say John Smith, visits his Facebook profile, the meter will send us the URL www.facebook.com/john.smith. Dissociation won't help here, since the participant's full name is present in the data itself.
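To make the dissociation idea concrete, here is a minimal sketch in Python. The in-memory dictionaries, function names, and random join key are our own illustration, not our production schema; in practice the two stores would be separate databases with the join key kept away from both.

```python
import secrets

# Hypothetical stand-ins for the two separate databases.
pii_db = {}        # join key -> personally identifiable information
behavior_db = {}   # join key -> behavioral (clickstream) records

def register_participant(name, zip_code):
    """Store PII under a random join key, unknown to a potential attacker."""
    join_key = secrets.token_hex(16)
    pii_db[join_key] = {"name": name, "zip": zip_code}
    return join_key

def record_visit(join_key, url):
    """Behavioral data is stored separately, keyed only by the join key."""
    behavior_db.setdefault(join_key, []).append(url)

key = register_participant("John Smith", "1012 AB")
record_visit(key, "www.facebook.com/john.smith")

# Re-identification requires both databases AND the join key.
assert pii_db[key]["name"] == "John Smith"
```

Note how the leak risk remains: the behavioral record itself still contains "john.smith" in the URL, which is exactly the gap the second measure addresses.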
A second measure, less common in the industry, is to anonymize the chunk of the URL that contains PII and aggregate the information at a higher level of abstraction (e.g. by retaining only the main domain). A potential leak would then not reveal the identity of the participants. However, the consumers' browsing paths would also be removed: www.amazon.com/ instead of www.amazon.com/store/books/userid=john.smith.
In this case, the only information kept would be that the user visited Amazon, not which product the user looked for. Privacy is achieved, but at a substantial loss of information.
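This domain-level aggregation can be sketched in a few lines of Python using the standard library's URL parser (the function name and scheme handling here are our own assumptions):

```python
from urllib.parse import urlsplit

def truncate_to_domain(url):
    """Aggressive anonymization: keep only the host and drop the path,
    which may embed PII such as a user name."""
    # urlsplit needs a scheme to locate the host, so add one if missing.
    parts = urlsplit(url if "://" in url else "https://" + url)
    return parts.netloc + "/"

truncate_to_domain("www.amazon.com/store/books/userid=john.smith")
# -> "www.amazon.com/"
```

The PII is gone, but so is every path: this is the blunt end of the privacy/utility trade-off.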
Balancing privacy and data utility
At Wakoopa, we developed and implemented a model that aims to strike a good balance between privacy and utility. Automatically anonymizing large datasets without a major loss of data utility is not an easy task, however: it is very difficult to determine which parts of a URL might identify someone.
Our approach is based on the idea of masking the parts of the URL that contain direct identifiers. The logic behind our filter is statistical: if any part of a URL is personal (a parameter, a name, etc.), then it is very unlikely that the URL has more than one visitor. The basis of our masking procedure is therefore to generalize URL components until at least K participants in our dataset have visited the generalized URL. This procedure effectively reduces the possibility of identifying participants.
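The generalization step can be illustrated with a small Python sketch. This is a simplified toy version under our own assumptions (prefix-based generalization over path segments, an in-memory visit log), not our production filter:

```python
from collections import defaultdict
from urllib.parse import urlsplit

K = 3  # minimum number of distinct visitors a retained URL must have

def prefixes(url):
    """Yield generalizations of a URL, from most to least specific:
    the full path, then each shorter path prefix, down to the bare domain."""
    parts = urlsplit("https://" + url)
    segments = [s for s in parts.path.split("/") if s]
    for i in range(len(segments), -1, -1):
        yield parts.netloc + "/" + "/".join(segments[:i])

def mask(visits, k=K):
    """visits: list of (participant_id, url) pairs. Return each URL
    generalized to its most specific prefix visited by >= k participants."""
    visitors = defaultdict(set)
    for pid, url in visits:
        for p in prefixes(url):
            visitors[p].add(pid)
    masked = []
    for pid, url in visits:
        for p in prefixes(url):
            if len(visitors[p]) >= k:  # keep the first prefix that is "popular enough"
                masked.append(p)
                break
    return masked
```

For example, if three participants visit www.amazon.com/store/books and a fourth visits www.amazon.com/store/books/userid=john.smith, the personal URL is generalized up to www.amazon.com/store/books, which all four share, so the identifying segment never appears in the output.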
Unfortunately, this strategy filters out not only PII but also any other data that happens to have been visited by very few people in our dataset. This is unavoidable: any de-identification process based on generalization involves a trade-off between the risk of identification and the amount of information preserved. Or, as a foundational (and bitter) belief about data privacy states:
"Data can be either useful or perfectly anonymous but never both." (Ohm 2010)
It is a balancing act, but we chose to be conservative and favor privacy. A key point, however, is that the information lost has by definition little statistical value in the dataset, because it occurs only in small amounts. In that sense, we believe the quality of the data for opinion and market research purposes is reasonably well preserved.
We hope the implementation of such a model becomes a standard in the industry, as other companies are beginning to use it, too.