Mozilla research: Browsing histories are unique enough to reliably identify users
A recently published study conducted by three Mozilla employees has looked at the privacy provided by browsing histories.
Their findings show that most users have unique web browsing habits that allow online advertisers to create accurate profiles.
These profiles can then be used to track and re-identify users across different sets of user data that contain even small samples of a user’s browsing history.
In effect, the study dispels the online myth that browsing history, even when anonymized, isn’t useful to online advertisers. In reality, the study shows that even a small list of 50 to 150 of a user’s favorite and most visited domains can let advertisers create a unique tracking profile.
Confirming a similar 2012 study
The Mozilla research paper is named “Replication: Why We Still Can’t Browse in Peace: On the Uniqueness and Reidentifiability of Web Browsing Histories” [PDF].
The paper was presented earlier this month at the USENIX security conference, and is a follow-up to another academic study published in 2012 [PDF].
This first study was one of the biggest projects analyzing user privacy at the time, and a massive undertaking for the research team, which collected browser history data from more than 380,000 internet users.
Between January 2009 and May 2011, researchers asked users to visit an online test site that used some clever CSS code to determine which websites, from a predefined list of 6,000 domains, each user had visited.
The 2012 study found that 97% of the users who accessed this test site had a unique list of sites in their browsing history, making browser history a solid user fingerprinting vector.
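To make the uniqueness figure concrete, here is a minimal sketch of how such a percentage could be computed. It assumes each user's history is reduced to a set of visited domains; the function name and the sample data are hypothetical and are not taken from either study.

```python
from collections import Counter

def unique_fraction(profiles):
    """Share of users whose set of visited domains matches no other user's set."""
    if not profiles:
        return 0.0
    # frozenset makes each profile hashable and ignores visit order.
    counts = Counter(frozenset(domains) for domains in profiles.values())
    unique_users = sum(
        1 for domains in profiles.values() if counts[frozenset(domains)] == 1
    )
    return unique_users / len(profiles)

# Hypothetical sample: user_b and user_c share the same profile.
profiles = {
    "user_a": {"example.com", "news.example", "mail.example"},
    "user_b": {"example.com", "news.example"},
    "user_c": {"example.com", "news.example"},
    "user_d": {"shop.example"},
}
print(unique_fraction(profiles))  # 0.5
```

In this toy data set, two of the four users have a profile nobody else shares, so the unique fraction is 0.5. The 2012 study reported 97% on its real data.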
Furthermore, when users were asked to access the test site again, researchers said they were able to re-identify users based on their browsing history profiles from the first visit.
Accuracy rates were 38% when researchers looked at browsing history data sets of 50 of the user’s most popular domains, and 70% when they analyzed data sets with 500 domains.
The Mozilla 2020 paper
But last year, Mozilla researchers wanted to re-evaluate whether browsing history was still a valid fingerprinting vector and whether the 2012 study’s findings still held true.
The new experiment ran between July 16 and August 13, 2019, when Mozilla prompted Firefox users to take part.
Mozilla researchers said that more than 52,000 users agreed to take part and provide anonymous browsing data.
However, this time around, since the data was collected from Firefox itself and not through a web page performing a time-consuming CSS test, the data was much more accurate and reliable. Furthermore, the data Mozilla researchers collected is roughly the same type of data that today’s online analytics companies collect about users, whether through data partnerships, mobile apps, online ads, or other mechanisms.
Just like before, the data collection took place in two stages over two weeks, with users sharing browsing history in the first week and then again in the second, so Mozilla researchers could test whether they could re-identify users.
In total, the Mozilla team said it collected data about 35 million website visits to 660,000 unique domains. And this access to better quality data was immediately reflected in the study’s findings.
Mozilla said that 99% of the browsing profiles they collected for the study were unique to each user.
This uniqueness allowed Mozilla researchers to easily re-identify users during the second week of the study.
Accuracy was also superior to the 2012 study, with Mozilla claiming it had a nearly 50% reidentifiability rate for data sets containing 50 domains of a user’s browsing history. This reidentifiability rate grew to over 80% when Mozilla researchers expanded the browsing history data set to 150 domains.
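The re-identification step described above can be sketched as a nearest-match search between the two weeks of data. The article does not say which similarity measure the researchers used, so this sketch assumes Jaccard similarity between sets of domains; all function names and sample profiles are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def reidentify(week1, week2):
    """For each week-2 profile, guess the week-1 user with the most similar domains."""
    return {
        user2: max(week1, key=lambda user1: jaccard(week1[user1], domains2))
        for user2, domains2 in week2.items()
    }

def reidentification_rate(week1, week2):
    """Fraction of week-2 profiles matched back to the correct week-1 user."""
    guesses = reidentify(week1, week2)
    correct = sum(1 for user, guess in guesses.items() if guess == user)
    return correct / len(guesses)

# Hypothetical profiles; the shared keys are used only to score the guesses.
week1 = {
    "alice": {"a.example", "b.example", "c.example"},
    "bob": {"a.example", "d.example", "e.example"},
    "carol": {"f.example", "g.example"},
}
week2 = {
    "alice": {"a.example", "b.example", "h.example"},
    "bob": {"a.example", "d.example", "e.example", "i.example"},
    "carol": {"a.example", "b.example"},
}
print(reidentify(week1, week2))
print(reidentification_rate(week1, week2))
```

In this toy data, alice and bob are matched correctly because their habits overlap from week to week, while carol's second-week browsing looks more like alice's first week and is misattributed, giving a rate of two out of three.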
This latter finding suggests that analytics firms and online advertisers don’t need huge lists of browsing history data in order to track users. Each user’s browsing quirks and favorite sites eventually give them away, even when the data is anonymized and URLs are truncated to remove usernames and leave only core domains.
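As an illustration of the URL truncation mentioned above, the sketch below reduces a URL to its hostname using Python's standard library. It is a simplified assumption about what such truncation might look like, not the procedure Mozilla used, and the helper name `to_domain` is hypothetical.

```python
from urllib.parse import urlsplit

def to_domain(url):
    """Reduce a URL to its lowercase hostname.

    Drops the scheme, any user:password@ prefix, the port, the path
    (where usernames often appear), the query string and the fragment.
    Returns None when the input has no host part.
    """
    return urlsplit(url).hostname

print(to_domain("https://alice:secret@Mail.Example.com:8443/inbox?id=7#top"))
# mail.example.com
print(to_domain("https://social.example/users/alice"))
# social.example
```

This naive version keeps subdomains. Collapsing hostnames to a registrable "core" domain would additionally require a list of public suffixes, which the sketch leaves out.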
A video of the Mozilla team’s presentation is available here.