How nine out of ten healthcare pages leak private data

A study by Timothy Libert, a doctoral student at the University of Pennsylvania, has found that nine out of ten visits to health-related web pages result in data being leaked to third parties like Google, Facebook and Experian:

There is a significant risk to your privacy whenever you visit a health-related web page. An analysis of over 80,000 such web pages shows that nine out of ten visits result in personal health information being leaked to third parties, including online advertisers and data brokers.

What Libert discovered is a widespread repetition of the flaw that the US government’s flagship Healthcare.gov website was dragged over the coals for in January.

The sites in question use code from third parties to provide things like advertising, web analytics and social media sharing widgets on their pages. Because of the way those kinds of widgets work, their third party owners can see what pages you’re visiting.

The companies supplying the code aren’t necessarily seeking information about what you’re looking at but they’re getting it whether they want it or not.

So if you browse the pages about genital herpes on the highly respected CDC (Centers for Disease Control and Prevention) site, you’ll also be telling marketing mega-companies Twitter, Facebook and AddThis that you have an interest in genital herpes.

It happens like this: when your browser fetches a web page, it also fetches any third party code embedded in it directly from the third parties’ websites. The requests sent by your browser contain an HTTP header (the annoyingly misspelled ‘referer’ header) that includes the URL of the page you’re looking at.

Since URLs tend to contain useful, human-readable information about what you’re reading, those requests can be quite informative.

For example, looking at a CDC page about genital herpes triggers a request to addthis.com like this:

GET /js/300/addthis_widget.js HTTP/1.1
Host: s7.addthis.com
…
Referer: http://www.cdc.gov/std/Herpes/default.htm
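
To make that concrete, here is a minimal Python sketch of what the receiving end could do with a request like the one above. It is only an illustration: the function names (parse_headers, describe_referer) are my own invention, the headers elided by the "…" in the request are simply left out, and nothing here is meant to describe how AddThis or anybody else actually processes requests.

```python
from urllib.parse import urlsplit

# The request shown above, minus the headers elided by "…".
RAW_REQUEST = (
    "GET /js/300/addthis_widget.js HTTP/1.1\r\n"
    "Host: s7.addthis.com\r\n"
    "Referer: http://www.cdc.gov/std/Herpes/default.htm\r\n"
    "\r\n"
)


def parse_headers(raw_request):
    """Return the request's headers as a dict with lower-cased names."""
    headers = {}
    for line in raw_request.split("\r\n")[1:]:  # skip the request line
        if not line:  # a blank line ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def describe_referer(raw_request):
    """Hypothetical helper: split the Referer URL into site and path parts.

    Returns None if the request carries no Referer header.
    """
    referer = parse_headers(raw_request).get("referer")
    if referer is None:
        return None
    parts = urlsplit(referer)
    segments = [s for s in parts.path.split("/") if s]
    return {"site": parts.hostname, "path_segments": segments}


info = describe_referer(RAW_REQUEST)
print(info)
# {'site': 'www.cdc.gov', 'path_segments': ['std', 'Herpes', 'default.htm']}
```

No clever analysis is needed: the site name and the words "std" and "Herpes" fall straight out of the URL.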

The fact that embedded code gets URL data like this isn’t new. It’s part of how the web is designed and, like it or not, some third parties actually rely on it: Twitter, for example, uses it to power its Tailored Suggestions feature.

What’s new, or perhaps what’s changed, is that we’re becoming more sensitive to the amount of data we all leak about ourselves and, of course, health data is among the most sensitive.

While a single data point such as one visit to one web page on the CDC site doesn’t amount to much, the fact is we’re parting with a lot of data and sharing it with the same handful of marketing companies.

We do an awful lot of healthcare research online and we tend to concentrate those visits around popular sites.

A 2012 survey by the Pew Research Center found that 72% of internet users say they looked online for health information within the past year. That explains why one of the sites mentioned in the study, WebMD.com, is the 106th most popular website in the USA and ranked 325th in the world.

The study describes the data we share as follows:

…91 percent of health-related web pages initiate HTTP requests to third-parties. Seventy percent of these requests include information about specific symptoms, treatment, or diseases (AIDS, Cancer, etc.). The vast majority of these requests go to a handful of online advertisers: Google collects user information from 78 percent of pages, comScore 38 percent, and Facebook 31 percent. Two data brokers, Experian and Acxiom, were also found on thousands of pages.

If we assume that it’s possible to infer an individual’s recent medical history from the healthcare pages they’ve browsed over a number of years then, taken together, those innocuous individual page views add up to something very sensitive.
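
As a rough sketch of how individual page views could add up, the Python below groups logged referer URLs by visitor. Everything in it is an assumption made for illustration: the visitor IDs, the health.example URLs and the build_profiles function are invented, and the study does not describe how any named company links requests together. I'm simply assuming some stable identifier (a cookie, say) arrives with each request.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical log entries: (visitor_id, referer). Both are invented.
REQUEST_LOG = [
    ("visitor-a", "http://health.example/conditions/asthma/symptoms"),
    ("visitor-b", "http://health.example/conditions/migraine/overview"),
    ("visitor-a", "http://health.example/conditions/asthma/treatment"),
    ("visitor-a", "http://health.example/conditions/eczema/overview"),
]


def build_profiles(log):
    """Group the topics found in referer URLs by visitor ID."""
    profiles = defaultdict(set)
    for visitor_id, referer in log:
        segments = [s for s in urlsplit(referer).path.split("/") if s]
        # Assumption: the topic is the path segment after "conditions".
        if len(segments) >= 2 and segments[0] == "conditions":
            profiles[visitor_id].add(segments[1])
    return {visitor: sorted(topics) for visitor, topics in profiles.items()}


profiles = build_profiles(REQUEST_LOG)
print(profiles)
# {'visitor-a': ['asthma', 'eczema'], 'visitor-b': ['migraine']}
```

Each log line on its own says very little, but a few lines of grouping turn them into a list of conditions per visitor.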

As the study’s author puts it:

Personal health information … has suddenly become the property of private corporations who may sell it to the highest bidder or accidentally misuse it to discriminate against the ill.

There is no indication or suggestion that the companies Libert named are using the health data we’re sharing, but they are at least being made unwitting custodians of it and that carries some serious responsibilities.

Although there is nothing in the leaked data that identifies our names or identities, it’s quite possible that the companies we’re leaking our health data to have them already.

Even if they don’t though, we’re not in the clear.

Even if Google, Facebook, AddThis, Experian and all the others are at pains to anonymise our data, I wouldn’t bet against individuals being identified in stolen or leaked data.

It’s surprisingly easy to identify named individuals within data sets that have been deliberately anonymised.

For example, somebody with access to my browsing history could see that I regularly visit Naked Security for long periods of time and that those long periods tend to happen immediately prior to the appearance of articles written by Mark Stockley.

For a longer and more detailed look at this phenomenon, see Paul Ducklin’s excellent article ‘Just how anonymous are “anonymous” records?’

It’s possible to stop this kind of data leak by setting up your browser so it doesn’t send referer headers, but I wouldn’t rely on that because there are other ways to leak data to third parties.

Instead I suggest you use browser plugins like NoScript, Ghostery or the EFF’s own Privacy Badger to control which third party sites you have any interaction with at all.

What the study hints at is bigger than that though – what it highlights is that we live in the era of Big Data and we’re only just beginning to understand some of the very big implications of small problems that have been under our noses for years.