Web scraping, the automated extraction of data from websites, has been around for a long time. It has been simultaneously cursed and praised, with nobody quite able to land the decisive blow on whether it should be allowed.
This may have changed, thanks to a recent US appeals court ruling.
A tangled web of scraped content
LinkedIn (and, by extension, Microsoft) is not impressed with people or organisations scraping publicly available data from its site. In fact, they’re so massively not impressed by the practice that things turned legal in 2017, when LinkedIn sent a cease-and-desist. The social network objected to a company scraping public data from its pages, and the story rumbled into 2019 with another setback for the LinkedIn / Microsoft combo.
Last year, the data scraping saga was given one final chance to swing a decision in favour of scraping being viewed as a very bad thing. The decision has now been made, and it’s not good news for LinkedIn. According to the appeals court, scraping public data is not a violation of the Computer Fraud and Abuse Act.
LinkedIn has vowed to keep on fighting this one. However, is scraping really that big a deal?
The case for
- The main argument in favour of scraping is that it is not a violation of privacy. It’s simply making use of content that has already been shared publicly.
- It’s fantastic for archival purposes. Thanks to link rot and link reuse, huge chunks of the Internet simply vanish on a daily basis: Websites go bust, pages are moved or removed.
The case against
- People who agree to share data on a site like LinkedIn probably don't expect their data to be hoovered up by third parties, and may not even realise it's possible. As a result, they may not understand the implications of sharing their personal information publicly. If the only safe course of action is to simply post nothing, that feels like quite a big chilling effect.
- Sometimes pages or sites go missing because the site owner wants them to go missing. There may be privacy reasons, or security issues, or something else altogether involved. Some archival sites and services will allow you to block their crawlers, but it can be a convoluted process and often involves a certain time and effort investment. Should people have to pre-emptively hunt down all the archival services in the first place to ensure something isn’t immortalised online forever?
- Scraping can have a big impact on sites and services generally. It can be a little overwhelming for a small site owner to try and stop content thieves and scrapers repurposing their content for ad clicks. Sometimes sites will grab content and place it alongside malware or phishing for an additional twist of “please stop doing that”.
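For site owners wondering what "blocking a crawler" can look like in practice, one widely used mechanism is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check which URLs a set of rules would close off to a given crawler. The user-agent name `ExampleArchiveBot`, the example.com URLs and the `is_allowed` helper are all made up for illustration, and the whole thing assumes the crawler in question chooses to honour robots.txt, which is voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out one (made-up) archive crawler completely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so nothing is fetched over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleArchiveBot", "SomeOtherBot"):
        for path in ("/about", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Rules like these only state a preference. A scraper that ignores robots.txt will not be stopped by them, which is part of why the burden still lands on the site owner.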
It's verdict time
As you can see, I’m probably leaning more towards siding with LinkedIn on this one. Even so, with this latest decision in place and with so many frankly worrying ways scraped data can be misused, perhaps we are edging towards that previously mentioned chilling effect. One thing’s for sure, we’ll see this one back in a courtroom somewhere down the line.
As far as your own data goes, keep all of the above in mind. That one random photograph could be sucked up into a facial recognition platform. Your tweet from 11 years ago could be aggregated with other data about you in ways you hadn't anticipated. That incredibly awesome public work profile you created may just pull in a bunch of spammers and con artists.
Prune accordingly, and keep the really sensitive stuff away from public view. That way, no matter the end result of any number of court cases, you'll hopefully still have a firm grip on where your most important data ends up.