Website owners are once again at war with tools designed to scrape content from their sites. An AI scraper called img2dataset is scouring the Internet for pictures that can be used to train image-generating AI tools.
These generators are increasingly popular text-to-image services, where you enter a suggestion (“A superhero in the ocean, in the style of Van Gogh”) and the service produces a visual to match. Since the system's "understanding" of images is a direct result of what it was trained on, there is an argument that what it produces consists of bits and pieces of all that training data. There’s a good chance there may be legal issues to consider, too. This is a major point of contention for artists and creators of online content generally. Visual artists don’t want their work being sucked up by AI tools (that make someone else money) without permission.
Unfortunately for the French creator of img2dataset, website owners are very much dissatisfied with his approach to harvesting images.
The free program “turns large sets of image URLs into an image dataset”. It’s claimed the tool can “download, resize, and package 100 million URLs in 20 hours on one machine”. That’s a lot of URLs.
What’s aggravating site owners is that the tool ignores what have long been assumed to be the rules of good netiquette. Way back in 1994, “robots.txt” was created as a polite way to let crawlers know which parts of a website they were allowed to visit. Search engines could be told “Yes please”. Other kinds of crawlers could be told “No thank you”. Many rogues would simply ignore a site’s robots.txt file, and end up with a bad reputation as a result.
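To make the “Yes please” / “No thank you” idea concrete, here is a minimal sketch of how a well-behaved crawler could consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The robots.txt contents, the example.com URLs, the "SomeScraper" user agent, and the is_allowed helper are all made up for illustration; none of them come from img2dataset or any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named search engine crawler is welcome
# everywhere, every other crawler is asked to stay out of /images/.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /images/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    image_url = "https://example.com/images/cat.jpg"
    print(is_allowed(ROBOTS_TXT, "Googlebot", image_url))
    print(is_allowed(ROBOTS_TXT, "SomeScraper", image_url))
```

Nothing enforces the answer. The file is only a request, so a crawler that never performs this check can fetch the images regardless.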
This is one of the main complaints where img2dataset is concerned. Website owners contend that it's not practical to tell every tool in existence that they wish to opt out. Rather, the tool should be opt-in. This is a reasonable concern, especially as site owners would essentially be responsible for adding ever more entries to their site configuration on a daily basis.
One site owner had this to say, in a mail sent to Motherboard:
I had to pay to scale up my server, pay extra for export traffic, and spent part of my weekend blocking the abuse caused by this specific bot.
Elsewhere, you can see a deluge of complaints from site owners on the tool’s “Issues” discussion page. Issues of consent, custom headers, even talk of the creator being sued: It’s chaos over there.
If you’re a site owner who isn’t keen on img2dataset paying a visit, there are a number of ways you can tell it to keep a respectful distance. From the opt-out directives section:
Websites can use these http headers: “X-Robots-Tag: noai”, “X-Robots-Tag: noindex”, “X-Robots-Tag: noimageai”, and “X-Robots-Tag: noimageindex”. By default, img2dataset will ignore images with such headers.
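As a rough illustration of what honouring those headers involves, the sketch below checks a list of X-Robots-Tag values against the four directives quoted above. It is not img2dataset's actual implementation: the is_opted_out function and its parameters are hypothetical, and it skips details of real-world headers, such as directives scoped to a specific user agent.

```python
# Directives named in the opt-out section quoted above.
DEFAULT_DISALLOWED = frozenset({"noai", "noindex", "noimageai", "noimageindex"})


def is_opted_out(x_robots_tag_values, disallowed=DEFAULT_DISALLOWED):
    """Return True if any X-Robots-Tag value carries a disallowed directive.

    x_robots_tag_values: the raw values of every X-Robots-Tag header on a
    response, for example ["noai, nofollow"].
    """
    for value in x_robots_tag_values:
        # A single header may hold several comma-separated directives.
        for directive in value.split(","):
            if directive.strip().lower() in disallowed:
                return True
    return False


if __name__ == "__main__":
    print(is_opted_out(["noai"]))        # site has opted out
    print(is_opted_out(["nofollow"]))    # unrelated directive only
    print(is_opted_out(["noai"], disallowed=frozenset()))  # check switched off
```

The last call shows why an opt-out of this kind is only as strong as the crawler's settings: with an empty set of disallowed directives, every image passes the check.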
However, the FAQ also says this for users of the img2dataset tool:
To disable this behaviour and download all images, you may pass “--disallowed_header_directives ''”
This does exactly what it suggests, ignoring the "please leave me alone" warning and grabbing all available images. It’s no wonder, then, that website owners are currently so hot and bothered by this latest slice of website scraping action. With little apparent interest in robots.txt from the creator, and workarounds to ensure users can grab whatever they like, this is sure to rumble on.