Google plans to scrape everything you post online to train its AI

Additions to Google’s Privacy Policy have some observers worried that all of your content is about to be fed into Google’s AI tools. The updated wording now explicitly states that your “publicly available information” may be used to train Google’s in-house AI models and to build other products.

From the Privacy Policy page:

In some circumstances, Google also collects information about you from publicly accessible sources. For example, if your name appears in your local newspaper, Google’s search engine may index that article and display it to other people if they search for your name. We may also collect information about you from trusted partners, such as directory services who provide us with business information to be displayed on Google’s services, marketing partners who provide us with information about potential customers of our business services, and security partners who provide us with information to protect against abuse. We also receive information from advertising partners to provide advertising and research services on their behalf.

You may be wondering where the reference to AI comes into play here. Me too! I’ve given talks on EULAs and on some of the most excessive privacy policies around. I waded through every section of the privacy policy page and still couldn’t find the relevant text. It eventually had to be pointed out to me that what look like hyperlinks leading off-site are actually links that pop open additional information on the terms used.

With this in mind, going back to the above extract, we need to click on “Publicly accessible sources” to see the following:

For example, we may collect information that’s publicly available online or from other public sources to help train Google’s AI models and build products and features, like Google Translate, Bard and Cloud AI capabilities. Or, if your business’ information appears on a website, we may index and display it on Google services.

Given the controversy over AI use generally, tucking this information away where it is easily missed doesn’t seem like the best idea. It should arguably be a lot more prominent on the page.

What does this mean in plain terms? In pre-AI times, if you posted something online, whether a blog, a photograph, a piece of music or something else, there’s a good chance it would end up scraped by a search engine. This is how search engines work, and this is how you find the content you’re looking for when entering search terms. 

What Google is saying here is that all of the above will still happen. The new addition means your text, photos, and music could also end up helping to train its products and “AI models”.

As Gizmodo notes, the policy previously referenced only the popular Translate tool. Now Bard and Cloud AI are thrown into the mix. Bard is Google’s AI chat service, and if you were wondering: it does indeed make use of images. It ran into teething problems shortly after release, sharing false information in its own announcement. It’s no wonder that Google would try to make as much data as possible up for grabs to feed its ever-hungry AI tools.

With so many AI tools doing things like falsely claiming that people have written articles or just running into copyright trouble generally, we have no real way to know if this will actually improve anything. You may have had some objections to search engines making bank from content you post online, but there is some positive return there in the form of your content being placed in front of people. Now we have AI spam posing a threat to said engines, while your content is potentially being monetised twice over with new AI policies coming into force.

Although the initial outlook for AI-generated content and scraping looks grim, it’s arguable whether the current spam-laden system is much better. The problem is that we may just be trading one set of poor results and faulty tools for another.


ABOUT THE AUTHOR

Christopher Boyd

Former Director of Research at FaceTime Security Labs. He has a very particular set of skills. Skills that make him a nightmare for threats like you.