Meta is using your public Facebook and Instagram posts to train its AI

Post anything publicly on Facebook and Instagram? Meta has likely been using those posts to train its AI, according to the company’s top policy executive.

In an interview with Reuters, Meta President of Global Affairs Nick Clegg said the company used the public posts to train the LLM (large language model) that feeds into its new Meta AI virtual assistant.

Large language models (LLMs) are huge deep neural networks trained on billions of pages of written material in a particular language, such as books, articles, and websites. So, in the ongoing race between tech giants to create the best LLM, it’s hardly surprising that they’re looking at social media as a giant source of data.

Clegg said that Meta excluded private posts shared only with family and friends, as well as private chats on its messaging services:

“We’ve tried to exclude datasets that have a heavy preponderance of personal information,” he said, adding that the “vast majority” of the data used by Meta for training was publicly available.

He also said they decided against using LinkedIn content for privacy reasons.

In separate news, X (formerly Twitter) updated its Terms of Service to let it use tweets for AI training. In July 2023, Elon Musk announced the launch of xAI to “understand the true nature of the universe.” In more realistic terms, it looks like xAI will set out to compete with companies like OpenAI, Google, and Microsoft, which are behind leading chatbots like ChatGPT, Bard, and others.

Given that Musk threatened to sue Microsoft for using Twitter data for training, it may come as a surprise to some that the policy change states:

“We may use the information we collect and publicly available information to help train our machine learning or artificial intelligence models for the purposes outlined in this policy.”

Musk has already said that xAI will use public tweets for AI model training. In a tweet responding to comments about the policy change, he clarified that the plan is to use “just public data, not DMs or anything private.”

So, that seems to be the consensus about what is acceptable to scrape from your social media presence. If others can see it, it’s public knowledge, and the tech giants are of the opinion they can use it to train their AI.

To me, there is a world of difference between data being publicly available and that data being fed into an AI that can combine it with information from other sources faster than any human could.

Another undesirable side effect of these developments is that the social media giants are shifting the responsibility for scraping copyright-protected media, and for unwittingly using it, onto their users. Asked whether Meta had taken any steps to avoid the reproduction of copyrighted imagery, a Meta spokesperson pointed to new terms of service barring users from generating content that violates privacy and intellectual property rights. In other words, it’s not our problem but the user’s.

What to do

Now more than ever, you should assume that anything you post on social media is up for grabs for anyone. An extra point of attention is the use of copyrighted material in your posts. The social media companies will not think twice about using it, and will hold you responsible even though they copied it from you without asking.

And please don’t believe all the posts, especially rampant on Facebook, claiming that you can protect your content by copying and pasting some 10-year-old post that has done at least twenty laps on the platform. You can’t.

Be warned that based on the AI-assigned definitions in the updated terms by X, your access to certain content might be limited, or even cut off. As a consequence, you may find it harder to reach your intended audience. Maybe it really is time to switch to a different platform.
