Meta’s AI Assistant Learns From Public Facebook and Instagram Posts
Meta Platforms used public Facebook and Instagram posts to train parts of its new Meta AI virtual assistant, but excluded private posts shared only with family and friends in an effort to respect consumer privacy, the company’s top policy officer told Reuters in an interview.
Meta also did not use private chats on its messaging services as model training data and took steps to filter private data from public data sets used for training, said Meta’s global affairs director Nick Clegg, speaking on the sidelines of the company’s annual Connect conference this week.
“We’ve tried to exclude datasets with a large majority of personal data,” Clegg said, adding that the “vast majority” of data Meta used for training was publicly available.
He cited LinkedIn as an example of a website whose content Meta consciously chose not to use for privacy reasons.
Clegg’s comments come as tech companies such as Meta, OpenAI and Alphabet’s Google have been criticized for using data scraped from the internet without permission to train their AI models, which ingest vast amounts of data in order to summarize information and generate images.
The companies are weighing how to handle private or copyrighted material swept up in that process, which their AI systems may reproduce, while creators accuse them of copyright infringement.
Meta AI was the most significant product among the company’s first consumer AI tools, which CEO Mark Zuckerberg unveiled on Wednesday at Meta’s annual Connect product conference. This year’s event was dominated by talk about artificial intelligence, unlike previous conferences that focused on augmented and virtual reality.
Meta built the assistant with a custom model based on the powerful Llama 2 large language model that the company released for public commercial use in July, as well as a new Emu model that generates images in response to text prompts, it said.
The product can produce text, audio and images, and has access to real-time information through a partnership with Microsoft’s Bing search engine.
The public Facebook and Instagram posts used for Meta AI training included both text and photos, Clegg said.
Those posts were used to train Emu for the image-generation elements of the product, while the chat functionality was based on Llama 2 with some publicly available and annotated datasets added, a Meta spokesperson told Reuters.
Interactions with Meta AI can also be used to improve features in the future, the spokesperson said.
Clegg said Meta placed safety restrictions on what content the Meta AI tool could generate, such as a ban on creating photo-realistic images of public figures.
Regarding copyrighted material, Clegg said he expected a “fair amount of litigation” over “whether or not creative content is covered by the existing fair use doctrine,” which allows limited use of copyrighted works for purposes such as commentary, research and parody.
“We believe it is, but I highly doubt it will go to trial,” Clegg said.
Some companies with image creation tools make it easy to reproduce iconic characters like Mickey Mouse, while others have paid for the materials or deliberately avoided including them in the training data.
For example, this summer OpenAI signed a six-year contract with content provider Shutterstock to use the company’s image, video and music libraries for training.
Asked whether Meta had taken any such steps to avoid reproducing copyrighted images, a Meta spokesperson pointed to new terms of service that bar users from generating content that violates privacy and intellectual property rights.