Reddit highlights another source of revenue beyond ads: Profitable AI agreements
Reddit Inc. said on Thursday, in its highly anticipated initial public offering filing, that artificial intelligence will play a significant role in its business. The move is expected to open up a potentially lucrative yet controversial revenue stream for the company.
San Francisco-based Reddit, a platform that hosts discussions on thousands of different topics, makes most of its money by selling ads that appear alongside social content. In its filing, the 19-year-old company outlined an additional business: selling content to companies that build chatbots like ChatGPT.
Big tech companies like Google and OpenAI are willing to pay large sums for content to improve their large language models, the artificial intelligence software built using large datasets. On Thursday, in addition to its public filing, Reddit announced a deal with Alphabet Inc’s Google that will allow Google’s artificial intelligence products to use Reddit data to improve its technology. Bloomberg had previously reported the existence of a $60 million AI deal.
“Reddit’s vast and unparalleled archive of real, timely and relevant human conversations on literally any topic is an invaluable dataset for a variety of purposes, including search, AI training and research,” Reddit co-founder and CEO Steve Huffman wrote in the filing, which described such deals as an “emerging opportunity” for the company.
In its S-1 filing, Reddit said it entered into license agreements in January worth a combined $203 million with terms ranging from two to three years. The company also said it expected to receive at least $66.4 million from such deals this year.
AI companies enter into licensing agreements to obtain more content for their models. In December, OpenAI signed a contract worth tens of millions of euros with Axel Springer SE, which owns Politico and Business Insider. The stakes of such agreements are high because AI models are often trained on copyrighted material, which muddies ownership claims. For example, the New York Times sued OpenAI in December for copyright infringement.
Training AI models on user-generated data, such as the content Reddit hosts, can also come with risks. Such content is less trustworthy than news articles, AI researchers say. Reddit “is basically a forum where people post anything,” said Giada Pistilli, chief ethics officer at Hugging Face, which makes and hosts AI models. “You find conspiracy theories and all kinds of problematic stuff.”
Os Keyes, a doctoral student at the University of Washington who studies artificial intelligence and data ethics, said Reddit could introduce problematic content into AI systems.
“We’ve already seen that models are prone to hallucinating facts that don’t exist,” Keyes said. They pointed to a notable example in 2013, when Reddit users falsely accused someone of being a suspect in the Boston Marathon bombing. “Things that appear on Reddit are not established facts.”
Reddit said that when partners use its data API, they must stop displaying content that has been removed from the site. The company added that AI companies have previously used Reddit to train models without paying, and that having formal agreements in place will help it take steps such as requiring partners to delete content that was taken down for policy violations.
Reddit has been criticized in the past for its handling of toxic and hateful content posted by users, which is largely moderated by unpaid volunteers. In 2020, some 15 years after the site was founded, Reddit implemented a ban on hate speech. When it comes to moderating problematic content, it’s not always clear where the line is. For example, in 2021, the company announced that it would not ban a subreddit that spread false information related to Covid-19. Days later, after an outcry from several of its own users, Reddit banned that forum, saying it had violated other rules.
The company says that in addition to its moderators, it has internal security teams committed to monitoring its practices through both automation and human review.
If AI models pick up inaccurate content, companies can try to clean it up afterward, Pistilli said, but the process can be difficult. “It’s a lot of effort and a lot of work. A better practice would be to clean your data first,” Pistilli said. “Unfortunately, people prefer quantity over quality.”
It’s still too early to tell how, if at all, Reddit’s unusually vocal user community will react to the licensing move. Last year, thousands of subreddits staged a protest against the company’s decision to raise prices for third-party app developers.