Search engines powered by AI that refuse to pay are unable to index Reddit content

Febin MathewJuly 25, 24 Bing, Google, Reddit, Search Engine

When Reddit announced last month that it would prevent unauthorized data scraping from its platform, many immediately thought of AI. However, now that the new policy is in place, it seems that not only chatbot creators are being affected. In addition to Brave and Google, other major search engines are also being blocked by the popular forum. Google reportedly has a $60 million annual deal with Reddit, but a Reddit spokesperson explained to ReturnByte that the lack of search results is due to Google’s competitors not meeting the company’s AI training requirements. Reddit claims to be in talks with several of these search engines to resolve the issue.

404 Media reported on Wednesday (and ReturnByte confirmed in our polls) that a search for Reddit results from last week on rival engine Bing (using “site:reddit.com”) returns empty results. The publication reported that DuckDuckGo returned seven links with no descriptions, offering only the note: “We’d like to show you a description here, but the site won’t let us.” The engine seems to have removed even those now, as our test only produced a blank page saying “no results found”.

When Reddit said last month it was updating its Robots Exclusion Protocol (robots.txt) to prevent automated data scraping, it’s now clear that it wasn’t just trying to block AI companies like Perplexity and its controversial “answer engine.” Currently, Google appears to be the only search engine that has the right to index Reddit and return results from the “front page of the Internet”.

A Reddit spokesperson told ReturnByte on Wednesday that claims that the lost search results were a result of its Google deal are not true. “We will block any crawler that does not want to commit to not using crawl data for AI training, consistent with our public content policy and enforcement of the updated robots.txt file,” the company said. “Anyone who uses Reddit content must follow our policies, including those in place to protect redditors. We are selective about who we work with and trust a wide range of Reddit content.”

Meanwhile, a source familiar with Reddit’s thinking told ReturnByte on Wednesday that Bing’s omission stems from Microsoft’s refusal to accept Reddit’s AI indexing terms. Instead, the maker of Bing insisted that its standard web controls were sufficient. The source claims that Microsoft’s stance conflicts with Reddit’s privacy policy, leading to dead ends and empty search results.

Ubiquitous robots.txt is a web standard that specifies which parts of a website can be crawled. Although many crawlers are known to ignore its guidelines, Google’s standard procedure is to follow it. So, companies scrambling with a profitable contract on the technical side seem to have implemented some manual overrides.

Saga can be seen as the effect of AI chatbots scraping the live web to get results. Courts are slow to determine how much of the open web is fair use for training chatbots, and companies like Reddit, whose bottom line now depends on securing their data from those who don’t pay, are building walls at the expense of the open web. . (Though given Microsoft’s important role in this AI era, having adapted to OpenAI early on, it seems ironic that Bing should lose at least one component to the bill.)

Colin Hayhurst, CEO of the lesser-known “no-tracking” search engine Mojeek, told 404 Media that Reddit is “killing everything for search except Google.” Additionally, the executive said his attempts to contact Reddit were ignored. “It’s never happened to us before,” he said. “Because this happens to us, we get blocked, usually out of ignorance or stupidity or whatever, and when we contact the site, you’re sure to get it resolved, but we’ve never heard back from anyone before.”

Reddit has made no secret of its desire to prevent AI companies from hijacking its data trove in this growing age of AI. Last year, CEO Steve Huffman risked alienating much of his user base by blocking third-party API requests, leading to the demise of beloved apps like Christian Selig’s Apollo. Despite widespread protests among moderators and forum visitors, the company only temporarily lost an insignificant number of users.

The bet seemed to pay off, and Reddit recovered. It became public in March.