GPTbot is a web crawler that can search and index website content for later use. The New York Times has decided to block it. (Unsplash)

OpenAI’s GPTbot Prohibited from Utilizing New York Times Content for Artificial Intelligence Training

The New York Times was considering legal action against OpenAI, the creator of ChatGPT, after allegations that its AI models were being trained on the newspaper’s copyrighted content. Although no legal action has been taken yet, the prominent news publisher has now opted to block OpenAI’s web crawler from accessing its website. Consequently, OpenAI’s foundation models will no longer be able to use the website’s content for training purposes.

According to a report by The Verge, NYT has blocked OpenAI’s crawler GPTbot from fetching and indexing website content. The report points to the publication’s robots.txt file, which clearly indicates that the bot is banned. Snapshots from the Internet Archive’s Wayback Machine, which stores archived copies of web pages over time, suggest that the bot was blocked on August 17th.

The move comes after OpenAI gave website owners an “opt-out” option to prevent the company from using their site’s content to train its AI models. On August 7, the company explained that GPTbot can be blocked via a site’s robots.txt file. At the same time, it described how crawled content is used in a blog post, saying that web pages crawled by the GPTBot user agent may potentially be used to improve future models, and will be filtered to remove sources that require paywall access, collect personally identifiable information (PII), or contain text that violates its policies.
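Per OpenAI’s published crawler documentation, the opt-out is a standard robots.txt directive. A site that wants to block GPTBot entirely would serve an entry like the following (the `Disallow: /` rule covers the whole site):

```text
User-agent: GPTBot
Disallow: /
```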

The New York Times blocks OpenAI’s crawler

For the uninitiated, a web crawler, also known as a web spider, is basically a computer program that can search and automatically index the content of a website. It traverses the website’s URLs and collects the information it finds along the way. AI companies today use such crawlers extensively to gather training data for their base models. Twitter recently added a temporary tweet rate limit to prevent such crawlers from scraping content on its platform. Likewise, Reddit has introduced a new API policy to discourage crawlers.
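The two behaviors described above, honoring robots.txt rules and discovering links to index, can be sketched with Python’s standard library alone. The helper names below are illustrative, not part of any real crawler:

```python
from html.parser import HTMLParser
from urllib import robotparser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags -- how a crawler discovers new URLs."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Apply robots.txt rules the way well-behaved crawlers are expected to."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# A robots.txt that blocks GPTBot but leaves other agents unrestricted.
robots = "User-agent: GPTBot\nDisallow: /\n"
print(allowed_by_robots(robots, "GPTBot", "https://example.com/article"))   # False
print(allowed_by_robots(robots, "OtherBot", "https://example.com/article")) # True

# Link discovery on a fetched page (HTML inlined here instead of a real request).
page = '<html><body><a href="/a">A</a> <a href="/b">B</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/a', '/b']
```

A real crawler would fetch each discovered link over HTTP and repeat; the sketch keeps everything local to show just the decision logic.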

However, OpenAI is one of the few AI companies that offers a direct and simple way to exclude GPTbot.

Last week, a report by NPR revealed that NYT may end up suing the makers of ChatGPT after the two sides failed to reach a licensing agreement under which OpenAI would pay the publisher an agreed amount for using its articles to train AI models.
