Websites claim that AI startup Anthropic is circumventing their anti-scraping regulations and protocols.
Freelancer.com has accused Anthropic, the AI startup responsible for the Claude large language models, of disregarding its “do not crawl” robots.txt protocol to scrape the website’s data. Additionally, iFixit CEO Kyle Wiens said Anthropic has violated the website’s policy against using its content for AI model training. Freelancer CEO Matt Barrie described Anthropic’s ClaudeBot as the most aggressive scraper the site has seen, with the company’s crawler allegedly generating 3.5 million visits to his website in just four hours. Wiens also reported that Anthropic’s bot hit iFixit’s servers a million times in a 24-hour period, straining the company’s DevOps resources.
In June, Wired accused another AI company, Perplexity, of crawling its website despite the Robots Exclusion Protocol, or robots.txt. A robots.txt file typically contains instructions telling crawlers which pages they can and cannot access. Compliance is voluntary, however, and bad bots have largely ignored it. After Wired’s piece came out, a startup called TollBit, which connects AI companies with content publishers, reported that Perplexity is not the only one bypassing robots.txt signals. Although TollBit did not name names, Business Insider said it had learned that OpenAI and Anthropic were also ignoring the protocol.
Barrie said Freelancer initially tried to refuse the bot’s access requests, but eventually had to block Anthropic’s crawler entirely. “This is egregious scraping [that] slows down the site for everyone who uses it and ultimately affects our revenue,” he added. As for iFixit, Wiens said the website has alarms set for high traffic, and his people were woken up at 3 a.m. by Anthropic’s activity. Anthropic’s crawler stopped hammering iFixit only after the site added a line to its robots.txt file specifically blocking the bot.
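iFixit has not published the exact line it added, but a robots.txt rule blocking a single crawler by its user-agent string is a standard mechanism. A minimal sketch, assuming the bot identifies itself as “ClaudeBot” (the name Barrie cites above), might look like this:

```
# Hypothetical robots.txt rule: deny one specific crawler site-wide
User-agent: ClaudeBot
Disallow: /

# All other crawlers remain unrestricted
User-agent: *
Disallow:
```

As the article notes, though, such rules are only advisory: a crawler that chooses not to honor the Robots Exclusion Protocol can ignore them entirely, which is why Freelancer ultimately resorted to blocking the bot at the network level.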
The AI startup told The Information that it respects the robots.txt file and that its crawler “respected the signal when it was deployed by iFixit.” It also said it seeks to “minimize disruption by considering how quickly [it indexes] the same domains,” which is why it is now investigating the incident.
AI companies use crawlers to gather content from websites for training their generative AI models. This practice has made them the target of several lawsuits, with publishers accusing them of copyright infringement. To head off further litigation, companies like OpenAI have entered into agreements with publishers and websites. OpenAI’s content partners so far include News Corp, Vox Media, the Financial Times and Reddit. iFixit’s Wiens appears open to a similar deal for the articles on his repair guide website, telling Anthropic in a tweet that he is willing to discuss licensing the content for commercial use.
If any of those requests accessed our terms of service, they would have told you that use of our content is expressly forbidden. But don’t ask me, ask Claude!
If you want to have a conversation about licensing our content for commercial use, we’re right here. pic.twitter.com/CAkOQDnLjD
— Kyle Wiens (@kwiens) July 24, 2024