Many websites, including Amazon, Airbnb, and Quora, have blocked OpenAI's web crawler from scraping their content to train its AI chatbot. (AP)

OpenAI’s Web-Crawler Opt-Out Is Proving Ineffective

OpenAI has recently addressed concerns about the data it collects to train ChatGPT, its popular chatbot. In response, the company introduced a feature that lets websites opt out of having their content scraped. By adding a few lines of code, a site can now ask OpenAI’s crawler not to access its pages, and OpenAI says it will comply.
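The mechanism is the long-standing robots.txt convention. OpenAI’s crawler identifies itself as GPTBot, so a site that wants to opt out adds a rule for that user agent to the robots.txt file at its domain root. A minimal example (note that robots.txt is purely advisory; it works only because the crawler chooses to honor it):

```
# robots.txt, served at https://example.com/robots.txt
# Ask OpenAI's crawler (user agent "GPTBot") to stay off the entire site
User-agent: GPTBot
Disallow: /
```

A site can also block only parts of itself, e.g. `Disallow: /articles/`, while leaving the rest crawlable.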

Since then, hundreds of sites have closed their doors. A Google search reveals many of them: major online properties such as Amazon, Airbnb, Glassdoor, and Quora have added the code to their “robots.txt” file, a sort of rules of engagement for the many bots (or spiders, as they are known) that crawl the web.

When I contacted the companies, no one was willing to discuss their reasoning, but it’s pretty obvious: they want to prevent OpenAI from taking content that doesn’t belong to them to train its AI. Unfortunately, it takes a lot more than a line of code to prevent that.

Other online resources with the kind of information an AI system would want have also moved to block the crawler: furniture retailer Ikea, job site Indeed.com, vehicle comparison resource Kelley Blue Book, and BAILII, the UK’s court records system, similar to the US’s PACER (which, by contrast, does not appear to block the bot).

Coding resource Stack Overflow blocks the crawler, but its rival GitHub does not — perhaps unsurprisingly, since GitHub’s owner, Microsoft, is a big investor in OpenAI. And as major media companies begin negotiating with (or potentially suing) the likes of OpenAI for access to their archives, many have also taken steps to block the bot. According to a study reported by Business Insider, 70 of the top 1,000 websites worldwide have added the code. We can expect that number to grow.

Problem solved? Unlikely. While OpenAI may seem generous in letting sites stop its bot from scraping their content, the gesture rings hollow when you consider that the bot has already been collecting that data for some time. The AI horse has long since bolted: adding the code at this point is like shouting “And don’t come back!” at a burglar as they disappear into the night with your belongings.

In fact, the move could strengthen OpenAI’s early lead. By setting this precedent, it can argue that newer competitors should do the same, pulling up the ladder behind it and enjoying its position as one of the first movers in AI. “What’s certain is that OpenAI is not giving back the data it collects,” noted tech worker turned commentator Ben Thompson in a recent edition of his email newsletter.

Of course, web crawlers are just one way OpenAI and other AI companies collect data to train their systems. Recent legal battles between content owners and AI companies have focused on the frequent use by OpenAI, Meta, Google, and others of third-party datasets such as “Books3,” a collection of about 200,000 books compiled by an independent AI researcher. Several authors are suing over its use.

OpenAI declined to comment, including when asked whether sites that blocked its crawler could trust that OpenAI wouldn’t use their data if it were obtained by other means. And blocking the bot certainly won’t change what has already been scraped. We can take some solace in the fact that OpenAI has acknowledged that consent will be a factor in future scraping efforts. But there are hundreds of other bots, released by AI companies less well known than OpenAI, that offer sites no opt-out at all.

Google, which has built a competing chat tool called Bard, wants to start a conversation about the best mechanism for managing consent in AI. But as author Stephen King recently put it, the data is already in the “digital blender” – and it seems there’s very little anyone can do about it now.

More from Bloomberg Opinion:

  • Can Oxford and Cambridge Save Harvard From ChatGPT?: Adrian Wooldridge
  • Too Much Money Is Going to AI Doomers: Parmy Olson
  • Secretive Chatbot Developers Are Making a Big Mistake: Dave Lee

This column does not necessarily reflect the opinion of the editorial board or of Bloomberg LP and its owners.

Dave Lee is a US technology columnist for Bloomberg Opinion. Previously, he was a San Francisco-based correspondent for the Financial Times and BBC News.
