How to Block AI Chatbots From Scraping Your Website's Content – MUO – MakeUseOf
Concerned about AI chatbots scraping your website for content? Fortunately, you can block them from doing so. Here’s how.
As things stand, AI chatbots have a free license to scrape your website and use its content without your permission. Concerned about your content being scraped by such tools?
The good news is, you can stop AI tools from accessing your website, but there are some caveats. Here, we show you how to block the bots using the robots.txt file for your website, plus the pros and cons of doing so.
AI chatbots are trained using multiple datasets, some of which are open-source and publicly available. For example, GPT-3 was trained using five datasets, according to a research paper published by OpenAI: Common Crawl, WebText2, Books1, Books2, and Wikipedia.
Common Crawl includes petabytes (thousands of TBs) of data from websites collected since 2008, gathered in much the same way that Google’s search crawlers index web content. WebText2 is a dataset created by OpenAI, containing roughly 45 million web pages linked to from Reddit posts with at least three upvotes.
So, in the case of ChatGPT, the AI bot isn’t accessing and crawling your web pages directly, at least not yet. However, OpenAI’s announcement of a ChatGPT-hosted web browser has raised concerns that this could be about to change.
In the meantime, website owners should keep an eye on other AI chatbots, as more of them hit the market. Bard is the other big name in the field, and very little is known about the datasets being used to train it. Obviously, we know Google’s search bots are constantly crawling web pages, but this doesn’t necessarily mean Bard has access to the same data.
The biggest concern for website owners is that AI bots like ChatGPT, Bard, and Bing Chat devalue their content. AI bots use existing content to generate their responses, but also reduce the need for users to access the original source. Instead of users visiting websites to access information, they can simply get Google or Bing to generate a summary of the information they need.
When it comes to AI chatbots in search, the big concern for website owners is losing traffic. In the case of Bard, the AI bot rarely includes citations in its generative responses that tell users which pages it gets its information from.
So, aside from replacing website visits with AI responses, Bard removes almost any chance of the source website receiving traffic, even if the user wants more information. Bing Chat, on the other hand, more commonly links to information sources.
In other words, the current fleet of generative AI tools is using the work of content creators to systematically replace the need for content creators. Ultimately, you have to ask what incentive this leaves website owners to continue publishing content. And, by extension, what happens to AI bots when websites stop publishing the content they rely upon to function?
If you don’t want AI bots using your web content, you can block them from accessing your site using the robots.txt file. Unfortunately, you have to block each individual bot and specify them by name.
For example, Common Crawl’s bot is called CCBot and you can block it by adding the following code to your robots.txt file:
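A minimal version of that rule, assuming you want to block the whole site rather than specific directories:

```
User-agent: CCBot
Disallow: /
```

The `Disallow: /` line covers every path on the site. To block only part of it, you would replace `/` with the directory you want to keep off limits.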
This will block Common Crawl from crawling your website in the future, but it won’t remove any data already collected from previous crawls.
If you’re worried about ChatGPT’s new plugins accessing your web content, OpenAI has already published instructions for blocking its bot. In this case, ChatGPT’s bot is called ChatGPT-User and you can block it by adding the following code to your robots.txt file:
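A minimal version of that rule, again assuming a site-wide block:

```
User-agent: ChatGPT-User
Disallow: /
```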
Blocking search engine AI bots from crawling your content is another problem entirely, though. As Google is highly secretive about the training data it uses, it’s impossible to identify which bots you’ll need to block and whether they’ll even respect commands in your robots.txt file (many crawlers don’t).
Blocking AI bots in your robots.txt file is the most effective method currently available, but it’s not particularly reliable.
The first problem is that you have to specify each bot you want to block, but who can keep track of every AI bot hitting the market? The next issue is that commands in your robots.txt file are non-compulsory instructions. While Common Crawl, ChatGPT, and many other bots respect these commands, many bots don’t.
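Because every bot has to be named, a robots.txt file that blocks several of them ends up with one group per user agent. Here is a sketch using only the two bots named in this article. Any others you add would need their own published user-agent names, which you would have to look up yourself:

```
# Block Common Crawl's crawler
User-agent: CCBot
Disallow: /

# Block ChatGPT's plugin agent
User-agent: ChatGPT-User
Disallow: /
```

Crawlers that match neither group are unaffected by these rules, so regular search engine bots can still crawl the site.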
The other big caveat is that you can only block AI bots from performing future crawls. You can’t remove data from previous crawls or send requests to companies like OpenAI to erase all of your data.
Unfortunately, there’s no simple way to block all AI bots from accessing your website, and manually blocking each individual bot is almost impossible. Even if you keep up with the latest AI bots roaming the web, there’s no guarantee they’ll all adhere to the commands in your robots.txt file.
The real question here is whether the results are worth the effort, and the short answer is (almost certainly) no.
There are potential downsides to blocking AI bots from your website, too. Most of all, you won’t be able to collect meaningful data to prove whether tools like Bard are benefiting or harming your search marketing strategy.
Yes, you can assume that a lack of citations is harmful, but you’re only guessing if you lack the data because you blocked AI bots from accessing your content. It was a similar story when Google first introduced featured snippets to Search.
For relevant queries, Google shows a snippet of content from web pages on the results page, answering the user’s question. This means users don’t need to click through to a website to get the answer they’re looking for. This caused panic among website owners and SEO experts who rely on generating traffic from search queries.
However, the kinds of queries that trigger featured snippets are generally low-value searches like “what is X” or “what’s the weather like in New York”. Anyone who wants in-depth information or a comprehensive weather report is still going to click through, and those who don’t were never all that valuable in the first place.
You might find it's a similar story with generative AI tools, but you'll need the data to prove it.
Website owners and publishers are understandably concerned about AI technology and frustrated by the idea of bots using their content to generate instant responses. However, this isn’t the time for rushing into counteroffensive moves. AI technology is a fast-moving field, and things will continue to evolve at a rapid pace. Take this opportunity to see how things play out and analyze the potential threats and opportunities AI brings to the table.
The current system of relying on content creators’ work to replace them isn’t sustainable. Whether companies like Google and OpenAI change their approach or governments introduce new regulations, something has to give. At the same time, the negative implications of AI chatbots on content creation are becoming increasingly apparent, which website owners and content creators can use to their advantage.