AI chatbots are on the rise, and they are using my website and its content as training material. I never agreed to them scraping my site's content, nor will they share any of their profits with me.
So, I decided to block them from accessing my website.
While researching this topic, I came across the ChatGPT documentation. The authors suggest adding this to the /robots.txt file:
User-agent: GPTBot
Disallow: /
But a robots.txt file is only a polite request to the ChatGPT bot: nothing guarantees that bots will respect it. If the bot maintainers decide to ignore robots.txt files in the future, this solution becomes useless.
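To illustrate why the rule is only advisory, here is a minimal Python sketch of the check a well-behaved crawler would run itself, using urllib.robotparser from the standard library. The helper name may_fetch and the example URL are placeholders I made up; nothing forces a crawler to perform this check at all.

```python
from urllib.robotparser import RobotFileParser

# The same two lines suggested for /robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

The decision is made entirely on the crawler's side, which is why I did not want to rely on it.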
So I decided to block the ChatGPT bot entirely using HTTP error codes. I now check for the User-Agent header that the ChatGPT documentation specifies. If they change or discontinue that User-Agent header, I can find another solution, for example blocking the documented IP range.
I created snippets for nginx and Apache httpd webservers to block ChatGPT bots (link to the GitHub repository at the bottom).
To add some irony, I will reject the ChatGPT bots with an HTTP 402 “Payment Required” status code.
For the nginx webserver, I created a config file that can be included in every location {} block in the nginx config:
if ($http_user_agent ~* ^(.*)GPTBot(.*)$) {
    return 402;
}
This will end the request with a 402 status code if the User-Agent header contains “GPTBot”.
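To make the matching rule concrete, here is a small Python sketch that applies the same pattern case-insensitively, which is what the ~* operator does in nginx. The function name is hypothetical, and Python's re module is only a stand-in for the regex engine nginx uses, so treat it as an approximation.

```python
import re

# The pattern from the nginx snippet; re.IGNORECASE plays the role of "~*".
ORIGINAL = re.compile(r"^(.*)GPTBot(.*)$", re.IGNORECASE)
# A shorter, unanchored pattern that should match the same header values.
SHORT = re.compile(r"GPTBot", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if this header value would trigger the 402 response."""
    return ORIGINAL.search(user_agent) is not None


def patterns_agree(user_agent: str) -> bool:
    """Return True if both patterns reach the same decision."""
    return is_blocked(user_agent) == (SHORT.search(user_agent) is not None)
```

For single-line header values both patterns should decide identically, so the leading and trailing (.*) groups are optional.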
At the end of each server {} block in the nginx configuration, we have to include a handler for 402 status codes:
error_page 402 /402.html;
location = /402.html {
    internal;
    root /var/www/aiblock;
}
After that, a custom error page will be served whenever an HTTP status code 402 is generated.
To make things easier, I put both snippets in a distinct config file. They can be included in any location or server block.
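To check that the block works after reloading nginx, a request with a GPTBot-like User-Agent should come back with status 402 while a normal browser string should not. Here is a minimal sketch using only the Python standard library; the function name is my own invention for this example.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Request url with the given User-Agent and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the status is on the exception.
        return error.code
```

Against a site with the snippets installed, fetch_status(site, "GPTBot/1.0") should return 402, and the same call with a regular browser User-Agent should return whatever the page normally returns, where site is the address of your own website.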
For the Apache httpd webserver, we can put the necessary rules into the .htaccess file inside the document root:
RewriteEngine on
RewriteBase /
# Start with the flag cleared for every request
RewriteRule ^ - [E=CHATGPT_REDIRECT:False]
# Set the flag if the User-Agent contains "GPTBot" (case-insensitive)
RewriteCond %{HTTP:User-Agent} GPTBot [NC]
RewriteRule ^ - [E=CHATGPT_REDIRECT:True]
# Clear the flag again for the error page itself
RewriteCond %{REQUEST_URI} ^.*/402\.html$
RewriteRule ^ - [E=CHATGPT_REDIRECT:False]
# Answer flagged requests with 402
RewriteCond %{ENV:CHATGPT_REDIRECT} =True
RewriteRule ^(.*)$ - [R=402,L]
ErrorDocument 402 /402.html
This snippet sets the variable CHATGPT_REDIRECT to True if the User-Agent header contains “GPTBot”.
To avoid redirect loops, the variable is reset to False if the requested page is 402.html.
After that, if the variable is True, the 402 status is served.
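The order of these three steps matters, so here is the same decision logic as a simplified Python sketch. The function name is hypothetical, and returning 200 for the non-blocked case merely stands in for "handle the request normally".

```python
def apache_status(user_agent: str, request_uri: str) -> int:
    """Mirror the flag logic of the .htaccess rules in simplified form."""
    blocked = False  # the flag starts out cleared for every request

    # Step 1: flag requests whose User-Agent contains "GPTBot" (any case).
    if "gptbot" in user_agent.lower():
        blocked = True

    # Step 2: never block the error page itself, otherwise serving
    # /402.html would trigger the rule again and loop.
    if request_uri.endswith("/402.html"):
        blocked = False

    # Step 3: flagged requests get 402, everything else proceeds normally.
    return 402 if blocked else 200
```

Without step 2, the internal request for the error document would be flagged as well, because it carries the same User-Agent header.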
You can find the code and documentation on how to use “aiblock” in my GitHub repository (MIT license): https://github.com/zalintyre/aiblock
I also created two threads on Mastodon about this topic.
For reference: ChatGPT User-Agent documentation: https://platform.openai.com/docs/gptbot