How to block ChatGPT on your website

AI chatbots are on the rise, and they are using my website and its content as training material. I never agreed to them scraping my site's content, nor will they share any of their profits with me.

So, I decided to block them from accessing my website.

Blocking bots from a website

When researching this topic, I stumbled upon the ChatGPT documentation. The authors suggest adding this to the /robots.txt file:

User-agent: GPTBot
Disallow: /

But a robots.txt file is only a friendly recommendation for the ChatGPT bot - there is no guarantee that bots will respect it. In the future, the bot maintainers could decide to ignore robots.txt files, and this solution would become useless.
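To illustrate why this is only advisory: the check has to be performed by the crawler itself, before it requests a page. Here is a minimal sketch of what a well-behaved bot does, using Python's standard urllib.robotparser with the two rules from above passed in as a string. The function name may_fetch and the example.com URLs are placeholders of mine, not part of any real crawler.

```python
from urllib.robotparser import RobotFileParser

# The rules suggested by the ChatGPT documentation (see above).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows user_agent to fetch url.

    This runs on the crawler's side. Nothing on the server enforces it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling may_fetch("GPTBot", "https://example.com/blog/") asks the question a polite crawler would ask. A bot that simply skips this step gets the page anyway.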

The hard way - using HTTP errors

So I decided to block the ChatGPT bot entirely using HTTP error codes. I now check for the User-Agent header that the ChatGPT documentation specifies. If they decide to change or discontinue this User-Agent header, I can find another solution - for example, blocking the suggested IP range.
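The IP range fallback would boil down to a CIDR membership test. A rough sketch with Python's standard ipaddress module follows. Note that the ranges below are placeholders taken from the reserved documentation blocks, not the real crawler addresses, and is_blocked_ip is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks.
# These are NOT the real crawler ranges; replace them with the published ones.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked_ip(remote_addr: str) -> bool:
    """Return True if remote_addr lies inside one of the blocked ranges."""
    address = ipaddress.ip_address(remote_addr)
    # An IPv4 address is never "in" an IPv6 network and vice versa,
    # so mixed lists are safe to check this way.
    return any(address in network for network in BLOCKED_RANGES)
```

In a real setup this check would live in the webserver or firewall configuration. The sketch only shows the logic.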

I created snippets for nginx and Apache httpd webservers to block ChatGPT bots (link to the GitHub repository at the bottom).

To add some irony, I will reject the ChatGPT bots with an HTTP 402 “Payment Required” status code.

nginx webserver

For the nginx webserver, I created a config file that can be included in every location {} block in the nginx config:

if ($http_user_agent ~* ^(.*)GPTBot(.*)$) {
    return 402;
}

This will end the request with a 402 status code if the User-Agent header contains “GPTBot”.
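The ~* operator makes the match case-insensitive, and the (.*) groups on both sides let “GPTBot” appear anywhere in the header. To sanity-check which User-Agent strings would be caught, the same pattern can be tried out in Python. This is only an approximation: nginx uses PCRE, but for a pattern this simple Python's re module should behave the same. The function name nginx_would_block is made up for this sketch.

```python
import re

# Same pattern as in the nginx snippet; re.IGNORECASE plays the role of "~*".
GPTBOT_PATTERN = re.compile(r"^(.*)GPTBot(.*)$", re.IGNORECASE)


def nginx_would_block(user_agent: str) -> bool:
    """Return True if the nginx condition would match this User-Agent."""
    return GPTBOT_PATTERN.search(user_agent) is not None
```

For example, nginx_would_block("Mozilla/5.0 (compatible; GPTBot/1.0)") matches, while a plain browser User-Agent does not.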

At the end of each server {} block in the nginx configuration, we have to include a handler for 402 status codes:

error_page 402 /402.html;

location = /402.html {
    internal;
    root /var/www/aiblock;
}

After that, a custom error page will be served whenever an HTTP status code 402 is generated.

To make things easier, I put both snippets in a distinct config file. They can be included into every location or server block.

Apache httpd webserver

For the Apache httpd webserver, we can add the necessary rules to the .htaccess file inside the document root:

RewriteEngine on
RewriteBase /

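# Default: do not block the request
RewriteRule ^ - [E=CHATGPT_REDIRECT:False]
# Mark the request if the User-Agent header contains "GPTBot" (case-insensitive)
RewriteCond %{HTTP:User-Agent} GPTBot [NC]
RewriteRule ^ - [E=CHATGPT_REDIRECT:True]
# Never block the error page itself, otherwise it could not be served
RewriteCond %{REQUEST_URI} ^.*/402\.html$
RewriteRule ^ - [E=CHATGPT_REDIRECT:False]

# Answer marked requests with 402 Payment Required
RewriteCond %{ENV:CHATGPT_REDIRECT} =True
RewriteRule ^(.*)$ - [R=402,L]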

ErrorDocument 402 /402.html

This snippet sets the variable CHATGPT_REDIRECT to True if the User-Agent header contains “GPTBot”. To avoid redirect loops, the variable is set back to False if the requested page is 402.html.

After that, if the variable is True, the 402 status will be served.
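To verify that either setup works, it is enough to request a page twice with different User-Agent headers and compare the status codes. Here is a small sketch using only the Python standard library. The helper name status_for is my own, and the URL in the usage note is a placeholder for your own site.

```python
import urllib.error
import urllib.request


def status_for(url: str, user_agent: str) -> int:
    """Return the HTTP status code the server sends for the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises an exception for 4xx/5xx responses;
        # the status code is exactly what we want to see here.
        return error.code
```

With the block in place, status_for("https://example.com/", "GPTBot/1.0") should report 402, while the same call with a browser User-Agent should still report 200.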

Links

You can find the code and documentation on how to use “aiblock” in my GitHub repository (MIT license): https://github.com/zalintyre/aiblock

I also created two threads on Mastodon about this topic:

For reference: ChatGPT User-Agent documentation: https://platform.openai.com/docs/gptbot

09.08.2023 Web nginx Apache ChatGPT AI