How to Reliably Block AI Crawlers Using HAProxy Enterprise

The robots.txt file is a time-honored point of control that lets website publishers declare whether their sites should be crawled by bots of various kinds. However, it turns out that AI crawlers from large language model (LLM) companies often ignore the contents of robots.txt and crawl your site regardless.
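
For reference, a typical robots.txt rule aimed at an AI crawler looks like the following (GPTBot is OpenAI's published crawler token). The catch is that honoring it is entirely voluntary on the crawler's part:

User-agent: GPTBot
Disallow: /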

Now you may indeed want your site crawled by some or all AI crawlers, for reasons that may include:

  • Wanting LLMs to share accurate information provided on your site

  • Wanting to support AI and LLMs in general

However, you may also want to block some or all AI crawlers, for reasons that may include: 

  • The desire to have LLMs pay you for access to the content on your site, as some news sites have recently arranged with specific LLM providers

  • Overall concern about bots hurting your site’s performance

  • Not wanting to accommodate AI and LLMs in general

Whatever your reasoning, you may want to block these crawlers—but to do so, you'll need to take additional steps. HAProxy Enterprise enables you to do this, as we describe in this article. 

HAProxy Enterprise provides you with several advantages when you block crawlers from any source, including LLMs. HAProxy Enterprise sends zero traffic to third parties for classification of bots. All the work happens within your own systems, so you can block the crawlers you want to block without incurring extra latency, all while avoiding unnecessary compliance and security concerns.

Understanding bot management within HAProxy Enterprise

HAProxy Enterprise includes the powerful HAProxy Enterprise Bot Management Module, which provides fast, reliable, and flexible identification and categorization of bots attempting to access websites or applications. It also helps make routing decisions for bot traffic and various crawlers—whether they announce themselves or not. Let’s explore how our bot management features can block AI bots from accessing your site.

You might be wondering why we wouldn't just detect User-Agent strings. The User-Agent header is one of the easiest things to fake in a request. Instead, our Bot Management Module uses multiple techniques to verify that the specific User-Agent string in each request is authentic. That’s why we'll refer to “verified” categories of bots below.
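
To illustrate the problem, here's a naive sketch (not something we recommend) that matches only the self-reported User-Agent header. Any client can send whatever User-Agent it likes, so a crawler that doesn't announce itself sails right through:

# Naive approach: trusts the self-reported User-Agent header.
# A crawler can bypass this simply by changing its User-Agent.
acl ua_claims_ai req.hdr(User-Agent) -m sub GPTBot ClaudeBot
http-request deny deny_status 403 if ua_claims_ai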

Installing the HAProxy Enterprise Bot Management Module

Our bot management documentation (login required) offers step-by-step instructions for basic module installation. Note that you'll need either an HAProxy Enterprise trial or an active subscription to access these resources. However, we'll share some important details here to help demonstrate how the HAProxy Enterprise Bot Management Module works.

Generally, we recommend first installing the latest version of the HAProxy Enterprise Bot Management Module and downloading the latest version of the corresponding data file that powers bot management detection. Next, you'll add a basic scoring configuration to a single frontend (or to all frontends), as shown below:

global
    # Load the Bot Management Module and point it at its data file
    module-load hapee-lb-botmgmt.so
    botmgmt-data-file /opt/hapee-2.9/data-hapee

frontend www
    mode http
    bind :443 ssl crt /path/to/ssl.pem
    # Enable bot management scoring on this frontend
    filter botmgmt

From now on, each request will carry bot management data telling you how your traffic was scored. Let’s look at blocking AI bots next.
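
If you'd like to see that data while you tune your setup, one option is to capture the module's variable in your access logs. Here's a minimal sketch; note that txn.bot_mgmt_score is a placeholder, and the exact variable names exposed by your module version are listed in the bot management documentation:

# Capture the bot management score in the access log.
# The variable name below is assumed; check the module docs.
http-request capture var(txn.bot_mgmt_score) len 10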

Blocking AI bots

The HAProxy Enterprise Bot Management Module gives you a lot of information, including:

  • Bot management scores

  • Detection information for verified crawlers such as Googlebot

  • Detection information for verified bots such as an AI crawler

We can simply use a verified bot category to detect when an AI crawler accesses your site. Add the following line of configuration to your frontend section, immediately after the filter botmgmt line:

acl is_ai_bot var(txn.verified_bot_category) -m str AI-crawler

The acl is_ai_bot will be true if we've detected an AI crawler. To finish up, you can now add an extra line to deny this traffic:

http-request deny deny_status 403 if is_ai_bot
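
Denying with a 403 isn't your only option. As a sketch of one alternative, you could send verified AI crawlers to a dedicated backend instead, for example one that serves cached or rate-limited content (be_ai_crawlers is a backend name we've invented for illustration):

# Route verified AI crawlers to a separate backend instead of denying them
use_backend be_ai_crawlers if is_ai_bot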

Denying individual AI crawlers

A universal blocking strategy for all crawlers isn't always necessary. For example, maybe you want to generally allow AI crawlers but specifically deny traffic from ClaudeBot. You can do this by adding two lines to your configuration:

acl is_ai_claude var(txn.verified_bot) -m str ClaudeBot
http-request deny deny_status 403 if is_ai_claude

In the same way, you can use an ACL to detect and block ChatGPT-User and other similar bots.
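
Since an ACL accepts multiple patterns, you can also maintain a single deny list. Here's a minimal sketch, assuming GPTBot and ChatGPT-User are among the verified bot names your module version reports:

# Deny a curated list of AI crawlers with one rule
acl is_blocked_ai_bot var(txn.verified_bot) -m str ClaudeBot GPTBot ChatGPT-User
http-request deny deny_status 403 if is_blocked_ai_bot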

Reviewing our complete configuration

We've walked through the configuration components behind bot management in HAProxy Enterprise and shown how they form a simple yet capable blocking strategy for AI crawlers. Here's the full configuration example as it appears within your file:

global
    module-load hapee-lb-botmgmt.so
    botmgmt-data-file /opt/hapee-2.9/data-hapee

frontend www
    mode http
    bind :443 ssl crt /path/to/ssl.pem
    filter botmgmt
    acl is_ai_bot var(txn.verified_bot_category) -m str AI-crawler
    http-request deny deny_status 403 if is_ai_bot

Tackle tomorrow's bots and crawlers today

Using the HAProxy Enterprise Bot Management Module, you can easily block traffic from bots, verified crawlers, and/or AI crawlers. We've outlined how just a few lines of configuration in HAProxy Enterprise can noticeably improve your overall bot management strategy while safeguarding your content and application infrastructure against modern threats, including unwanted AI crawlers. This kind of targeted blocking will only become more important as LLMs grow more popular and plentiful.

To learn more about the HAProxy Enterprise Bot Management Module and HAProxy Enterprise's built-in security features, check out our security solution and the HAProxy Enterprise datasheet.
