This month, Fortune.com reported that TikTok’s web scraper — known as Bytespider — is aggressively sucking up content to fuel generative AI models. We noticed the same thing when looking at bot management analytics produced by HAProxy Edge — our global network that we ourselves use to serve traffic for haproxy.com. Some of the numbers we are seeing are fairly shocking, so let’s review the traffic sources and where they originate.
Our own measurements, collected by HAProxy Edge and filtered to traffic for haproxy.com, show a few interesting figures:
Nearly 1% of our total traffic comes from AI crawlers
Close to 90% of that traffic is from Bytespider, by Bytedance (the parent company of TikTok)
Bytespider is currently the most prevalent AI crawler, which makes Bytedance the top source of AI crawler traffic for now, but we have previously observed others (such as ClaudeBot) taking the top spot. AI crawler activity, like all traffic, changes over time.
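Taken together, the two figures above imply that Bytespider alone accounts for a little under 1% of all requests. A minimal sketch of that arithmetic, using the rounded figures as inputs (the function and variable names are illustrative only):

```python
def share_of_total(category_share: float, share_within_category: float) -> float:
    """Fraction of all traffic attributable to one member of a category."""
    return category_share * share_within_category


# Rounded upper bounds taken from the figures above
# ("nearly 1%" of total traffic, "close to 90%" of AI crawler traffic).
ai_crawler_share = 0.01
bytespider_share_of_ai = 0.90

bytespider_share_of_total = share_of_total(ai_crawler_share, bytespider_share_of_ai)
print(f"Bytespider: about {bytespider_share_of_total:.1%} of all requests")
```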
What does AI traffic mean for us – and you?
While we are primarily a technology company, we also consider ourselves to be a content company; we invest in original, human-authored content — such as documentation or blogs that provide helpful information to our users and wider audience.
Content-scraping bots existed long before LLMs started crawling the web for generative AI applications, and they have usually been considered undesirable visitors on content-heavy websites. Many businesses would not consent to the scraping and possible re-use of their content, in full or in part, by a third party.
However, AI crawlers used by LLMs come with unique risks and opportunities.
On one hand, an LLM might re-use the original content in full, or with some modification, or remixed with other content at the level of an LLM token (roughly the level of a single word). It is unlikely that a user will know where the original content came from. In cases where an LLM “hallucinates”, a user might receive inaccurate information, for example when requesting code or configuration instructions.
On the other hand, with many users turning to AI chatbots as an alternative to traditional search engines, this is becoming an important channel for discovery and awareness. Businesses might want their brand or product information to be supplied by chatbots in response to user queries. For example, if a user asks for a list of relevant products, a business might want their product to be included in the list, along with features and benefits.
While we don’t limit AI crawlers on our website right now, we will have to decide whether to continue allowing them. Other businesses running content-heavy public websites will likely face the same decision: to protect the value of their content, or to allow the dissemination of information about their brand and products via these new channels.
What can you do to protect your content from AI crawlers?
If bots and the risk of content replication pose a threat to your business, you need a strategy to mitigate this risk and a technology solution that enables you to implement it.
A common method of disallowing bots is to use the robots.txt file on your website domain. However, some AI crawlers (including Bytespider) don’t identify themselves transparently; they try to pretend to be real users and ignore instructions in robots.txt. It is for this reason that we, like the Fortune.com article, describe the crawling as “aggressive”. It is not only a matter of scale but also the way it is being done.
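For reference, here is roughly what a robots.txt opt-out looks like and how a crawler that chooses to cooperate would evaluate it. The sketch uses Python's standard `urllib.robotparser`; the robots.txt content, the `is_allowed` helper, and the example.com URL are illustrative assumptions, not a production configuration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: ask Bytespider to stay away, allow everyone else.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("Bytespider", "https://example.com/blog/"))
print(is_allowed("SomeOtherBot", "https://example.com/blog/"))
```

The check runs on the crawler's side. A bot that never reads the file, or that announces itself under a different name, is unaffected by it.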
Therefore, any technical solution for managing AI crawlers and scrapers must be capable of accurately identifying such bots, even when they are designed to be hard to distinguish from humans.
HAProxy Enterprise customers already benefit from the HAProxy Enterprise Bot Management Module, announced in version 2.9. This technology combines a simple and efficient method for identifying and classifying bots with HAProxy’s legendary flexibility, to support a range of bot management strategies — such as blocking, rate limiting, or challenging via CAPTCHA.
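To make the three strategies concrete, here is a minimal, product-agnostic sketch in Python. It is not HAProxy Enterprise configuration and does not reflect how the Bot Management Module is implemented; the category labels, the `POLICY` table, and the `SlidingWindowLimiter` class are all hypothetical names chosen for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Optional

# Hypothetical policy table: which action to take for each traffic category.
POLICY = {
    "human": "allow",
    "search_engine": "allow",
    "ai_crawler": "rate_limit",
    "unknown_bot": "challenge",   # e.g. present a CAPTCHA
    "malicious_bot": "block",
}


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> recent request times

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def decide(category: str, client_id: str, limiter: SlidingWindowLimiter,
           now: Optional[float] = None) -> str:
    """Map a classified request to 'allow', 'challenge', or 'block'."""
    action = POLICY.get(category, "challenge")  # unknown categories get challenged
    if action == "rate_limit":
        return "allow" if limiter.allow(client_id, now) else "block"
    return action
```

In this sketch a rate-limited client that exceeds its budget is simply blocked until older requests age out of the window; a real deployment might return an HTTP 429 or slow the response instead.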
Our guide, How to Reliably Block AI Crawlers Using HAProxy Enterprise, shows you how to identify and block these bots (either individually or as a category) using a few lines of configuration on HAProxy Enterprise. Other providers, such as our friends at Cloudflare, recently provided a similar solution.
Where does our data come from, and how do we use it to improve bot management?
Our traffic statistics from HAProxy Edge show that the scale of AI crawler traffic is significant and growing fast. Let’s talk about where our data comes from and how we use it.
HAProxy Edge is a globally distributed application delivery network (ADN) that provides fully managed application services, accelerated content delivery, and a secure partition between external traffic and your network.
By analyzing the traffic connecting to websites and applications hosted on HAProxy Edge (which includes haproxy.com), we can build a picture of global traffic trends. We can also filter these traffic metrics to show AI crawlers. Our bot management technology performs rapid identification and classification of bots (and humans), including identification of known AI crawlers such as:
Bytespider (TikTok)
OpenAI search bot and ChatGPT variants
PerplexityBot
Google AI crawler
ClaudeBot
Others
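As a point of comparison, the simplest way to spot crawlers like these is to match the user-agent tokens they announce and tally the results. The Python sketch below is that naive baseline only: it is a static list, it misses any bot that misreports its user agent, and it is not how our bot management technology classifies traffic. The token list is partial and reflects commonly documented tokens to the best of our knowledge; the function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Self-declared user-agent tokens mapped to a reporting label (partial list).
AI_CRAWLER_TOKENS = {
    "Bytespider": "Bytespider",
    "GPTBot": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "PerplexityBot": "PerplexityBot",
    "ClaudeBot": "ClaudeBot",
}


def classify(user_agent: str) -> Optional[str]:
    """Return a crawler label if the user agent contains a known token."""
    ua = user_agent.lower()
    for token, label in AI_CRAWLER_TOKENS.items():
        if token.lower() in ua:
            return label
    return None


def crawler_shares(user_agents: Iterable[str]) -> Dict[str, float]:
    """Share of recognised AI crawler requests attributed to each label."""
    counts = Counter(
        label for label in map(classify, user_agents) if label is not None
    )
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}
```

Given a list of user-agent strings pulled from access logs, `crawler_shares` returns each crawler's fraction of the recognised AI crawler requests, which is the kind of breakdown behind the "close to 90%" figure above.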
Our data science team uses the threat intelligence data provided by HAProxy Edge to train our security models with machine learning, resulting in extremely accurate and efficient detection algorithms for bots and other threats, without relying on static lists and regex-based attack signatures. We use these algorithms to power the security layers in HAProxy Edge itself, HAProxy Enterprise, and HAProxy Fusion. This includes the HAProxy Enterprise WAF (powered by the Intelligent WAF Engine) and the HAProxy Enterprise Bot Management Module.
For businesses looking for fully managed application services, HAProxy Edge provides bot management and other security features, backed by HAProxy Technologies’ authority on all aspects of the load balancing and traffic control stack. Contact us if you’d like a demo or a trial.