Mitigate AI Scraping bots using F5 Distributed Cloud Web Application Firewall
Introduction
Web scraping isn’t a new phenomenon, but the advent of AI and the use of scraped data to train AI models has elevated the concerns associated with it. Previously, generative AI models could only rely on the data they were trained or fine-tuned on. With recent advances such as RAG, AI agents, MCP, and others, LLMs can now fetch dynamic content directly from websites.
The IETF has already published a standard way to control data access for generic bots using directives like Disallow. Bots conforming to RFC 9309 are expected to follow these rules by consulting the robots.txt file in the site's root folder. To deal specifically with AI scraping, a new draft (draft-canel-robots-ai-control-00) was proposed to control data access for automated AI crawlers. It adds two more directives: DisallowAITraining, which prohibits the use of data for AI training, and AllowAITraining, which permits it.
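For illustration, a robots.txt that carries both a standard rule and the draft AI-training directive could look like the sketch below; the path values are placeholders, and the directive syntax is assumed to mirror the standard Disallow form:

```
User-agent: *
Disallow: /private/
DisallowAITraining: /
```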
Mitigation Solutions
Let's look at some of the mitigation options available in F5 Distributed Cloud Web Application Firewall (F5 XC WAF). For this demo, we will focus on the five solutions below and dive into each one in turn.
First Solution
This solution focuses on controlling bots that follow the new draft (draft-canel-robots-ai-control-00). For this demo, we are using the Python Scrapy library, which lets us customize the crawler's behavior.
- Open the Scrapy settings file and update fields such as the user agent and whether to obey robots rules (see the sketch after this step). Scrapy uses Protego as its default robots.txt parser, and by extending Protego with the new directives, we can direct Scrapy to follow the new AI training rules.
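A minimal sketch of the relevant settings.py fields might look like this (the user agent string is a placeholder; ROBOTSTXT_OBEY and ROBOTSTXT_PARSER are standard Scrapy settings):

```python
# settings.py (excerpt)
USER_AGENT = "my-ai-scraper/1.0"   # placeholder User-Agent for the demo

ROBOTSTXT_OBEY = False             # set to True to make Scrapy honor robots.txt rules
ROBOTSTXT_PARSER = "scrapy.robotstxt.ProtegoRobotParser"   # Protego, Scrapy's default parser
```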
Below is a sample script we have written to extract the title and response data from a web application page using Scrapy.
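The original script is shown as an image, so the snippet below is only a minimal sketch along the same lines; the spider name and target URL are placeholders rather than values from the demo:

```python
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title_spider"                    # placeholder spider name
    start_urls = ["https://example.com/"]    # placeholder target URL

    def parse(self, response):
        # Extract the page title along with some basic response data
        yield {
            "title": response.css("title::text").get(),
            "status": response.status,
            "body_length": len(response.body),
        }
```

From a Scrapy project, a spider like this can be run with scrapy crawl title_spider.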
Let's assume you want F5 Distributed Cloud WAF to serve the robots.txt file itself. To enable F5 XC to respond with robots.txt, we will create a Direct Response Route, which lets us specify the exact response F5 XC sends back to the client. In this case, the response body will contain the robots.txt content, including the new DisallowAITraining directive.
So let's configure the load balancer routes as shown below.
Note that the first route returns the robots.txt content (allowing or disallowing AI training) as the direct response body, while the second route serves all other web resources.
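With the routes in place, a quick sanity check is to fetch robots.txt from the load balancer; a minimal sketch using the requests library, with a placeholder hostname:

```python
import requests

# Placeholder hostname for the domain fronted by the F5 XC load balancer
resp = requests.get("https://example.com/robots.txt", timeout=10)
print(resp.status_code)
print(resp.text)   # should show the robots.txt body configured in the direct response route
```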
Next, we will run the crawler with its default setting of not obeying the robots file, and we can see that the website content is accessible.
If we set the crawler to obey robots directives and rerun it, we can see the request is restricted. We can also permit scraping for AI training purposes by changing the DisallowAITraining directive to AllowAITraining in the direct response body.
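One way to script the two runs, assuming the spider sketched earlier and a standard Scrapy project layout, is to toggle ROBOTSTXT_OBEY programmatically:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings and enforce robots.txt rules for this run;
# set the value to False to reproduce the default, non-obeying run
settings = get_project_settings()
settings.set("ROBOTSTXT_OBEY", True)

process = CrawlerProcess(settings)
process.crawl("title_spider")   # placeholder spider name from the sketch above
process.start()
```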
Second Solution
Crawlers that do not support the new AI directives but still obey RFC 9309 can be mitigated using the standard Disallow directive in the response.
Modify the Direct Response Route in the load balancer configuration as shown above and rerun the script.
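In this case the direct response body only needs the standard directives; for example, to disallow all crawling for every user agent:

```
User-agent: *
Disallow: /
```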
We can validate that the request either downloads the content or is blocked, depending on the robots obey setting.
Third Solution
This solution focuses on bots and scraping tools that don't obey robots directives at all. For example, let's assume we know the User-Agent header of a specific scraping tool and want to block only that tool. To test this scenario, update the user-agent field in the Scrapy settings.
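For example, the USER_AGENT field from the settings sketch above can carry a distinctive value identifying the tool (the value below is a placeholder), and that same value is then used in the blocking rule:

```python
# settings.py (excerpt) - placeholder User-Agent identifying the scraping tool
USER_AGENT = "my-ai-scraper/1.0"
```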
We can configure F5 XC with a client blocking rule for this specific User-Agent header value as shown below.
Rerunning the script after these changes shows that the request is blocked, and the F5 XC security logs confirm it was blocked by the client blocking rule.
Fourth Solution
This solution covers blocking entire bot categories using the F5 XC WAF, which can block all suspicious bots. First, we will create a WAF policy with the suspicious bot action set to Block and enable it on the load balancer as shown below.
If we rerun the same script, the request is blocked by XC WAF. Security logs show the request was blocked because it was identified as coming from a suspicious bot.
Fifth Solution
We can also create a service policy that combines the TLS fingerprint with the User-Agent header to block bots that masquerade as legitimate clients by spoofing their User-Agent header.
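To illustrate the masquerading scenario, the crawler can present a browser-like User-Agent while the TLS fingerprint produced by its underlying TLS stack stays the same; the string below is a generic browser value used purely for illustration:

```python
# settings.py (excerpt) - the bot pretends to be a regular browser,
# but its TLS fingerprint still identifies the underlying client
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
```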
First, let's go through the Security Analytics logs, examine the JSON view, and copy the TLS fingerprint of the bot that needs to be blocked, as shown below.
Next, we will configure a service policy on the load balancer with two rules: one to block requests matching the fingerprint and a second to allow all other requests.
If needed, we can also match on a combination of User-Agent values, paths, and so on.
We will save the changes and rerun the script. The request is blocked, and the F5 XC event logs show it was blocked by our service policy.
Conclusion
In this article, we introduced AI scraping bots and then showed different ways to block them using F5 Distributed Cloud WAF.