Forum Discussion
Hi,
A dual-policy setup is justified in some cases, but here you can simplify your management effort and go with one. When it comes to creating exceptions for "good bots", you can identify them by their User-Agent header.
For example, this list covers the top well-known "good bots": https://www.keycdn.com/blog/web-crawlers/
iRule logic for allowing those specific bots, without globally disabling the violation itself, works as summarized in the five steps below. This won't do rate throttling, but it will help you distinguish well-known bots from the rest.
- Create a list of good-bot user-agents in a string-type LTM data group (see the tmsh example after these steps).
- Run a check against the client's User-Agent in the HTTP_REQUEST event. If it's one of the good bots (it matches any value in your LTM data group), set a variable that you can refer to later in ASM-related iRule events (e.g. set goodBot 1).
- Catch the occurrence of that bot violation with a simple IF condition (you must enable iRule events in the ASM policy settings).
- Check that only one violation was triggered (count the violations) so that nobody can exploit this exception and bypass your ASM WAF simply by presenting something like User-Agent: GoogleBot as a request header.
- When only that specific violation is triggered, run a check against the variable you set in step 2. If there's a match and you know the bot violation occurred for a good bot, disable ASM blocking with the ASM::unblock command. If there's no match, do nothing.
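For step 1, the data group can be created with tmsh; a minimal example, assuming a placeholder name of good_bots_dg (entries are lowercase because the sketch further down lowercases the User-Agent before matching):

tmsh create ltm data-group internal good_bots_dg type string records add { googlebot { } bingbot { } slurp { } duckduckbot { } }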
If you have your own bot/crawler, just make up a User-Agent header for it if it doesn't already have one and add that to the data group. I can't test anything at this hour, so please treat the sketch below as a completely untested starting point rather than working code.
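A rough sketch of the iRule itself (the violation name is a placeholder you should verify against your own ASM event logs, and the data-group name matches the placeholder above):

when HTTP_REQUEST {
    # Step 2: flag requests whose User-Agent matches an entry in the good-bot data group
    set goodBot 0
    if { [class match [string tolower [HTTP::header "User-Agent"]] contains good_bots_dg] } {
        set goodBot 1
    }
}

when ASM_REQUEST_DONE {
    # Steps 3-5: act only when exactly one violation fired, it is the bot/scraping
    # violation, and the request was flagged as coming from a known good bot
    if { [ASM::violation count] == 1 } {
        # Placeholder violation name -- check your event logs for the exact string
        if { [lindex [ASM::violation names] 0] eq "VIOLATION_WEB_SCRAPING_DETECTED" } {
            if { [info exists goodBot] && $goodBot == 1 } {
                ASM::unblock
            }
        }
    }
}

Remember that the ASM iRule events only fire when you enable them in the security policy, and ASM::unblock only overrides blocking for that one request, so everything else keeps behaving normally.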
Hope this will get you started!
Hi Hannes,
I really appreciate your reply. I like the idea of checking the user-agent and checking for just 1 violation to prevent workarounds. I'll have to check the logs and see what user-agent our bots are using.
The main reason I was thinking about doing it this way is that the apps getting scraped are behind an auth wall, so bots have to present creds to get in; we aren't getting Google, Bing, etc. The good bots are used by companies acting as agents or intermediaries to get data on behalf of groups of real users, and we are required to allow them, or else the real users who hire these companies can claim we deny access to their own data. The problem is when the bots scrape too fast, hence the daytime and nighttime rate ceilings. We have rate requirements that the bot handlers are aware of but don't always respect, so we need a technical control.
I'll check the user-agents. Assuming the bots DON'T use unique user-agents, does it look like my idea will work? Any obvious alternatives? I'm in a bit of a time crunch and also don't have full admin rights in the ASM (we use a delegated admin model), so trial and error will take some time that I don't have much of.
One other question: will this iRule fire before ASM processes any security policy, so I can choose whether one of the two custom policies gets applied and have the default used otherwise? Do I need to make any config changes to achieve this? I read about the "Trigger ASM iRule Events" setting in a security policy, but I want this iRule to run before the default security policy if possible. I am still very new to ASM, so I don't know whether this is feasible or what the accepted way of selecting a security policy is.
I really appreciate your help! Please feel free to recommend reading material in lieu of an answer if that seems appropriate.
Nathan