More Web Scraping - Bot Detection
In my last article, I discussed the issue of web scraping and why it could be a problem for many individuals and/or companies. In this article, we will dive into some of the technical details regarding bots and how the BIG-IP Application Security Manager (ASM) can detect them and block them from scraping your website.
What Is A Bot?
A bot is a software application that runs automated tasks and typically performs these tasks much faster than a human possibly could. In the context of web scraping, bots are used to extract data from websites, parse the data, and assemble it into a structured format where it can be presented in a useful form. Bots can perform many other actions as well, like submitting forms, setting up schedules, and connecting to databases. They can also do fun things like add friends to social networking sites like Twitter, Facebook, Google+, and others. A quick Internet search will show that many different bot tools are readily available for download free of charge.
We won't go into the specifics of each vendor's bot application, but it's important to understand that they are out there and are very easy to use.
Bot Detection
So, now that we know what a bot is and what it does, how can we distinguish between malicious bot activity and harmless human activity? Well, the ASM is configured to check for some very specific activities that help it determine if the client source is a bot or a human. By the way, it's important to note that the ASM can accurately detect a human user only if clients have JavaScript enabled and support cookies. There are three different settings in the ASM for bot detection: Off, Alarm, and Alarm and Block. Obviously, if bot detection is set to "Off" then the ASM does not check for bot activity at all. The "Alarm" setting will detect bot activity and record attack data, but it will allow the client to continue accessing the website. The "Alarm and Block" setting will detect bot activity, record the attack data, and block the suspicious requests. These settings are shown in the screenshot below.
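To make the three modes a little more concrete, here's a minimal Python sketch of how they might translate into request handling. This is purely illustrative; the names (BotDetectionMode, handle_suspected_bot) are hypothetical and are not part of the ASM itself.

```python
from enum import Enum

# Illustrative sketch of the three bot detection modes described above.
# Names are hypothetical and not part of the ASM product.
class BotDetectionMode(Enum):
    OFF = "off"
    ALARM = "alarm"
    ALARM_AND_BLOCK = "alarm and block"

def handle_suspected_bot(mode, request, attack_log):
    """Apply the configured bot detection action to a suspicious request."""
    if mode is BotDetectionMode.OFF:
        return "allow"                                      # no bot checking at all
    attack_log.append(("web scraping detected", request))   # record the attack data
    if mode is BotDetectionMode.ALARM:
        return "allow"                                      # alarm only: request still served
    return "block"                                          # alarm and block: request rejected
```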
Once you apply the setting for bot detection, you can then tune the ASM to begin checking for bots that are accessing your website. The bot detection utilizes four different techniques to detect and defend against bot activity. These include Rapid Surfing, Grace Interval, Unsafe Interval, and Safe Interval.
Rapid Surfing detects bot activity by measuring the client's page consumption speed. A page change is counted from the page load event to its unload event. The ASM configuration allows you to set a maximum number of page changes for a given time period (measured in milliseconds). If the page changes more than the maximum allowable number of times within that interval, the ASM declares the client a bot and performs the action set for bot detection (Off, Alarm, or Alarm and Block). The default setting for Rapid Surfing is 5 page changes per second (1000 milliseconds).
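To illustrate the idea (not the ASM's actual implementation), here is a small Python sketch of a Rapid Surfing style check that counts page changes inside a sliding window, using the default of 5 changes per 1000 milliseconds. The class and method names are hypothetical.

```python
import time
from collections import deque

# Sketch of a Rapid Surfing style check: flag a client as a bot if it completes
# more than `max_changes` page changes within `interval_ms` milliseconds.
class RapidSurfingDetector:
    def __init__(self, max_changes=5, interval_ms=1000):
        self.max_changes = max_changes
        self.interval_ms = interval_ms
        self.changes = deque()  # timestamps (ms) of recent page changes

    def record_page_change(self, now_ms=None):
        """Record one load->unload page change; return True if the limit is exceeded."""
        now_ms = now_ms if now_ms is not None else time.time() * 1000
        self.changes.append(now_ms)
        # Drop page changes that have fallen outside the sliding window.
        while self.changes and now_ms - self.changes[0] > self.interval_ms:
            self.changes.popleft()
        return len(self.changes) > self.max_changes  # True => treat the client as a bot
```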
The Grace Interval setting specifies the maximum number of page requests the system reviews while it tries to determine whether the client is a human or a bot. As soon as the system makes that determination, it ends the Grace Interval and stops checking for bots. The default setting for the Grace Interval is 100 requests.
Once the system determines that the client is valid, it does not check subsequent requests for the number of requests specified in the Safe Interval setting. This allows normal client activity to continue, since the ASM has already determined (during the Grace Interval) that the client is safe. Once the number of requests sent by the client reaches the value specified in the Safe Interval setting, the system reactivates the Grace Interval and begins the process again. The default setting for the Safe Interval is 2000 requests. The Safe Interval is nice because it lowers the processing overhead of checking every single client request.
If the system does not detect a valid client during the Grace Interval, it issues, and continues to issue, the "Web Scraping Detected" violation until the client reaches the number of requests specified in the Unsafe Interval setting. The Unsafe Interval setting specifies the number of requests that the ASM considers unsafe. Much like with the Safe Interval, after the client sends the number of requests specified in the Unsafe Interval setting, the system reactivates the Grace Interval and begins the process again. The default setting for the Unsafe Interval is 100 requests.
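Taken together, the Grace, Safe, and Unsafe intervals behave like a per-client state machine. The Python sketch below models that behavior using the default values described above; it's an illustration of the logic, not ASM code, and all of the names are hypothetical.

```python
# Rough model of the Grace / Safe / Unsafe intervals as a per-client state machine,
# using the default limits of 100 / 2000 / 100 requests. Hypothetical names throughout.
class IntervalTracker:
    GRACE, SAFE, UNSAFE = "grace", "safe", "unsafe"

    def __init__(self, grace=100, safe=2000, unsafe=100):
        self.limits = {self.GRACE: grace, self.SAFE: safe, self.UNSAFE: unsafe}
        self.state, self.count = self.GRACE, 0

    def on_request(self, verdict=None):
        """verdict is 'human', 'bot', or None (undetermined) for this request.
        Returns the action taken: 'check', 'allow', or 'violation'."""
        if self.state == self.GRACE:
            self.count += 1
            if verdict == "human":                          # valid client found
                self._enter(self.SAFE)
                return "allow"
            if verdict == "bot" or self.count >= self.limits[self.GRACE]:
                self._enter(self.UNSAFE)                    # bot found, or grace exhausted
                return "violation"
            return "check"                                  # keep evaluating the client
        # Safe or Unsafe interval: count requests until the interval is used up,
        # then fall back into the Grace Interval and evaluate this request there.
        self.count += 1
        if self.count > self.limits[self.state]:
            self._enter(self.GRACE)
            return self.on_request(verdict)
        return "allow" if self.state == self.SAFE else "violation"

    def _enter(self, state):
        self.state, self.count = state, 0
```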
The following figure shows the settings for Bot Detection and the values associated with each setting.
Interval Timing
The following picture shows a timeline of client requests and the intervals associated with each request. In the example, the first 100 client requests will fall into the Grace Interval, and during this interval the ASM will be determining whether or not the client is a bot.
Let's say a bot is detected at client request 100. Then, the ASM will immediately invoke the Unsafe Interval and the next 100 requests will be issued a "Web Scraping Detected" violation. When the Unsafe Interval is complete, the ASM reverts back to the Grace Interval.
If, during the Grace Interval, the system determines that the client is a human, it does not check the subsequent requests at all (during the Safe Interval). Once the Safe Interval is complete, the system moves back into the Grace Interval and the process continues.
Notice that the ASM is able to detect a bot before the Grace Interval is complete (as shown in the latter part of the diagram below). As soon as the system detects a bot, it immediately moves into the Unsafe Interval...even if the Grace Interval has not reached its set threshold.
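Using the hypothetical IntervalTracker sketched earlier, you can replay this timeline: undetermined requests exhaust the Grace Interval, the next batch of requests draws violations during the Unsafe Interval, and a later "human" determination opens a Safe Interval.

```python
tracker = IntervalTracker(grace=100, safe=2000, unsafe=100)

grace_batch = [tracker.on_request() for _ in range(100)]    # requests 1-100, no determination
print(grace_batch[-1])                                      # 'violation' -> Unsafe Interval begins
unsafe_batch = [tracker.on_request() for _ in range(100)]   # requests 101-200
print(unsafe_batch[-1])                                     # still 'violation'
print(tracker.on_request())                                 # request 201: back in grace -> 'check'
print(tracker.on_request(verdict="human"))                  # human detected -> 'allow' (Safe Interval)
```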
Setting Thresholds
As you can see from the timeline above, it's important to establish the correct thresholds for each interval setting. The longer you make the Grace Interval, the better the chance the ASM has to detect a bot, but keep in mind that the processing overhead can become expensive. Likewise, the Unsafe Interval setting is a great feature, but if you set it too high, your requesting clients will have to sit through a long period of violation notices before they can access your site again. Finally, the Safe Interval setting allows your users free and open access to your site. If you set it too low, you will force the system to cycle through the Grace Interval unnecessarily, but if you set it too high, a bot might have a chance to sneak past the ASM defense and scrape your site. Remember, the ASM does not check client requests at all during the Safe Interval.
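As a rough back-of-the-envelope illustration using the default values, here is how the interval sizes translate into how often a well-behaved client is actually inspected:

```python
# With the defaults, once a client has been judged human, only the Grace Interval
# requests out of every (Grace + Safe) requests are actually inspected.
grace_interval, safe_interval = 100, 2000
inspected_fraction = grace_interval / (grace_interval + safe_interval)
print(f"{inspected_fraction:.1%} of a valid client's requests fall inside a Grace Interval")
# => roughly 4.8%; raising the Safe Interval lowers this overhead, but it also widens
#    the window in which a bot that slipped through could scrape the site unchecked.
```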
Also, remember that the ASM does not perform web scraping detection on traffic from search engines that the system recognizes as legitimate. If your web application has its own search engine, it's recommended that you add it to the system. Go to Security > Options > Application Security > Advanced Configuration > Search Engines and add it to the list (the ASM comes preconfigured with Ask, Bing, Google, and Yahoo already loaded).
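For illustration only, the sketch below shows the general idea of an allow list keyed on User-Agent. It is not how the ASM verifies search engines internally, and the entries (including the hypothetical "my-site-search" engine) are made up.

```python
# Hypothetical search engine allow list: requests whose User-Agent matches a recognized
# engine would skip web scraping detection. Simple substring matching, for illustration.
RECOGNIZED_ENGINES = ("ask", "bingbot", "googlebot", "yahoo! slurp", "my-site-search")

def is_recognized_search_engine(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(engine in ua for engine in RECOGNIZED_ENGINES)

# Example: is_recognized_search_engine("Mozilla/5.0 (compatible; Googlebot/2.1)")  # -> True
```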
A Quick Test
I loaded up a sample virtual web server (in this case, a fictitious online auction site) and configured the ASM to Alarm and Block bot activity on the site. Then, I fired up the iMacros plugin for Firefox to scrape the site and sent many client requests in a short amount of time. After several requests to the site, I received the response page shown below. You can see that, when web scraping is detected, the response page configured in the ASM is displayed in the Firefox browser window. These settings can be adjusted in the ASM by navigating to Security > Application Security > Blocking > Response Pages.
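If you'd rather script a similar rapid-request test instead of using iMacros, something along these lines would also trip the detection. The URL is a placeholder, the script is only a sketch, and you should only run it against a lab site you own and are authorized to test.

```python
import requests

# Rough alternative to the iMacros test: request a page quickly enough to look like a bot.
TEST_URL = "https://auction-site.example.com/listings"  # placeholder for your own lab server

with requests.Session() as session:
    for i in range(50):
        response = session.get(TEST_URL)
        print(i, response.status_code)
        # Once the ASM flags this session as a bot, the responses should switch to the
        # blocking response page configured under Security > Application Security >
        # Blocking > Response Pages.
```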
Well, thanks for coming back and reading about ASM bot detection. Be sure to swing by again for the final web scraping article where I will discuss session anomalies. I can't possibly think of anything that could be more fun than that!