These Are Not The Scrapes You're Looking For - Session Anomalies

In my first article in this series, I discussed web scraping -- what it is, why people do it, and why it can be harmful.  My second article outlined the details of bot detection and how the ASM blocks these pesky little creatures.  This last article in the series will focus on the final part of the ASM defense against web scraping: session opening anomalies and session transaction anomalies.  These two detection modes are new in v11.3, so if you're running v11.2 or earlier, you should upgrade and take advantage of these great new features!

ASM Configuration

In case you missed it in the bot detection article, here's a quick screenshot that shows the location and settings of the Session Opening and Session Transactions Anomaly features in the ASM.  You'll find all the fun when you navigate to Security > Application Security > Anomaly Detection > Web Scraping.  There are three different settings in the ASM for each Session Anomaly feature: Off, Alarm, and Alarm and Block.  (Note: these settings are configured independently...they don't have to be set to the same value.)

Obviously, if Session Anomaly is set to "Off" then the ASM does not check for anomalies at all.  The "Alarm" setting will detect anomalies and record attack data, but it will allow the client to continue accessing the website.  The "Alarm and Block" setting will detect anomalies, record the attack data, and block the suspicious requests.

[Screenshot: the Web Scraping page at Security > Application Security > Anomaly Detection > Web Scraping, showing the Session Opening Anomaly and Session Transactions Anomaly settings]

Session Opening Anomaly

The first detection and prevention mode we'll discuss is Session Opening Anomaly.  But before we get too deep into this, let's review what a session is.  From a simple perspective, a session begins when a client visits a website, and it ends when the client leaves the site (or the client exceeds the session timeout value).  Most clients will visit a website, surf around some links on the site, find the information they need, and then leave.  When clients don't follow a typical browsing pattern, it makes you wonder what they are up to and if they are one of the bad guys trying to scrape your site.  That's where Session Opening Anomaly defense comes in!

Session Opening Anomaly defense checks for lots of abnormal activities: clients that don't accept cookies or process JavaScript, clients that scrape the site without surfing its internal links, and clients that create a one-time session for each resource they consume.  These one-time sessions lead scrapers to open a large number of new sessions in order to complete their job quickly.

What's Considered A New Session?

Since we are discussing session anomalies, it's worth spending a few sentences on how the ASM differentiates between a new session and an ongoing one for each client request.  Each new client is assigned a "TS cookie," and the ASM uses this cookie to associate future requests from that client with a known, ongoing session.  If the ASM receives a client request that does not contain a TS cookie, then the ASM knows the request is opening a new session.  This will prove very important when calculating the values needed to determine whether or not a client is scraping your site.
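
To make the bookkeeping concrete, here's a minimal sketch of that decision in Python.  This is my illustration only -- the cookie-name check is an assumption, not the ASM's actual logic:

```python
# A minimal sketch: a request with no TS cookie is treated as opening a
# new session. The "TS" prefix check is an assumption for illustration.

def is_new_session(request_cookies: dict) -> bool:
    """Return True if this request should be counted as a new session."""
    return not any(name.startswith("TS") for name in request_cookies)

print(is_new_session({}))                       # True  -> new session
print(is_new_session({"TS0a1b2c": "opaque"}))   # False -> ongoing session
```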

Detection

There are two different methods used by the ASM to detect these anomalies.  The first method compares a calculated value to a predetermined ceiling value for newly opened sessions.  The second method considers the rate of increase of newly opened sessions.  We'll dig into all that in just a minute.  But first, let's look at the criteria used for detecting these anomalies.  As you can see from the screenshot above, there are three detection criteria the ASM uses...they are:

  1. Sessions opened per second increased by:  This specifies that the ASM considers client traffic to be an attack if the number of sessions opened per second increases by a given percentage. The default setting is 500 percent.
  2. Sessions opened per second reached:  This specifies that the ASM considers client traffic to be an attack if the number of sessions opened per second is greater than or equal to this number. The default value is 400 sessions opened per second.
  3. Minimum sessions opened per second threshold for detection: This specifies that the ASM considers traffic to be an attack only if the number of sessions opened per second is greater than or equal to this number and at least one of the "Sessions opened per second increased by" or "Sessions opened per second reached" criteria is also met.  If the number of sessions opened per second is lower than this number, the ASM does not consider the traffic to be an attack even if one of the other criteria was met.  The default value for this setting is 200 sessions opened per second.

In addition, the ASM maintains two variables for each client IP address: a one-minute running average of new session opening rate, and a one-hour running average of new session opening rate.  Both of these variables are recalculated every second.
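
F5 doesn't publish the averaging formula, so purely as a hedged sketch, here's one way such per-second running averages could be maintained, using exponential moving averages as a stand-in:

```python
# Hedged sketch only: assumes exponential moving averages in place of the
# ASM's (unpublished) running-average math. Both averages are updated once
# per second from that second's count of newly opened sessions.

def ema_update(avg: float, sample: float, window_secs: int) -> float:
    alpha = 1.0 / window_secs
    return avg + alpha * (sample - avg)

one_min_avg = one_hour_avg = 0.0
for opened_this_second in [3, 4, 5, 4, 250, 260]:  # made-up per-second counts
    one_min_avg = ema_update(one_min_avg, opened_this_second, 60)
    one_hour_avg = ema_update(one_hour_avg, opened_this_second, 3600)

# The one-minute average reacts to the spike much faster than the one-hour
# average -- that gap between them is what the detection methods exploit.
print(round(one_min_avg, 2), round(one_hour_avg, 2))
```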

Now that we have all the basic building blocks, let's look at how the ASM determines if a client is scraping your site.

First Method: Predefined Ceiling Value

This method uses the user-defined "minimum sessions opened per second threshold for detection" value and compares it to the one-minute running average.  If the one-minute average is less than this number, then nothing else happens because the minimum threshold has not been met.  But, if the one-minute average is higher than this number, the ASM goes on to compare the one-minute average to the user-defined "sessions opened per second reached" value.  If the one-minute average is less than this value, nothing happens.  But, if the one-minute average is higher than this value, the ASM will declare the client a web scraper.  The following flowchart provides a pictorial representation of this process.

[Flowchart: the predefined ceiling value detection process]
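
In code, the ceiling-value check boils down to two comparisons.  Here's a minimal sketch using the default values from above (my illustration, not ASM source):

```python
def ceiling_value_check(one_min_avg: float,
                        min_threshold: float = 200,
                        reached: float = 400) -> bool:
    """First method: declare a web scraper only if the one-minute running
    average meets the minimum threshold AND the 'reached' ceiling."""
    if one_min_avg < min_threshold:
        return False             # minimum threshold not met; do nothing
    return one_min_avg >= reached

print(ceiling_value_check(150))  # False -- below the minimum threshold
print(ceiling_value_check(300))  # False -- above the minimum, below the ceiling
print(ceiling_value_check(450))  # True  -- declared a web scraper
```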

Second Method: Rate of Increase

The second detection method uses several variables to compare the rate of increase of newly opened sessions against user-defined variables.  Like the first method, this method first checks to make sure the minimum sessions opened per second threshold is met before doing anything else.  If the minimum threshold has been met, the ASM will perform a few more calculations to determine if the client is a web scraper or not.  The "sessions opened per second increased by" value (percentage) is multiplied by the one-hour running average and this value is compared to the one-minute running average.  If the one-minute average is greater, then the ASM declares the client a web scraper.  If the one-minute average is lower, then nothing happens.  The following matrix shows a few examples of this detection method.  Keep in mind that the one-minute and one-hour averages are recalculated every second, so these values will be very dynamic.

[Matrix: example one-minute and one-hour average values and the resulting detection decisions]
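
Here's the same logic as a minimal sketch, again with the default values (my illustration of the documented comparison):

```python
def rate_of_increase_check(one_min_avg: float,
                           one_hour_avg: float,
                           min_threshold: float = 200,
                           increased_by_pct: float = 500) -> bool:
    """Second method: declare a web scraper if the one-minute average
    exceeds the one-hour average scaled by the 'increased by' percentage."""
    if one_min_avg < min_threshold:
        return False                           # minimum threshold not met
    return one_min_avg > one_hour_avg * (increased_by_pct / 100.0)

# A one-hour average of 50 sessions/sec * 500% = 250; a one-minute average
# of 300 exceeds that, so this client is declared a web scraper.
print(rate_of_increase_check(one_min_avg=300, one_hour_avg=50))  # True
print(rate_of_increase_check(one_min_avg=300, one_hour_avg=80))  # False (300 <= 400)
```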

Prevention

The ASM provides several policies to prevent session opening anomalies.  It begins with the first method you enable in the list; if the system finds that method is not effective enough to stop the attack, it uses the next method you enable.  The following screenshots show the different options available for prevention.  The "Drop IP Addresses with bad reputation" option is tied to Rate Limiting, so it will not appear unless you enable Rate Limiting.  Note that IP Address Intelligence must be licensed and enabled; this feature is licensed separately from the other ASM web scraping options.

[Screenshots: the Session Opening Anomaly prevention options and IP Address Intelligence categories]

Here's a quick breakdown of what each of these prevention policies does for you:

  • Client Side Integrity Defense: The system determines whether the client is a legal browser or an illegal script by sending a JavaScript challenge to each new session request from the detected IP address and waiting for a response.  The JavaScript challenge will typically involve some sort of computational challenge (a toy example appears after this list).  Legal browsers will respond with a TS cookie while illegal scripts will not.  The default for this feature is disabled.
  • Rate Limiting: The goal of Rate Limiting is to keep the volume of new sessions at a "non-attack" level.  The system will drop sessions from suspicious IP addresses after the system determines that the client is an illegal script.  The default for this feature is also disabled.
  • Drop IP Addresses with bad reputation: The system drops requests from IP addresses that have a bad reputation according to the system's IP Address Intelligence database (shown above).  The ASM will drop all requests from any "bad" IP addresses even if they respond with a TS cookie.  IP addresses that do not have a bad reputation also undergo rate limiting.  The default for this option is disabled.  Keep in mind that this option is available only after Rate Limiting is enabled.  In addition, this option is only enforced if at least one of the IP Address Intelligence categories is set to Alarm mode.
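
To give a feel for the "computational challenge" idea behind the Client Side Integrity Defense, here's a toy proof-of-work example.  To be clear, this is entirely my own illustration -- the ASM's actual JavaScript challenge is proprietary, and every name below is hypothetical:

```python
import hashlib

def solve(seed: str, difficulty: int = 2) -> int:
    """What a real browser's JavaScript engine would burn cycles on: find
    a nonce whose hash starts with `difficulty` zero hex digits. Cheap for
    one visitor, expensive for a script opening thousands of sessions."""
    nonce = 0
    while not hashlib.sha256(f"{seed}:{nonce}".encode()).hexdigest().startswith("0" * difficulty):
        nonce += 1
    return nonce

def verify(seed: str, nonce: int, difficulty: int = 2) -> bool:
    """Server side: verification is a single hash. Clients that never
    executed the JavaScript never present a valid answer (or TS cookie)."""
    return hashlib.sha256(f"{seed}:{nonce}".encode()).hexdigest().startswith("0" * difficulty)

print(verify("session-abc123", solve("session-abc123")))  # True
```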

Prevention Duration

Now that we have detected session opening anomalies and mitigated them using our prevention options, we must figure out how long to apply the prevention measures.  This is where the Prevention Duration comes in.  This setting specifies the length of time that the system will prevent an attack. The system prevents attacks by rejecting requests from the attacking IP address. There are two settings for Prevention Duration:

  1. Unlimited: This specifies that after the system detects and stops an attack, it performs attack prevention until it detects the end of the attack.  This is the default setting.
  2. Maximum <number of> seconds: This specifies that after the system detects and stops an attack, it performs attack prevention for the amount of time indicated unless the system detects the end of the attack earlier. 

So, to finish up our Session Opening Anomaly part of this article, I wanted to share a quick scenario.  I was recently reading several articles from some of the web scrapers around the block, and I found one guy's solution to work around web scraping defense.  Here's what he said: "Since the service conducted rate-limiting based on IP address, my solution was to put the code that hit their service into some client-side JavaScript, and then send the results back to my server from each of the clients.  This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit."

This guy is really smart!  And this tactic would work great against a web scraping defense that only offered a Rate Limiting feature.  Here's the pop quiz question: if a user were to deploy this same tactic against the ASM, what would you do to catch this guy?  I'm thinking you would need to set your minimum threshold at an appropriate level (this ensures the ASM kicks into gear when all these sessions are opened), and then the "sessions opened per second reached" or the "sessions opened per second increased by" criteria should take care of the rest for you.  As always, it's important to learn what each setting does and then test it in your own environment for a period of time to ensure you have everything tuned correctly.  And don't forget to revisit your settings from time to time...you will probably need to change them as your network environment changes.

Session Transactions Anomaly

The second detection and prevention mode is Session Transactions Anomaly.  This mode specifies how the ASM reacts when it detects a large number of transactions per session as well as a large increase in session transactions.  Keep in mind that web scrapers are designed to extract content from your website as quickly and efficiently as possible, so they normally perform many more transactions than a typical application client.  Even if a web scraper found a way around all the other defenses we've discussed, the Session Transactions Anomaly defense should be able to catch it based on the sheer number of transactions it performs during a given session.  The ASM detects this activity by counting the number of transactions per session and comparing that number to a total average of transactions from all sessions.  The following screenshot shows the detection and prevention criteria for Session Transactions Anomaly.

[Screenshot: the Session Transactions Anomaly detection and prevention settings]

Detection

How does the ASM detect all this bad behavior?  Well, since it's trying to find clients that surf your site much more than other clients, it tracks the number of transactions per client session (note: the ASM will drop a session from the table if no transactions are performed for 15 minutes).  It also tracks the average number of transactions for all current sessions (note: the ASM calculates the average transaction value every minute).  It can use these two figures to compare a specific client session to a reasonable baseline and figure out if the client is performing too many transactions.  The ASM can automatically figure out the number of transactions per client, but it needs some user-defined thresholds to conduct the appropriate comparisons.  These thresholds are as follows:

  1. Session transactions increased by: This specifies that the system considers traffic to be an attack if the number of transactions per session increased by the percentage listed.  The default setting is 500 percent.
  2. Session transactions reached: This specifies that the system considers traffic to be an attack if the number of transactions per session is equal to or greater than this number.  The default value is 400 transactions.
  3. Minimum session transactions threshold for detection: This specifies that the system considers traffic to be an attack only if the number of transactions per session is equal to or greater than this number and at least one of the "Session transactions increased by" or "Session transactions reached" criteria is also met.  If the number of transactions per session is lower than this number, the system does not consider the traffic to be an attack even if one of the other criteria was met.  The default value is 200 transactions.
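
Here's a small sketch of the bookkeeping described above: counting transactions per session, dropping sessions idle for 15 minutes, and recomputing the all-sessions average.  The data structures are my own assumption for illustration, and the per-session counts are made up to match the example average below:

```python
import time

IDLE_LIMIT = 15 * 60   # sessions idle this long are dropped from the table
sessions = {}          # session_id -> {"txns": int, "last_seen": float}

def record_transaction(session_id: str, now: float) -> None:
    entry = sessions.setdefault(session_id, {"txns": 0, "last_seen": now})
    entry["txns"] += 1
    entry["last_seen"] = now

def average_transactions(now: float) -> float:
    """Evict idle sessions, then average the rest (the ASM recalculates
    this average every minute)."""
    for sid in [s for s, e in sessions.items() if now - e["last_seen"] > IDLE_LIMIT]:
        del sessions[sid]
    return sum(e["txns"] for e in sessions.values()) / max(len(sessions), 1)

now = time.time()
for sid, count in [("session1", 100), ("session2", 30), ("session3", 250),
                   ("session4", 40), ("session5", 30)]:
    for _ in range(count):
        record_transaction(sid, now)
print(average_transactions(now))  # 450 / 5 = 90.0 -- the example average
```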

The following table shows an example of how the ASM calculates transaction values (averages and individual sessions).

[Table: example per-session transaction counts; sessions 1 and 3 sit above the 90-transaction average, with session 3 at 250 transactions]

We would expect that a given client session would perform about the same number of transactions as the overall average number of transactions per session.  But, if one of the sessions is performing a significantly higher number of transactions than the average, then we start to get suspicious.  You can see that session 1 and session 3 have transaction values higher than the average, but that only tells part of the story.  We need to consider a few more things before we decide if this client is a web scraper or not.  By the way, if the ASM knows that a given session is malicious, it does not use that session's transaction numbers when it calculates the average.

Now, let's roll in the threshold values that we discussed above.  If the ASM is going to declare a client as a web scraper using the session transaction anomaly defense, the session transactions must first reach the minimum threshold.  Using our default minimum threshold value of 200, the only session that exceeded the minimum threshold is session 3 (250 > 200).  All other sessions look good so far...keep in mind that these numbers will change as the client performs additional transactions during the session, so more sessions may be considered as their transaction numbers increase.

Since we have our eye on session 3 at this point, it's time to look at our two methods of detecting an attack. 

The first detection method is a simple comparison of the total session transaction value to our user-defined "session transactions reached" threshold.  If the total session transactions is larger than the threshold, the ASM will declare the client a web scraper. 

Our example would look like this:

Is session 3 transaction value >  threshold value (250 > 400)?  No, so the ASM does not declare this client as a web scraper.

The second detection method uses the "Session transactions increased by" value along with the average transaction value for all sessions.  The ASM multiplies the average transaction value by the "increased by" percentage to calculate the value needed for comparison.

Our example would look like this: 

90 * 500% = 450 transactions

Is session 3 transaction value > result (250 > 450)?  No, so the ASM does not declare this client as a web scraper. 

By the way, only one of these detection methods needs to be met for the ASM to declare the client as a web scraper.  You should be able to see how the user-defined thresholds are used in these calculations and comparisons.  So, it's important to raise or lower these values as you need for your environment.
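
Putting the worked example together, here's a sketch of the full session transactions check using the numbers above (again, my illustration of the documented logic, not ASM code):

```python
def transaction_anomaly_check(session_txns: int,
                              avg_txns: float,
                              min_threshold: int = 200,
                              reached: int = 400,
                              increased_by_pct: float = 500) -> bool:
    """Declare a web scraper if the minimum threshold is met AND either
    detection method fires (only one method needs to be met)."""
    if session_txns < min_threshold:
        return False                                  # below the minimum
    method1 = session_txns >= reached                 # "reached" ceiling
    method2 = session_txns > avg_txns * (increased_by_pct / 100.0)
    return method1 or method2

# Session 3: 250 transactions against a 90-transaction average.
# 250 < 400 and 250 < 450 (90 * 500%), so no detection -- yet.
print(transaction_anomaly_check(session_txns=250, avg_txns=90))  # False
print(transaction_anomaly_check(session_txns=475, avg_txns=90))  # True
```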

Prevention Duration

In order to save you a bunch of time reading about prevention duration, I'll just say that the Session Transactions Anomaly prevention duration works the same as the Session Opening Anomaly prevention duration (Unlimited vs Maximum <number of> seconds).  See, that was easy!

Conclusion

Thanks for spending some time reading about session anomalies and web scraping defense.  The ASM does a great job of detecting and preventing web scrapers from taking your valuable information.  One more thing...for an informative anomaly discussion on the DevCentral Security Forum, check out this conversation.

If you have any questions about web scraping or ASM configurations, let me know...you can fill out the comment section below or you can contact the DevCentral team at https://devcentral.f5.com/s/community/contact-us.
