How to Identify and Manage Scrapers (Pt. 2)

Introduction

Welcome back to part two of the article on how to identify and manage scrapers. While part one focused on ways to identify and detect scrapers, part two will highlight various approaches to prevent, manage, and reduce scraping.

9 Ways to Manage Scrapers

We'll start by highlighting some of the top methods used to manage scrapers in order to help you find the method best suited for your use case.

1. Robots.txt

A website's robots.txt file declares rules for bots and scrapers, but it carries no enforcement power: scrapers routinely ignore those rules and take whatever data they want. Other scraper management techniques are therefore needed to enforce compliance with the rules the file can only request.
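As a minimal sketch of why robots.txt is advisory, the snippet below parses a sample policy with Python's standard-library `urllib.robotparser`. An ethical crawler asks the parser before fetching; nothing in the protocol stops a scraper from skipping that check. The policy contents are illustrative.

```python
import urllib.robotparser

# A sample robots.txt policy: block everyone from /pricing/, allow the rest.
rules = [
    "User-agent: *",
    "Disallow: /pricing/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# An ethical crawler checks before fetching...
print(rp.can_fetch("EthicalBot", "/pricing/latest"))  # False
print(rp.can_fetch("EthicalBot", "/blog/post-1"))     # True
# ...but a scraper can simply fetch /pricing/ anyway:
# robots.txt is a request, not an access control.
```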

2. Site, App, and API Design to Limit Data Provided to Bare Minimum

The most direct way to manage scrapers is to remove access to the data they want, though this is not always feasible when that data serves a business-critical purpose. Where it is feasible, designing websites, mobile apps, and APIs to expose only the minimum data required effectively reduces unwanted scraping.
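One way to apply this design principle is a server-side field allow-list: the API serializes only the fields a page actually needs, never the full internal record. The schema and field names below are hypothetical.

```python
# Hypothetical public schema: only these fields ever leave the server.
PUBLIC_FIELDS = {"name", "city", "rating"}

def to_public(record: dict) -> dict:
    """Project an internal record down to its public fields."""
    return {k: v for k, v in record.items() if k in PUBLIC_FIELDS}

internal = {
    "name": "Acme Store",
    "city": "Austin",
    "rating": 4.7,
    "supplier_cost": 12.50,   # business-sensitive: never serialized
    "margin_pct": 38.0,       # business-sensitive: never serialized
}
print(to_public(internal))  # only name, city, rating survive
```

Scrapers can still harvest the public projection, but the data most valuable to competitors never appears in any response.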

3. CAPTCHA/reCAPTCHA

CAPTCHAs (including reCAPTCHA and similar tests) manage and mitigate scrapers by presenting challenges a user must solve to prove they are human; passing the test grants access to data. However, these challenges add friction and decrease conversion rates. And with advances in image recognition, computer vision, and AI, scrapers and bots have become adept at solving CAPTCHAs, making them ineffective against more sophisticated scrapers.
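The gating flow is simple to sketch. The `verify_token` function below is a hypothetical stand-in for a real verification call (production systems POST the client's token to their CAPTCHA provider's verify endpoint); the point is that data is served only after a challenge has been passed.

```python
from typing import Optional

def verify_token(token: str) -> bool:
    # Stand-in for a real provider check; a production implementation
    # submits the token to the CAPTCHA provider's verification API.
    return token == "solved-challenge"

def handle_request(token: Optional[str]) -> str:
    """Serve data only to clients holding a valid challenge token."""
    if token is None or not verify_token(token):
        return "403: solve the CAPTCHA first"
    return "200: here is the data"

print(handle_request(None))                # blocked: no challenge passed
print(handle_request("solved-challenge"))  # allowed
```

A CAPTCHA-solving bot defeats this flow by obtaining a valid token the same way a human would, which is exactly the weakness described above.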

4. Honey Pot Links

Unlike humans, scrapers parse the raw page and therefore "see" elements hidden from visual rendering, such as invisible form fields and links. Security teams and web designers can plant these elements on web pages and respond when a scraper interacts with them, for example by forwarding the client to a honeypot or serving incomplete results.
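As an illustrative sketch (the trap path, IP addresses, and responses are all hypothetical), a server can plant a CSS-hidden link that humans never see, then flag any client that requests the trap path and quietly degrade its results:

```python
# A link hidden from humans (display:none) but visible to scrapers
# that parse the raw HTML. Any client requesting the trap path is flagged.
TRAP_PATH = "/internal/full-catalog"  # never linked visibly anywhere

HIDDEN_LINK = (
    f'<a href="{TRAP_PATH}" style="display:none" '
    'tabindex="-1" aria-hidden="true">full catalog</a>'
)

flagged_clients = set()

def handle(client_ip: str, path: str) -> str:
    if path == TRAP_PATH:
        flagged_clients.add(client_ip)   # only a scraper follows this link
        return "200: decoy results"
    if client_ip in flagged_clients:
        return "200: degraded results"   # keep serving junk, don't tip them off
    return "200: real results"

handle("203.0.113.7", TRAP_PATH)          # scraper follows the hidden link
print(handle("203.0.113.7", "/search"))   # degraded results
print(handle("198.51.100.2", "/search"))  # real results
```

Serving decoy data instead of a hard block avoids telling the scraper it has been detected.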

5. Require All Users to be Authenticated

Most scraping occurs without authentication, making it difficult to enforce per-user access limits. Requiring every user to authenticate before requesting data improves control: less motivated scrapers may be unwilling to create accounts, while sophisticated scrapers resort to fake account creation. (F5 Labs has published an entire article series on fake account creation bots.) These skilled scrapers distribute their data requests across many fake accounts, keeping each account within its request limit. Even so, requiring authentication discourages less motivated scrapers and improves data security.
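The control that authentication enables is a per-account quota, sketched below with an illustrative limit. The final comment also shows why fake accounts defeat it: each account stays under the cap while the fleet collectively exceeds it.

```python
from collections import Counter

DAILY_LIMIT = 100                 # illustrative per-account quota
requests_today = Counter()

def allow_request(account_id: str) -> bool:
    """Count this request against the account's daily quota."""
    requests_today[account_id] += 1
    return requests_today[account_id] <= DAILY_LIMIT

# A single account hits the cap on request 101...
results = [allow_request("acct-1") for _ in range(101)]
print(results[99], results[100])  # True False
# ...which is why sophisticated scrapers spread their requests across
# many fake accounts, each staying just under the per-account limit.
```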

6. Cookie/Device Fingerprint-Based Controls

To limit requests per user, cookie-based tracking or device/TLS fingerprinting can be used; both are invisible to legitimate users. They come with challenges, however: cookies can be deleted, and fingerprints can collide across distinct users (false positives) or split a single user across multiple fingerprints (missed detections). Advanced scrapers using tools like Browser Automation Studio (BAS) have anti-fingerprinting capabilities, including fingerprint switching, which help them bypass these types of controls.
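A toy version of this control derives a fingerprint by hashing a few request attributes and rate-limits per fingerprint. The attributes and limit are illustrative; real systems mix in many more signals (TLS ClientHello, canvas, fonts, and so on), but the collision risk and the fingerprint-switching bypass work the same way.

```python
import hashlib
from collections import Counter

def fingerprint(headers: dict) -> str:
    """Hash a few header values into a short device fingerprint."""
    material = "|".join(
        headers.get(k, "")
        for k in ("User-Agent", "Accept-Language", "Accept-Encoding")
    )
    return hashlib.sha256(material.encode()).hexdigest()[:16]

LIMIT = 3
seen = Counter()

def allow(headers: dict) -> bool:
    fp = fingerprint(headers)
    seen[fp] += 1
    return seen[fp] <= LIMIT

h = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US",
     "Accept-Encoding": "gzip"}
print([allow(h) for _ in range(4)])  # [True, True, True, False]
# Two different users with identical headers collide on one fingerprint
# (the false-positive risk), and a tool that switches fingerprints per
# session simply gets a fresh counter every time.
```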

Figure 1: BAS website highlighting browser fingerprint switching features

7. WAF Based Blocks and Rate Limits (UA and IP)

Web Application Firewalls (WAFs) manage scrapers with rules based on user agent strings, headers, and IP addresses. These rules fail against sophisticated scrapers, who evade them by presenting common browser user agent strings, natural-looking header orders, and large pools of IP addresses.
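WAF-style static rules reduce to checks like the sketch below: block known automation user agents and rate-limit per source IP (all values illustrative). The closing comment shows the evasion path described above.

```python
from collections import Counter

BLOCKED_UA_SUBSTRINGS = ("python-requests", "curl", "scrapy")  # illustrative
IP_RATE_LIMIT = 5                                              # illustrative
hits_per_ip = Counter()

def waf_decision(ip: str, user_agent: str) -> str:
    # Rule 1: block obvious automation user agents.
    if any(s in user_agent.lower() for s in BLOCKED_UA_SUBSTRINGS):
        return "block"
    # Rule 2: rate-limit requests per source IP.
    hits_per_ip[ip] += 1
    if hits_per_ip[ip] > IP_RATE_LIMIT:
        return "block"
    return "allow"

print(waf_decision("192.0.2.1", "python-requests/2.31"))  # block
print(waf_decision("192.0.2.2", "Mozilla/5.0"))           # allow
# A sophisticated scraper spoofs a browser UA and rotates through
# thousands of proxy IPs, staying under every per-IP limit.
```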

8. Basic Bot Defense

Basic bot defense solutions use JavaScript-based signals collection, CAPTCHA, device fingerprinting, and user behavior analytics to identify scrapers, supplemented by IP reputation and geo-blocking. Because their signals collection scripts are not obfuscated, encrypted, or randomized, sophisticated scrapers can easily reverse engineer them. These solutions can be bypassed using new-generation automation tools like BAS and Puppeteer, or by using high-quality proxy networks with high-reputation IP addresses. Advanced scrapers can easily craft spoofed packets to bypass the defense system.

9. Advanced Bot Defense

Advanced enterprise-grade bot defense solutions use randomized, obfuscated signals collection with tamper protection to prevent reverse engineering. They use encryption and machine learning (ML) to build robust detection and mitigation systems. These solutions are effective against sophisticated scrapers, including AI companies, and adapt to changing automation techniques, providing long-term protection against both known and not-yet-identified scrapers.

Scraper Management Methods/Controls Comparison and Evaluation

Table 1 (below) evaluates scraper management methods and controls, providing a rating score (out of 5) for each, with higher scores indicating more effective control.

| Control | Pros | Cons | Rating |
| --- | --- | --- | --- |
| Robots.txt | Cheap; easy to implement; effective against ethical bots | No enforcement; ignored by most scrapers | 1 |
| Application redesign | Cheap | Not always feasible due to business need | 1.5 |
| CAPTCHA | Cheap; easy to implement | Adds user friction and hurts conversion; solvable by sophisticated scrapers | 1.5 |
| Honey pot links | Cheap; easy to implement | Easily bypassed by more sophisticated scrapers | 1.5 |
| Require authentication | Cheap; easy to implement; effective against less motivated scrapers | Not always feasible due to business need; results in a fake account creation problem | 1.5 |
| Cookie/fingerprint-based controls | Cheaper than other solutions; easier to implement; effective against low-sophistication scrapers | High risk of false positives from collisions; ineffective against medium- to high-sophistication scrapers | 2 |
| Web Application Firewall | Cheaper than other solutions; effective against low- to medium-sophistication scrapers | High risk of false positives from UA, header, or IP-based rate limits; ineffective against high-sophistication scrapers | 2.5 |
| Basic bot defense | Effective against low- to medium-sophistication scrapers | Relatively expensive; ineffective against high-sophistication scrapers; poor long-term efficacy; complex to implement and manage | 3.5 |
| Advanced bot defense | Effective against the most sophisticated scrapers; long-term efficacy | Expensive; complex to implement and manage | 5 |

Conclusion

There are many methods of identifying and managing scrapers, as highlighted above, each with its pros and cons. Advanced bot defense solutions, though costly and complex, are the most effective against all levels of scraper sophistication. To read the full article, including more detail on all the management options described here, head over to our post on F5 Labs.

Updated Sep 20, 2024
Version 2.0