How to Identify and Manage Scrapers (Pt. 1)
Introduction
The latest addition to our scraper series focuses on how to identify and manage scrapers, split across two parts. Part one outlines ways to identify and detect scrapers, while part two covers tactics to help manage them.
How to Identify Scraping Traffic
The first step in identifying scraping traffic involves choosing detection methods based on the scraper's motivations and approach. Some scrapers, like benign search engine bots, self-identify so that network and security teams can grant them permission. Others, like AI companies, competitors, and malicious scrapers, conceal themselves, making detection difficult. More sophisticated approaches are needed to combat these types of scrapers.
Self-Identifying Scrapers
Several scrapers announce themselves and are very easy to identify. These bots self-identify via the HTTP user-agent string, either because they have explicit permission to scrape or because they believe they provide a valuable service. They generally fall into three categories:
- Search Engine Bots/Crawlers
- Performance or Security Monitoring
- Archiving
Many scraper operators publish detailed documentation about their bots, including how to identify them, the IP addresses they use, and how to opt out. It's worth reviewing this documentation for any scrapers of interest, since unscrupulous scrapers often impersonate well-known ones; operators frequently provide tools to verify whether a given bot is genuine or an imposter. Links to this documentation, along with screenshots, are provided in our full blog on F5 Labs.
Many scrapers identify themselves in the user-agent string by appending a token that contains one or more of the following:
- The name of the company, service or tool that is doing the scraping
- A website address for the company, service or tool that is doing the scraping
- A contact email for the administrator of the entity doing the scraping
- Other text explaining what the scraper is doing or who they are
A key way to identify self-identifying scrapers is to search the user-agent field in your server logs for specific strings. Table 1 below outlines common strings you can look for.
| Self-identification method | Search string |
| --- | --- |
| Name of the tool or service | *bot* or *Bot* |
| Website address | *www* or *.com* |
| Contact email | *@* |
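The searches in Table 1 can be scripted against your server logs. Below is a minimal Python sketch; the exact pattern set and the classification labels are illustrative assumptions, not a complete detection rule:

```python
import re

# Search patterns modeled on Table 1 (an illustrative subset).
PATTERNS = {
    "tool name": re.compile(r"bot", re.IGNORECASE),
    "website address": re.compile(r"www\.|\.com"),
    "contact email": re.compile(r"@"),
}

def self_id_methods(user_agent):
    """Return which Table 1 self-identification methods a UA string matches."""
    return [name for name, pat in PATTERNS.items() if pat.search(user_agent)]

ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
      "compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot")
print(self_id_methods(ua))  # matches both the tool name and the website address
```

In practice you would run this over the user-agent field of each log line and review the matches, since generic patterns like `*@*` will also catch some legitimate browser strings.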
Examples of User Agent Strings
OpenAI searchbot user agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Bing search bot user agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/
These scrapers have both the name of the tool or service, as well as the website in the user-agent string and can be identified using two of the methods highlighted in Table 1 above.
Impersonation
Because user agents are self-reported, they are easily spoofed. Any scraper can pretend to be a known entity like Google bot by simply presenting the Google bot user agent string. We have observed countless examples of fake bots impersonating large known scrapers like Google, Bing and Facebook.
As one example, Figure 1 below shows the traffic overview of a fake Google scraper bot. This scraper was responsible for almost a hundred thousand requests per day against a large US hotel chain's room search endpoints.
The bot used the following user-agent string, which is identical to the one used by the real Google bot.
Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) (765362)
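One way to unmask such impersonators is forward-confirmed reverse DNS, the procedure Google itself documents for verifying Googlebot: the reverse DNS name of the client IP must fall under a Google-owned domain, and a forward lookup of that name must resolve back to the same IP. A minimal Python sketch follows; the injectable resolver parameters are an assumption added so the logic can be exercised without live DNS:

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, rdns=None, fdns=None):
    """Forward-confirmed reverse DNS check for a claimed Googlebot.

    1. The rDNS hostname of `ip` must end in a Google-owned domain.
    2. The forward lookup of that hostname must include the same IP.
    """
    rdns = rdns or (lambda addr: socket.gethostbyaddr(addr)[0])
    fdns = fdns or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = rdns(ip)
    except OSError:
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in fdns(host)
    except OSError:
        return False
```

A spoofed user-agent string fails this check immediately, because the attacker's IP will not reverse-resolve into Google's namespace.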
IP-based Identification
Scrapers can also be identified via their IP addresses. A whois lookup on a scraper's IP address can reveal the organization that owns it and its registered ASN. While this won't reveal the identity of the actual entity behind the scraper, it can be useful in certain cases. Geolocation information can also help flag automated scraping activity.
Reverse DNS lookups use the Domain Name System (DNS) to find the domain name associated with an IP address, and free online reverse-DNS lookup services make this easy to do. Because IP address spoofing is non-trivial, identifying and allowlisting scrapers by IP address is more secure than relying on user agents alone.
Artificial Intelligence (AI) Scrapers
Artificial intelligence companies increasingly scrape the internet for model-training data, driving a surge in scraping activity. This data often powers for-profit AI services, which sometimes compete with the very sites being scraped. Several lawsuits against these companies are currently underway.
A California class-action lawsuit has been filed by 16 claimants against OpenAI, alleging copyright infringement due to the scraping and use of their data for model training.
Due to all the sensitivity around AI companies scraping data from the internet, two things have happened.
- Growing scrutiny of these companies has forced them to start publishing details of their scraping activity and ways to both identify these AI scrapers as well as ways to opt out of your applications being scraped.
- As opt-outs from AI scraping have increased, some AI companies have lost access to the data needed to power their apps. Some less ethical AI companies have since set up alternative “dark scrapers” which do not self-identify, and instead secretly continue to scrape the data needed to power their AI services.
Unidentified Scrapers
Most scrapers don't identify themselves or request explicit permission, leaving application, network, and security teams unaware of their activities on Web, Mobile, and API applications. Identifying scrapers can be challenging, but below you'll find two techniques that we have used in the past that can help identify the organization or actors behind them. To view additional techniques along with an in-depth explanation of each, head over to our blog post on F5 Labs.
1. Requests for Obscure or Non-Existent Resources
Scrapers often request obscure or low-volume resources that real users rarely touch, such as unusual flight availability and pricing combinations. Because they construct requests programmatically and send them directly to origin servers, they sometimes request resources that don't exist at all. Figure 2 shows an example of a scraper that was scraping an airline’s flights and requesting flights to and from a train station.
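This pattern can be surfaced from access logs by flagging clients with an unusually high share of 404 responses. The record format, sample paths, and the 50% threshold below are illustrative assumptions:

```python
from collections import Counter

# Hypothetical parsed access-log records: (client_ip, path, status).
records = [
    ("203.0.113.7", "/flights?from=ZWS&to=QQP", 404),  # nonexistent routes
    ("203.0.113.7", "/flights?from=ZWS&to=XIT", 404),
    ("203.0.113.7", "/flights?from=JFK&to=LHR", 200),
    ("198.51.100.2", "/flights?from=SFO&to=SEA", 200),
]

def high_404_clients(records, threshold=0.5):
    """Flag client IPs whose share of 404 responses exceeds `threshold`.

    Scrapers enumerating resource combinations often request things
    (e.g. impossible city pairs) that real users never would.
    """
    totals, misses = Counter(), Counter()
    for ip, _path, status in records:
        totals[ip] += 1
        if status == 404:
            misses[ip] += 1
    return [ip for ip in totals if misses[ip] / totals[ip] > threshold]

print(high_404_clients(records))  # the enumerating client stands out
```

The threshold should be tuned against your site's normal 404 rate, since broken links and stale bookmarks also generate misses from legitimate users.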
2. IP Infrastructure Analysis, Use of Hosting Infra or Corporate IP Ranges (Geo Location Matching)
Scrapers distribute traffic via proxy networks or botnets to avoid IP-based rate limits, but these distribution tactics themselves leave patterns that can help identify them. Some of these tactics include:
- Round-robin IP or UA usage
- Use of hosting IPs
- Use of low-reputation IPs
- Use of international IPs that do not match expected user locations
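As one sketch of spotting the first tactic, a client cycling through many user agents from a single IP is behaving unlike any real browser. The record format and the threshold of two user agents per IP are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical request log: (client_ip, user_agent).
requests = [
    ("192.0.2.10", "Mozilla/5.0 (Windows NT 10.0)"),
    ("192.0.2.10", "Mozilla/5.0 (Macintosh; Intel Mac OS X)"),
    ("192.0.2.10", "Mozilla/5.0 (X11; Linux x86_64)"),
    ("192.0.2.10", "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0)"),
    ("198.51.100.9", "Mozilla/5.0 (Windows NT 10.0)"),
]

def rotating_ua_ips(requests, max_uas=2):
    """Flag IPs presenting more distinct user agents than a real
    browser plausibly would; round-robin UA rotation is a common
    evasion tactic."""
    uas = defaultdict(set)
    for ip, ua in requests:
        uas[ip].add(ua)
    return [ip for ip, seen in uas.items() if len(seen) > max_uas]

print(rotating_ua_ips(requests))
```

The same grouping approach works in reverse for round-robin IP usage: group by session or device fingerprint and count distinct IPs.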
The following are additional things to keep in mind when trying to identify scrapers. We provide an in-depth overview of each in our full article on F5 Labs.
- Conversion or look-to-book analysis
- Not downloading or fetching images and dependencies but just data
- Behavior/session analysis
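To illustrate the first item above, a look-to-book ratio divides a client's searches ("looks") by its completed purchases ("books"); scrapers search constantly but almost never buy, so their ratio sits far above the site's baseline. The sample numbers below are illustrative assumptions:

```python
def look_to_book_ratio(searches, bookings):
    """Searches per booking for a given client. A ratio far above the
    site's normal baseline suggests scraping, since scrapers generate
    looks without ever converting."""
    return searches / bookings if bookings else float("inf")

# A typical shopper might search a few dozen times per booking;
# a client with thousands of searches and zero bookings is suspect.
print(look_to_book_ratio(30, 1))    # ordinary shopper
print(look_to_book_ratio(5000, 0))  # never converts
```

What counts as an abnormal ratio depends on the vertical and should be calibrated against your own conversion data.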
Conclusion
We discussed two techniques above that can help in identifying a scraper. Keep in mind, however, that correctly identifying a scraper requires considering both the type of scraper and the sort of data it is targeting.
To read the full article on identifying scrapers, which includes more identification methods, head on over to our post on F5 Labs. Otherwise, continue on to part two where we’ll outline tactics to help manage scrapers.