What Are Scrapers and Why Should You Care?
Introduction
Scrapers are automated tools that extract data from websites and APIs, and they pose significant threats to organizations of all sizes: intellectual property theft, erosion of competitive advantage, degraded website and API performance, and legal liability. OWASP lists scraping in its catalog of automated threats to web applications, defining it as the use of automation to collect application content and/or other data for use elsewhere. Scraping affects businesses across many industries, and its legal status varies by geographic and legal jurisdiction.
What is Scraping?
Scraping involves requesting web pages, loading them, and parsing the HTML to extract the desired data and content. Examples of heavily scraped items include:
- Flights
- Hotel rooms
- Retail product prices
- Insurance rates
- Credit and mortgage interest rates
- Contact lists
- Store locations
- User profiles
Scrapers use automation to make many smaller requests and put the data together in pieces, often with tens of thousands or even millions of individual requests.
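The loop described above can be sketched in a few lines of Python. The snippet below is purely illustrative: the HTML string stands in for a page a scraper would fetch over HTTP, and the product/price markup is a hypothetical example, not any real site's structure. It uses only the standard library's HTML parser to show the core step of parsing markup and extracting the desired fields.

```python
from html.parser import HTMLParser

# Stand-in for a fetched retail page; a real scraper would request this
# over HTTP and repeat the process across thousands of URLs.
PAGE = """
<html><body>
  <div class="product"><span class="name">Widget A</span><span class="price">$19.99</span></div>
  <div class="product"><span class="name">Widget B</span><span class="price">$4.50</span></div>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from spans with class 'name' or 'price'."""
    def __init__(self):
        super().__init__()
        self._field = None    # which field the parser is currently inside
        self._current = {}    # fields gathered for the product in progress
        self.products = []    # finished (name, price) tuples

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:
                self.products.append(
                    (self._current["name"], self._current["price"]))
                self._current = {}

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.products)  # [('Widget A', '$19.99'), ('Widget B', '$4.50')]
```

At scale, a scraper wraps this parsing step in a request loop, rotating IP addresses and user agents to evade rate limits, which is what produces the tens of thousands or millions of requests noted above.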
In the 2024 Bad Bots Review, F5 Labs found that scraping bots drove high levels of automation on two of the three most targeted flows, Search and Quotes, throughout 2023 across the entire F5 Bot Defense network. See figure 1 below. In addition, for organizations without advanced bot defense solutions, up to 70% of all search traffic originates from scrapers. This percentage is based on the numerous proof-of-concept analyses performed for enterprises with no advanced bot controls in place.
Scraper versus Crawler or Spider
Scrapers differ from crawlers and spiders in purpose. Crawlers and spiders traverse websites to index them for search engines. Scrapers, by contrast, are designed to extract and exfiltrate data and content from a website or API so it can be reused, resold, or otherwise repurposed as the scraper intends.
Scraping typically violates the terms and conditions of most websites and APIs, although courts have issued conflicting rulings on its legality, with some decisions overturned on appeal. Most scrapers target information on the web, but activity against APIs is on the rise.
Business Models for Scraping
There are many different parties active in the scraping business, with different business models and incentives for scraping content and data. Figure 2 below provides an overview of the various sources of scraping activity.
Search engine and large platform companies, such as Google, Bing, Facebook, Amazon, and Baidu, index content from websites to help users find things on the internet. Their business model is selling ads placed alongside search results.
Competitors scrape content and data from each other to win customers, market share, and revenue. This takes several forms:
- Competitive pricing: scraping the pricing and availability of competitor products to win increased market share.
- Network scraping: harvesting the names, addresses, and contact details of a company's network partners, such as repair shops, doctors, hospitals, clinics, insurance agents, and brokers.
- Inventory scraping: taking valuable content and data from a competing site for use on one's own site.
Other parties active in scraping include researchers and investment firms, intellectual property owners, data aggregators, news aggregators, and AI companies.
Researchers and investment firms use scraping to gather data for their research and generate revenue by publishing and selling the results of their market research. Intellectual property owners use scraping to identify possible trademark or copyright infringements and ensure compliance with pricing and discounting guidelines.
Data aggregators collect and aggregate data from various sources and sell it to interested parties; some specialize in specific industries. News aggregators, similarly, use scrapers to pull news feeds, blogs, articles, and press releases from various websites and APIs.
Artificial Intelligence (AI) companies scrape data across various industries, often without identifying themselves. As the AI space continues to grow, scraping traffic is expected to increase.
Criminal organizations often scrape websites and applications for various malicious purposes, including phishing, vulnerability scanning, identity theft, and intermediation. Criminals use scrapers to create replicas of a victim's website or app that trick users into providing personally identifiable information (PII). They also use scrapers to probe websites and applications for vulnerabilities, such as flaws that allow access to discounted rates or back-end systems.
Costs of Scraping
Direct costs of scraping include infrastructure costs, degraded server performance and outages, loss of revenue and market share, and intermediation by third parties. Companies prefer direct relationships with their customers, which support selling and marketing, customer retention, cross-selling and upselling, and control over the customer experience; when intermediaries insert themselves, companies can lose control of the end-to-end customer experience, leading to dissatisfied customers. Indirect costs include the loss of investment, intellectual property theft, reputational damage, legal liability, and association with questionable practices.
Conclusion
Scraping is a significant issue affecting enterprises worldwide across many industries. F5 Labs' research shows that almost 1 in 5 search and quote transactions are generated by scrapers. Scraping is carried out by many entities, including search engines, competitors, AI companies, and malicious third parties, and it imposes both direct and indirect costs: loss of revenue, profits, market share, and customer satisfaction.