f5 labs
How to Identify and Manage Scrapers (Pt. 2)
Introduction

Welcome back to part two of the article on how to identify and manage scrapers. While part one focused on ways to identify and detect scrapers, part two highlights various approaches to prevent, manage, and reduce scraping.

9 Ways to Manage Scrapers

We'll start by highlighting some of the top methods used to manage scrapers, to help you find the method best suited for your use case.

1. Robots.txt

The robots.txt file on a website contains rules for bots and scrapers, but it has no enforcement power. Scrapers often ignore these rules and scrape whatever data they want, so other scraper management techniques are needed to enforce compliance.

2. Site, App, and API Design to Limit Data Provided to the Bare Minimum

One way to manage scrapers is to stop exposing the data they are after. Designing websites, mobile apps, and APIs to limit or remove the data they expose effectively reduces unwanted scraping, though this is not always feasible because the data may be business critical.

3. CAPTCHA/reCAPTCHA

CAPTCHAs (including reCAPTCHA and other challenge tests) are used to manage and mitigate scrapers by requiring users to prove they are human before being given access to data. However, they add friction and decrease conversion rates, and with advancements in image recognition, computer vision, and AI, scrapers and bots have become adept at solving CAPTCHAs, making them ineffective against more sophisticated scrapers.

4. Honey Pot Links

Unlike humans, scrapers can see hidden elements on a web page, such as concealed form fields and links. Security teams and web designers can add these to web pages and then act on any transactions performed against them, for example by forwarding the scraper to a honeypot or serving incomplete results.

5. Require All Users to Be Authenticated

Most scraping occurs without authentication, making it difficult to enforce access limits. Requiring all users to authenticate before requesting data improves control: less motivated scrapers may not bother creating accounts, while sophisticated scrapers may resort to fake account creation. F5 Labs has published an entire article series focusing on fake account creation bots. These skilled scrapers distribute their data requests across many fake accounts so that each account stays within its request limits. Even so, requiring authentication discourages less motivated scrapers and improves data security.

6. Cookie/Device Fingerprint-Based Controls

Cookie-based tracking or device/TLS fingerprinting can be used to identify individual clients and limit their requests without adding friction for legitimate users, but these controls cannot be applied reliably to all users. Challenges include cookie deletion, fingerprint collisions (different clients sharing a fingerprint), and fingerprint divisions (one client producing multiple fingerprints). Advanced scrapers using tools like Browser Automation Studio (BAS) have anti-fingerprinting capabilities, including fingerprint switching, that can help them bypass these types of controls.

7. WAF-Based Blocks and Rate Limits (UA and IP)

Web Application Firewalls (WAFs) manage scrapers through rules based on user agent strings, headers, and IP addresses. These rules are ineffective against sophisticated scrapers, who present common user agent strings and header orders and spread their requests across large numbers of IP addresses.
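To make UA- and IP-based rate limiting concrete, here is a minimal sketch in Python of a sliding-window limiter keyed on the client IP and user agent pair. The window length, request threshold, and the is_allowed helper are illustrative assumptions rather than the behavior of any particular WAF product, which would typically implement rules like this natively.

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds; a real WAF policy would tune these per endpoint.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

# Request timestamps tracked per (client IP, user agent) pair.
_request_log = defaultdict(deque)


def is_allowed(client_ip: str, user_agent: str) -> bool:
    """Return True if the client is under the rate limit, False if it should be blocked."""
    now = time.time()
    window = _request_log[(client_ip, user_agent)]

    # Drop timestamps that have aged out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False  # over the limit: block, challenge, or just log

    window.append(now)
    return True


if __name__ == "__main__":
    # A scraper hammering one endpoint from a single IP/UA pair trips the limit quickly.
    decisions = [is_allowed("203.0.113.7", "python-requests/2.31") for _ in range(200)]
    print(f"Blocked {decisions.count(False)} of {len(decisions)} requests")
```

As the section above notes, sophisticated scrapers rotate common user agents and spread requests across large IP pools so that each key stays under the threshold, which is exactly why per-UA and per-IP limits on their own are only moderately effective.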
8. Basic Bot Defense

Basic bot defense solutions use JavaScript, CAPTCHA, device fingerprinting, and user behavior analytics to identify scrapers, and also draw on IP reputation and geo-blocking. However, they do not obfuscate, encrypt, or randomize their signals collection scripts, making them easy for sophisticated scrapers to reverse engineer. These solutions can be bypassed using new-generation automation tools like BAS and Puppeteer, or by using high-quality proxy networks with high-reputation IP addresses. Advanced scrapers can easily craft spoofed packets to bypass the defense system.

9. Advanced Bot Defense

Advanced, enterprise-grade bot defense solutions use randomized, obfuscated signals collection and tamper protection to prevent reverse engineering, and combine encryption with machine learning (ML) to build robust detection and mitigation systems. These solutions are effective against sophisticated scrapers, including AI companies, and adapt to changing automation techniques, providing long-term protection against both identified and unidentified scrapers.

Scraper Management Methods/Controls Comparison and Evaluation

Table 1 (below) evaluates scraper management methods and controls, providing a rating (out of 5) for each, with higher scores indicating a more effective control.

| Control | Pros | Cons | Rating |
| --- | --- | --- | --- |
| Robots.txt | Cheap; easy to implement; effective against ethical bots | No enforcement; ignored by most scrapers | 1 |
| Application redesign | Cheap | Not always feasible due to business need | 1.5 |
| CAPTCHA | Cheap; easy to implement | Adds user friction and hurts conversion rates; solvable by more sophisticated scrapers | 1.5 |
| Honey pot links | Cheap; easy to implement | Easily bypassed by more sophisticated scrapers | 1.5 |
| Require authentication | Cheap; easy to implement; effective against less motivated scrapers | Not always feasible due to business need; results in a fake account creation problem | 1.5 |
| Cookie/fingerprint based controls | Cheaper than other solutions; easier to implement; effective against low sophistication scrapers | High risk of false positives from collisions; ineffective against medium to high sophistication scrapers | 2 |
| Web Application Firewall | Cheaper than other solutions; effective against low to medium sophistication scrapers | High risk of false positives from UA, header, or IP based rate limits; ineffective against high sophistication scrapers | 2.5 |
| Basic bot defense | Effective against low to medium sophistication scrapers | Relatively expensive; ineffective against high sophistication scrapers; poor long term efficacy; complex to implement and manage | 3.5 |
| Advanced bot defense | Effective against the most sophisticated scrapers; long term efficacy | Expensive; complex to implement and manage | 5 |

Conclusion

There are many methods of identifying and managing scrapers, as highlighted above, each with its pros and cons. Advanced bot defense solutions, though costly and complex, are the most effective against all levels of scraper sophistication. To read the full article in its entirety, including more detail on all the management options described here, head over to our post on F5 Labs.

How to Identify and Manage Scrapers (Pt. 1)
Introduction

The latest addition to our scraper series focuses on how to identify and manage scrapers, and we are splitting the article into two parts. Part one outlines ways to identify and detect scrapers, while part two covers tactics to help manage them.

How to Identify Scraping Traffic

The first step in identifying scraping traffic is detection, and the right detection method depends on the scraper's motivations and approach. Some scrapers, like benign search bots, self-identify in the hope that network and security teams will permit their traffic. Others, like AI companies, competitors, and malicious scrapers, hide themselves, making detection difficult and requiring more sophisticated approaches.

Self-Identifying Scrapers

Several scrapers announce themselves and are therefore very easy to identify. These bots self-identify using the HTTP user agent string, either because they have explicit permission or because they believe they provide a valuable service. They generally fall into three categories:

- Search engine bots/crawlers
- Performance or security monitoring
- Archiving

Several scraper operators publish detailed documentation on their scrapers, including how to identify them, the IP addresses they use, and how to opt out. It's crucial to review these documents for any scrapers of interest, because unscrupulous scrapers often impersonate well-known ones, and operators frequently provide tools to verify whether a given scraper is real or an imposter. Links to this documentation, along with screenshots, are provided in our full blog on F5 Labs.

Many scrapers identify themselves via the user agent string, usually by appending a string that contains one or more of the following:

- The name of the company, service, or tool doing the scraping
- A website address for the company, service, or tool doing the scraping
- A contact email for the administrator of the entity doing the scraping
- Other text explaining what the scraper is doing or who they are

A key way to find self-identifying scrapers is to search the user-agent field in your server logs for specific strings. Table 1 below outlines common strings you can look for.

Table 1: Search strings to find self-identifying scrapers (* is a wildcard)

| Self-identification method | Search string |
| --- | --- |
| Name of the tool or service | *Bot * or *bot* |
| Website address | *www* or *.com* |
| Contact email | *@* |

Examples of User Agent Strings

OpenAI search bot user agent string:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot

Bing search bot user agent string:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/

These scrapers include both the name of the tool or service and its website address in the user agent string, so they can be identified using two of the methods highlighted in Table 1 above.
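As a small illustration of the Table 1 approach, the sketch below scans a web server access log and reports user agent values matching those search strings. The log path and the assumption that the user agent is the last quoted field (as in the common Apache/NGINX combined log format) are illustrative; adapt the parsing to whatever format your servers actually emit.

```python
import re
from pathlib import Path

# Wildcard searches from Table 1, expressed as case-insensitive regexes.
SELF_ID_PATTERNS = {
    "tool or service name": re.compile(r"bot", re.IGNORECASE),
    "website address": re.compile(r"www\.|\.com", re.IGNORECASE),
    "contact email": re.compile(r"@"),
}

# Assumed combined log format: the user agent is the last quoted field on the line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def find_self_identifying(log_path: str) -> dict[str, set[str]]:
    """Return the distinct user agent strings that match each self-identification pattern."""
    hits: dict[str, set[str]] = {name: set() for name in SELF_ID_PATTERNS}
    for line in Path(log_path).read_text(errors="replace").splitlines():
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1)
        for name, pattern in SELF_ID_PATTERNS.items():
            if pattern.search(user_agent):
                hits[name].add(user_agent)
    return hits


if __name__ == "__main__":
    # "access.log" is an assumed path for illustration.
    for method, agents in find_self_identifying("access.log").items():
        print(f"{method}: {len(agents)} distinct user agents")
```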
Impersonation

Because user agents are self-reported, they are easily spoofed: any scraper can pretend to be a known entity like Googlebot simply by presenting the Googlebot user agent string. We have observed countless examples of fake bots impersonating large, well-known scrapers like Google, Bing, and Facebook. As one example, Figure 1 shows the traffic overview of a fake Google scraper bot that was responsible for almost a hundred thousand requests per day against a large US hotel chain's room search endpoints. The bot used the following user agent string, which is identical to the one used by the real Googlebot:

Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) (765362)

IP-based Identification

Scrapers can also be identified by their IP addresses. Whois lookups on a scraper's IP address can reveal the organization that owns it and its registered ASN; while this does not reveal the identity of the actual entity behind the traffic, it can be useful in certain cases. Geolocation information can likewise point to automated scraping activity. Reverse DNS lookups use the Domain Name System (DNS) to find the domain name associated with an IP address, which helps establish who the scraper is; free online reverse DNS lookup services make this easy. Since spoofing an IP address is non-trivial, identifying and allowlisting scrapers by IP address is more secure than relying on user agents alone.
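One widely documented way to separate genuine search engine crawlers from impostors combines the checks above: reverse-resolve the client IP, confirm the hostname falls under the operator's published crawler domain, then forward-resolve that hostname and confirm it maps back to the same IP. The sketch below uses only the Python standard library; the per-bot domain list is a simplified assumption and should be taken from each operator's own documentation.

```python
import socket

# Simplified examples of operator-documented crawler domains (check each
# operator's own documentation for the authoritative list).
VERIFIED_BOT_DOMAINS = {
    "googlebot": ("googlebot.com", "google.com"),
    "bingbot": ("search.msn.com",),
}


def verify_crawler(client_ip: str, claimed_bot: str) -> bool:
    """Return True only if reverse and forward DNS agree the IP belongs to the claimed bot."""
    domains = VERIFIED_BOT_DOMAINS.get(claimed_bot.lower())
    if not domains:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)  # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not hostname.endswith(tuple("." + d for d in domains)):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the original IP.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return client_ip in forward_ips


if __name__ == "__main__":
    # A spoofed "Googlebot" coming from an unrelated IP will fail this check.
    print(verify_crawler("66.249.66.1", "googlebot"))   # example IP from Google's published crawler ranges
    print(verify_crawler("203.0.113.50", "googlebot"))  # likely an impostor
```

Google, Bing, and other major operators document this reverse-then-forward DNS procedure for verifying their crawlers, which is why IP-based verification is a stronger signal than the user agent alone.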
Artificial Intelligence (AI) Scrapers

Artificial intelligence companies are increasingly scraping the internet to train their models, causing a surge in data scraping. The data is often used to power for-profit AI services that sometimes compete with the scraping victims, and several lawsuits against these companies are currently underway. A California class-action lawsuit has been filed by 16 claimants against OpenAI, alleging copyright infringement due to the scraping and use of their data for model training. The sensitivity around AI companies scraping data from the internet has had a few consequences:

- Growing scrutiny has forced these companies to start publishing details of their scraping activity, including how to identify their AI scrapers and how to opt out of having your applications scraped.
- AI companies have seen an increase in opt-outs from AI scraping, leaving some unable to access the data needed to power their apps.
- Some less ethical AI companies have since set up alternative "dark scrapers" that do not self-identify and instead secretly continue to scrape the data needed to power their AI services.

Unidentified Scrapers

Most scrapers don't identify themselves or request explicit permission, leaving application, network, and security teams unaware of their activities on web, mobile, and API applications. Identifying these scrapers can be challenging, but below are two techniques we have used in the past that can help identify the organization or actors behind them. To view additional techniques along with an in-depth explanation of each, head over to our blog post on F5 Labs.

1. Requests for Obscure or Non-Existent Resources

Scrapers crawl obscure or low-volume pages and request resources such as flight availability and pricing, constructing the requests themselves and sending them directly to the airline's origin servers. Figure 2 shows an example of a scraper that was scraping an airline's flights and requesting flights to and from a train station.

2. IP Infrastructure Analysis, Use of Hosting Infra or Corporate IP Ranges (Geo Location Matching)

Scrapers distribute traffic across proxy networks or botnets to avoid IP-based rate limits, but the infrastructure they use can itself give them away. Tell-tale patterns include:

- Round-robin IP or UA usage
- Use of hosting IPs
- Use of low-reputation IPs
- Use of international IPs that do not match expected user locations

The following are additional signals to keep in mind when trying to identify scrapers; a rough sketch of the second one follows this list, and we provide an in-depth overview of each in our full article on F5 Labs.

- Conversion or look-to-book analysis
- Not downloading or fetching images and dependencies, just data
- Behavior/session analysis
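To give a rough sense of the second signal, the sketch below groups parsed access-log records by session and flags sessions that issue many data or search requests but never fetch images, scripts, or stylesheets. The record schema, thresholds, and path conventions are assumptions made for illustration; in practice these signals come from your own logging pipeline and are weighed alongside the other techniques above.

```python
from collections import defaultdict

# Extensions a normal browser session almost always fetches alongside pages.
STATIC_EXTENSIONS = (".png", ".jpg", ".gif", ".svg", ".css", ".js", ".woff2", ".ico")
MIN_DATA_REQUESTS = 50  # illustrative threshold for "high-volume" sessions


def suspicious_sessions(records):
    """records: iterable of dicts with 'session_id' and 'path' keys (assumed log schema)."""
    data_hits = defaultdict(int)
    static_hits = defaultdict(int)

    for record in records:
        session = record["session_id"]
        if record["path"].lower().endswith(STATIC_EXTENSIONS):
            static_hits[session] += 1
        else:
            data_hits[session] += 1

    # Flag sessions with heavy data access and no static-asset fetches at all.
    return [
        session
        for session, count in data_hits.items()
        if count >= MIN_DATA_REQUESTS and static_hits[session] == 0
    ]


if __name__ == "__main__":
    # Toy example: one browser-like session and one scraper-like session.
    sample = (
        [{"session_id": "human-1", "path": "/search"}] * 10
        + [{"session_id": "human-1", "path": "/static/app.js"}] * 5
        + [{"session_id": "scraper-1", "path": f"/api/search?page={i}"} for i in range(80)]
    )
    print(suspicious_sessions(sample))  # expect ['scraper-1']
```

A look-to-book style check works the same way: compare the number of searches or quotes in a session to the number of bookings or conversions, and investigate sessions whose ratio is far outside the norm for real customers.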
Conclusion

We discussed two methods above that can help in identifying a scraper. Keep in mind, however, that it's crucial to take into account the type of scraper and the sort of data it is targeting in order to identify it correctly. To read the full article on identifying scrapers, which includes more identification methods, head on over to our post on F5 Labs. Otherwise, continue on to part two, where we outline tactics to help manage scrapers.

What Are Scrapers and Why Should You Care?

Introduction

Scrapers are automated tools designed to extract data from websites and APIs for various purposes, and they pose significant threats to organizations of all sizes. They can lead to intellectual property theft, competitive advantage erosion, website/API performance degradation, and legal liabilities. Scraping is listed among the OWASP Automated Threats to Web Applications, defined as using automation to collect application content and/or other data for use elsewhere. It impacts businesses across many industries, and its legal status varies by geographic and legal jurisdiction.

What is Scraping?

Scraping involves requesting web pages, loading them, and parsing the HTML to extract the desired data and content. Examples of heavily scraped items include:

- Flights
- Hotel rooms
- Retail product prices
- Insurance rates
- Credit and mortgage interest rates
- Contact lists
- Store locations
- User profiles

Scrapers use automation to make many smaller requests and piece the data together, often with tens of thousands or even millions of individual requests. In the 2024 Bad Bots Review by F5 Labs, scraping bots were responsible for high levels of automation on two of the three most targeted flows, Search and Quotes, throughout 2023 across the entire F5 Bot Defense network (see Figure 1). In addition, on applications without advanced bot defense solutions, up to 70% of all search traffic originates from scrapers; this percentage is based on the numerous proof-of-concept analyses done for enterprises with no advanced bot controls in place.

Scraper versus Crawler or Spider

Scrapers differ from crawlers or spiders in that they are designed primarily to take data and content from a website or API. Crawlers and spiders are used to index websites for search engines, whereas scrapers extract and exfiltrate data and content that can then be reused, resold, and otherwise repurposed as the scraper intends. Scraping typically violates the terms and conditions of most websites and APIs, although its legal standing is still being tested in the courts, with some rulings later overturned. Most scrapers target information on the web, but activity against APIs is on the rise.

Business Models for Scraping

There are many different parties active in the scraping business, each with different business models and incentives for scraping content and data. Figure 2 provides an overview of the various sources of scraping activity.

Search engine companies, such as Google, Bing, Facebook, Amazon, and Baidu, index content from websites to help users find things on the internet; their business model is selling ads placed alongside search results.

Competitors scrape content and data from each other to win customers, market share, and revenue. Competitor scraping takes several forms, including competitive pricing, network scraping, and inventory scraping. Competitors scrape the pricing and availability of rival products to win increased market share. Network scraping involves scraping the names, addresses, and contact details of a company's network partners, such as repair shops, doctors, hospitals, clinics, insurance agents, and brokers. Inventory scraping involves stealing valuable content and data from a competing site for use on the scraper's own site. Beyond competitors, other active parties include researchers and investment firms, intellectual property owners, data aggregators, news aggregators, and AI companies.
Researchers and investment firms use scraping to gather data for their research and generate revenue by publishing and selling the results of their market research. Intellectual property owners use scraping to identify possible trademark or copyright infringements and to ensure compliance with pricing and discounting guidelines. Data aggregators collect and aggregate data from various sources and sell it to interested parties; some specialize in specific industries, while others use scrapers to pull news feeds, blogs, articles, and press releases from various websites and APIs. Artificial intelligence (AI) companies scrape data across many industries, often without identifying themselves, and as the AI space continues to grow, scraping traffic is expected to increase.

Criminal organizations often scrape websites and applications for various malicious purposes, including phishing, vulnerability scanning, identity theft, and intermediation. Criminals use scrapers to create replicas of a victim's website or app that trick users into providing personally identifiable information (PII). They also use scrapers to probe for vulnerabilities in the website or application, such as flaws that allow them to access discounted rates or back-end systems.

Costs of Scraping

Direct costs of scraping include infrastructure costs, degraded server performance and outages, loss of revenue and market share, and intermediation. Companies prefer direct relationships with their customers for selling and marketing, customer retention, cross-selling and upselling, and customer experience; when intermediaries insert themselves, companies can lose control over the end-to-end customer experience, leading to dissatisfied customers. Indirect costs include the loss of investment, intellectual property theft, reputational damage, legal liability, and exposure to questionable practices. Together, these costs translate into lost revenue, profits, market share, and customer satisfaction.

Conclusion

Scraping is a significant issue that affects enterprises worldwide across many industries. F5 Labs' research shows that almost 1 in 5 search and quote transactions are generated by scrapers. Scraping is carried out by a wide range of entities, including search engines, competitors, AI companies, and malicious third parties, and its costs show up as lost revenue, profits, market share, and customer satisfaction. For a deeper dive into the impact of scraping on enterprises and effective mitigation strategies, read the full article on F5 Labs.

SIS March 2024: TP-Link Archer AX21 Wifi Router targeting, plus a handful of new CVEs!
The March 2024 Sensor Intelligence Series report highlights a significant surge in scanning activity for the vulnerability CVE-2023-1389 and also notes that most of the scanning traffic originates from two ASNs, suggesting a concentrated effort from specific sources.

This Month In Security for October, 2022
This Month In Security is a partnership between F5 Security Incident Response Team's AaronJB (Aaron Brailsford), F5 Labs' David Warburton and Tafara Muwandi, and F5 DevCentral's AubreyKingF5. This month's news includes some Supply Chain Security, Guidance from CISA and a worrisome UEFI Bootkit.

F5 Labs Publishes October Update to Sensor Intel Series
F5 Labs just launched the October installment in our growing Sensor Intel Series. The sensors in question come from our data partners Efflux, and allow us to get a sense of what kinds of vulnerabilities attackers are targeting from month to month. In September, the top-targeted vulnerability was CVE-2018-13379, a credential disclosure vulnerability in various versions of two Fortinet SSL VPNs. While nobody likes to see security tools with vulnerabilities, it is a change from the PHP remote code execution and IoT vulnerabilities that have made up the bulk of the targeting traffic over the last several months. We've also debuted a new visualization type for all 41 tracked vulnerabilities, making it a little easier to identify vulnerabilities with dramatic changes in targeting volume. At various times in the last nine months, CVE-2017-18368, CVE-2022-22947, and the vulnerabilities CVE-2021-22986 and CVE-2022-1388 (which are indistinguishable without examining headers in the HTTP request) have all shown growth rates at or near three orders of magnitude over a period of six to eight weeks, making them the fastest growing vulnerabilities since we started this project. Stay tuned for the publication of the October SIS in early November. We are always looking for new CVEs to add and new ways to visualize the attack data.

The State of the State of Application Exploits in Security Incidents (SoSo Report)
Cybersecurity is always about perspective, and that is doubly true when talking about the rapidly changing field of application security. With The State of the State of Application Exploits in Security Incidents, F5 Labs & Cyentia Institute provide a more complete view of the application security elephant. We examine published industry reports from multiple sources for a better understanding of the frequency and role of application exploits. So, let's start the clock to learn more about the affectionately named SoSo Report. Get your copy at F5 Labs.

F5 Labs Report: Cybersecurity Compliance Failures in Financial Services
One important piece of the 2021 Application Protection Report revealed that, of all breaches studied in 2020, the financial sector had the dubious honor of the highest percentage: 17 percent. With breaches comes increased regulatory attention. In 2017, New York's Department of Financial Services (NYDFS) enacted the 23 NYCRR Part 500 regulations, calling out explicit cybersecurity requirements for financial services firms. Since then, three financial services organizations that were breached have faced sobering consequences for failing to meet the NYDFS law. This in-depth article looks at each of those breaches in greater detail. Check out Cybersecurity Compliance Failures in Financial Services on F5 Labs.