F5 Labs
What Are Scrapers and Why Should You Care?
Introduction

Scrapers are automated tools designed to extract data from websites and APIs for a variety of purposes, and they pose significant threats to organizations of all sizes. They can lead to intellectual property theft, erosion of competitive advantage, degraded website and API performance, and legal liability. OWASP lists scraping among its automated threats to web applications, defining it as the use of automation to collect application content and/or other data for use elsewhere. Scraping affects businesses across many industries, and its legal status varies by geographic and legal jurisdiction.

What is Scraping?

Scraping involves requesting web pages, loading them, and parsing the HTML to extract the desired data and content (a minimal sketch of this request-and-parse loop appears below). Examples of heavily scraped items include:

Flights
Hotel rooms
Retail product prices
Insurance rates
Credit and mortgage interest rates
Contact lists
Store locations
User profiles

Scrapers use automation to make many smaller requests and assemble the data piece by piece, often issuing tens of thousands or even millions of individual requests. In the 2024 Bad Bots Review, F5 Labs found that scraping bots were responsible for high levels of automation on two of the three most targeted flows, Search and Quotes, throughout 2023 across the entire F5 Bot Defense network (see Figure 1 in the full article). In addition, on sites without advanced bot defense solutions, up to 70% of all search traffic originates from scrapers. This percentage is based on the numerous proof-of-concept analyses performed for enterprises with no advanced bot controls in place.

Scraper versus Crawler or Spider

Scrapers differ from crawlers or spiders in that they are designed primarily to extract data and content from a website or API, whereas crawlers and spiders are used to index websites for search engines. Scrapers extract and exfiltrate data and content, which can then be reused, resold, or otherwise repurposed as the scraper intends. Scraping typically violates the terms and conditions of websites and APIs, though its legal status remains unsettled, with some court cases overturning previous rulings. Most scrapers target information on the web, but activity against APIs is on the rise.
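To make the request-and-parse loop described under "What is Scraping?" concrete, here is a minimal sketch of a price scraper. It is an illustration only: the URL, CSS selectors, and page structure are hypothetical, and real scrapers layer pagination, proxy rotation, and retry logic on top of this basic loop.

```python
# Minimal illustration of the scraping loop: request a page, parse the HTML,
# extract structured data. The URL and selectors are hypothetical examples.
import requests
from bs4 import BeautifulSoup


def scrape_prices(url: str) -> list[dict]:
    # Fetch the page; many scrapers spoof a browser User-Agent here.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()

    # Parse the HTML and pull out each product's name and price.
    soup = BeautifulSoup(response.text, "html.parser")
    items = []
    for product in soup.select("div.product"):  # hypothetical markup
        name = product.select_one("span.name")
        price = product.select_one("span.price")
        if name and price:
            items.append({"name": name.get_text(strip=True),
                          "price": price.get_text(strip=True)})
    return items


if __name__ == "__main__":
    # Hypothetical target; a real scraper iterates over many such URLs.
    for item in scrape_prices("https://example.com/products"):
        print(item)
```

Multiplied across thousands of URLs and distributed over many IP addresses, this simple loop is what produces the request volumes described above.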
Business Models for Scraping

There are many different parties active in the scraping business, each with its own business model and incentives for scraping content and data. Figure 2 in the full article provides an overview of the various sources of scraping activity.

Search engine companies, such as Google, Bing, Facebook, Amazon, and Baidu, index content from websites to help users find things on the internet; their business model is selling ads placed alongside search results.

Competitors scrape content and data from each other to win customers, market share, and revenue. They scrape the pricing and availability of competing products to offer competitive prices and gain market share. Network scraping involves scraping the names, addresses, and contact details of a company's network partners, such as repair shops, doctors, hospitals, clinics, insurance agents, and brokers. Inventory scraping involves stealing valuable content and data from a competing site for use on the scraper's own site.

Researchers and investment firms use scraping to gather data for their research and generate revenue by publishing and selling the results of their market research. Intellectual property owners use scraping to identify possible trademark or copyright infringements and to ensure compliance with pricing and discounting guidelines. Data aggregators collect and aggregate data from various sources and sell it to interested parties; some specialize in specific industries, while news aggregators use scrapers to pull news feeds, blogs, articles, and press releases from various websites and APIs.

Artificial intelligence (AI) companies scrape data across various industries, often without identifying themselves. As the AI space continues to grow, scraping traffic is expected to increase.

Criminal organizations often scrape websites or applications for malicious purposes, including phishing, vulnerability scanning, identity theft, and intermediation. Criminals use scrapers to create replicas of a victim's website or app that trick users into providing personally identifiable information (PII). They also use scrapers to probe websites and applications for vulnerabilities, such as flaws that let them access discounted rates or back-end systems.

Costs of Scraping

Direct costs of scraping include infrastructure costs, degraded server performance and outages, loss of revenue and market share, and intermediation. Companies prefer direct relationships with customers for selling and marketing, customer retention, cross-selling and upselling, and customer experience; when intermediaries insert themselves, companies can lose control over the end-to-end customer experience, leading to dissatisfied customers. Indirect costs include loss of investment, loss of intellectual property, reputational damage, legal liability, and exposure to questionable practices. Taken together, scraping can lead to a loss of revenue, profits, market share, and customer satisfaction.

Conclusion

Scraping is a significant issue that affects enterprises worldwide across many industries. F5 Labs' research shows that almost 1 in 5 search and quote transactions are generated by scrapers. Scraping is carried out by a wide range of entities, including search engines, competitors, AI companies, and malicious third parties, and its costs translate into lost revenue, profits, market share, and customer satisfaction. For a deeper dive into the impact of scraping on enterprises and effective mitigation strategies, read the full article on F5 Labs.
This Month In Security for October, 2022

This Month In Security is a partnership between the F5 Security Incident Response Team's AaronJB (Aaron Brailsford), F5 Labs' David Warburton and Tafara Muwandi, and F5 DevCentral's AubreyKingF5. This month's news includes some supply chain security, guidance from CISA, and a worrisome UEFI bootkit.
F5 Labs Publishes October Update to Sensor Intel Series

F5 Labs just launched the October installment in our growing Sensor Intel Series. The sensors in question come from our data partner Efflux and allow us to get a sense of what kinds of vulnerabilities attackers are targeting from month to month. In September, the top-targeted vulnerability was CVE-2018-13379, a credential disclosure vulnerability in various versions of two Fortinet SSL VPNs. While nobody likes to see security tools with vulnerabilities, it is a change from the PHP remote code execution and IoT vulnerabilities that have made up the bulk of the targeting traffic over the last several months. We've also debuted a new visualization type for all 41 tracked vulnerabilities, making it a little easier to identify vulnerabilities with dramatic changes in targeting volume. At various times in the last nine months, CVE-2017-18368, CVE-2022-22947, and the vulnerabilities CVE-2021-22986 and CVE-2022-1388 (which are indistinguishable without examining headers in the HTTP request) have all shown growth rates at or near three orders of magnitude over a period of six to eight weeks, making them the fastest-growing vulnerabilities since we started this project. Stay tuned for the publication of the October SIS in early November. We are always looking for new CVEs to add and new ways to visualize the attack data.
Supplement To The 2021 App Protect Report

We frequently get requests to break down threats in a specific vertical. So, as a follow-up to the F5 Labs 2021 Application Protection Report (APR), we analyzed and visualized the attack chains of more than 700 data breaches, looking for relationships between sectors or industries and the tactics and techniques attackers use against them. This effort produced the F5 Labs 2021 APR Supplement: Of Sectors and Vectors, where we found that while there are some attack patterns that correspond with sectors, the relationships appear indirect and partial, and counterexamples abound. The overall conclusion is that sectors can be useful for predicting an attack vector, but only in the absence of more precise information such as vulnerabilities or published exploits. This is because the types of data and vulnerabilities in the target environment, which determine an attacker's approach, are no longer tightly correlated with the nature of the business. Look for more details about your sector (Finance, Education, Health Care, Scientific, Retail, etc.) in the F5 Labs 2021 APR Supplement: Of Sectors and Vectors.
What is Quantum Computing?

Quantum computing represents a significant shift in information processing. It leverages the principles of quantum mechanics to solve problems far beyond the capabilities of classical computers. Unlike classical computers, which use bits to represent either 0 or 1, quantum computers use qubits, which can exist in multiple states simultaneously through superposition. Additional quantum properties like entanglement and quantum interference further enhance computational efficiency, making quantum systems uniquely equipped to tackle complex, otherwise intractable problems.

This breakthrough has profound implications for cryptography. Many classical cryptosystems, such as RSA and ECC, rely on mathematical problems that are easy to compute but difficult to reverse without a secret key. Quantum algorithms like Shor's algorithm can solve these problems quickly, making traditional encryption vulnerable to quantum-based attacks. Similarly, Grover's algorithm speeds up brute-force searches, halving the effective security of symmetric cryptographic algorithms like AES (a worked example appears at the end of this summary).

Quantum computing has created the need for new cryptographic systems designed to protect against attacks from quantum computers. Notably, these systems don't require quantum properties themselves; instead, they employ mathematical techniques that are robust against quantum algorithms. For example, lattice-based cryptography is considered one of the most promising approaches for ensuring future-proof security.

As quantum computing capabilities progress, experts warn that classical encryption methods may soon reach the end of their "cryptographic cover time," the duration during which encrypted data remains secure. Data intercepted today could be decrypted retroactively by adversaries once quantum capabilities mature, a concept referred to as "harvest now, decrypt later." This underscores the urgency of transitioning to quantum-resistant technologies. Post-quantum cryptographic algorithms, combined with hybrid approaches in protocols like TLS, can protect sensitive communications from future quantum threats. Given estimates that functional quantum computers capable of breaking RSA-2048 could emerge within the next decade, governments and organizations are advised to begin implementing these technologies now to ensure long-term data security. For a deeper exploration of quantum computing and its cryptographic implications, read the full F5 Labs article.
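As a rough, back-of-the-envelope illustration of the Grover halving mentioned above (a textbook approximation, not an F5 Labs calculation), searching a k-bit symmetric keyspace takes about the square root of the classical effort on a quantum computer, so the effective key strength is roughly halved:

```latex
% Classical brute force over a k-bit keyspace versus Grover's quadratic speedup.
\[
  \underbrace{O\!\left(2^{k}\right)}_{\text{classical exhaustive search}}
  \;\longrightarrow\;
  \underbrace{O\!\left(\sqrt{2^{k}}\right) = O\!\left(2^{k/2}\right)}_{\text{Grover's algorithm}}
\]
% Hence AES-128 offers roughly 64-bit effective security against a quantum
% adversary, while AES-256 retains roughly 128-bit effective security, which
% is why doubling symmetric key lengths is the usual post-quantum advice.
\[
  \text{AES-128: } 2^{128} \rightarrow 2^{64}
  \qquad\qquad
  \text{AES-256: } 2^{256} \rightarrow 2^{128}
\]
```

By contrast, Shor's algorithm breaks RSA and ECC outright rather than merely weakening them, which is why those schemes must be replaced with post-quantum alternatives rather than simply re-keyed with longer keys.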
F5 Labs Top CWEs, CWE OWASP Top Ten Analysis, & May 2025 CVE Trends

For May's vulnerability analysis (https://www.f5.com/labs/articles/threat-intelligence/f5-labs-top-cwes-owasp-top-ten-analysis), we examine the ten most-targeted CVEs, highlighting notable shifts and ongoing trends in exploitation activity. We also analyze a year's worth of targeted CVE traffic through the lens of primary Common Weakness Enumerations (CWEs) and the OWASP Top Ten categories.
Understanding The TikTok Ban, Salt Typhoon and More | AppSec Monthly January Ep.27

In this episode of AppSec Monthly, our host MegaZone is joined by m_heath, Merlyn Albery-Speyer, and AubreyKingF5 as they dive into the latest cybersecurity news. We explore the complexities of the TikTok ban, the impact of geopolitical decisions on internet freedom, and the nuances of data sovereignty. Our experts also discuss the implications of recent breaches by Chinese state actors and the importance of using end-to-end encrypted apps to protect your data. Additionally, we shed light on the fascinating history of internet control and how it continues to evolve with emerging technologies. Stay tuned until the end for insights on the upcoming VulnCon 2025 and how you can participate. Don't forget to subscribe for more AppSec insights!
Continued Intense Scanning From One IP in Lithuania

Welcome to the September 2024 installment of the Sensor Intelligence Series (SIS), our monthly summary of vulnerability intelligence based on distributed passive sensor data. Below are a few key takeaways from this month's summary:

Scanning for CVE-2017-9841 dropped by 10% (vs. August).
CVE-2023-1389 continues to be the most scanned CVE we track, with a 400% increase over August.
One IP address continues to be the most active we observe, accounting for 43% of overall scanning traffic.
We see a spike in scanning of CVE-2023-25157, a critical vulnerability in the GeoServer software project.

CVE Scanning

Following on from last month's analysis, scanning of CVE-2017-9841 has decreased by 10% compared to August and is down 99.8% from its high-water mark in June of 2024, nearly vanishing from our visualizations. CVE-2023-1389, an RCE vulnerability in TP-Link Archer AX21 routers, has been the most scanned CVE for the last two months, increasing 400% over August. While this sort of swing in volume may seem remarkable, as we have noted before, it is not unusual when we analyze the shape of the scanning for a particular CVE over time.

Following Up on an Aberration

Last month, we identified a pattern of scanning activity coming from a specific IPv4 address (141.98.11.114), suspected to be the BotPoke scanner. Despite a slight decrease in scanning traffic, this IP continued to target the same URIs and the same regions where our sensors are located, accounting for 43% of the overall scanning traffic observed.

A Brief Note on Malware Stagers Observed

Because our passive sensors do not respond to traffic, our ability to predict secondary actions after successful exploitation is limited. However, we can show that exploitation attempts for some CVEs try to download malware stagers. To view an example of the most common URL observed in September attempting to exploit CVE-2023-1389, visit F5 Labs to read the full summary.

September Vulnerabilities by the Numbers

Figure 1 shows September attack traffic for the top ten CVEs, with CVE-2023-1389 dominating; the increased scanning for this vulnerability throws off the proportionality of this view, so see the logarithmic scale in Figure 3 for an easier read. Figure 2 shows a significant increase in scanning for CVE-2023-1389 over the past year, while the decline in scanning for CVE-2017-9841 persists.

Long-Term Trends

Figure 3 shows the traffic for the top 19 CVEs, with CVE-2017-9841 and CVE-2023-1389 showing the most significant changes, while the average of the other 110 CVEs has fallen dramatically. CVE-2023-25157, a critical vulnerability in the GeoServer software project, has seen a dramatic increase in scanning. The log scale also helps show changes in the other top 10 scanned CVEs.

To find out more about September's CVEs and for recommendations on how to stay ahead of the curve in cybersecurity, check out the full article here. We'll see you next month!
How to Identify and Manage Scrapers (Pt. 1)

Introduction

The latest addition to our scraper series focuses on how to identify and manage scrapers, split across two parts. Part one outlines ways to identify and detect scrapers, while part two covers tactics to help manage them.

How to Identify Scraping Traffic

The first step in identifying scraping traffic is detection, and the appropriate method depends on the scraper's motivations and approach. Some scrapers, like benign search bots, self-identify so that network and security teams can permit them. Others, like AI companies, competitors, and malicious scrapers, hide themselves, making detection difficult; more sophisticated approaches are needed to combat these types of scrapers.

Self-Identifying Scrapers

Several scrapers announce themselves and make it very easy to identify them. These bots self-identify using the HTTP user agent string, either because they have explicit permission to scrape or because they believe they provide a valuable service. They generally fall into three categories:

Search engine bots/crawlers
Performance or security monitoring
Archiving

Several scraper operators publish detailed information about their scrapers, including how to identify them, their IP addresses, and opt-out options. It's crucial to review these documents for scrapers of interest, as unscrupulous scrapers often impersonate known ones. These websites often provide tools to verify whether a scraper is real or an imposter. Links to this documentation, along with screenshots, are provided in our full blog on F5 Labs.

Many scrapers identify themselves via the user agent string, usually by adding text that contains the following:

The name of the company, service, or tool that is doing the scraping
A website address for the company, service, or tool that is doing the scraping
A contact email for the administrator of the entity doing the scraping
Other text explaining what the scraper is doing or who they are

A key way to identify self-identifying scrapers is to search the user-agent field in your server logs for specific strings. Table 1 below outlines common strings you can look for; a short log-search sketch appears after the Impersonation section below.

Table 1: Search strings to find self-identifying scrapers (* is a wildcard)

    Self-identification method        Search string
    Name of the tool or service       *Bot * or *bot*
    Website address                   *www* or *.com*
    Contact email                     *@*

Examples of User Agent Strings

OpenAI search bot user agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot

Bing search bot user agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/

These scrapers include both the name of the tool or service and a website address in the user-agent string, and can be identified using two of the methods highlighted in Table 1 above.

Impersonation

Because user agents are self-reported, they are easily spoofed. Any scraper can pretend to be a known entity like Googlebot by simply presenting the Googlebot user agent string. We have observed countless examples of fake bots impersonating large, known scrapers like Google, Bing, and Facebook. As one example, Figure 1 in the full article shows the traffic overview of a fake Google scraper bot. This scraper was responsible for almost a hundred thousand requests per day against a large US hotel chain's room search endpoints. The bot used the following user-agent string, which is identical to the one used by the real Googlebot:

Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) (765362)
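The following is a minimal sketch, not an F5 tool, of applying the Table 1 search strings to a web server access log. The log path and the assumption that the user agent is the final quoted field of a combined-log-format line are illustrative only, and, as the Impersonation section shows, a matching user agent identifies a claimed bot, not a verified one.

```python
# Minimal sketch: flag requests whose User-Agent matches the Table 1 patterns
# (tool/bot name, website address, or contact email). The log path and field
# layout are hypothetical; adjust the regexes to your own access log format.
import re
from collections import Counter

# Patterns roughly equivalent to the Table 1 wildcards.
SELF_ID_PATTERNS = re.compile(r"bot|www\.|\.com|@", re.IGNORECASE)
# Combined log format ends with "referrer" "user-agent"; capture the last field.
UA_FIELD = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')


def find_self_identifying(log_path: str) -> Counter:
    counts: Counter = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_FIELD.search(line.rstrip())
            if match and SELF_ID_PATTERNS.search(match.group("ua")):
                counts[match.group("ua")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log location; print the most common self-identifying UAs.
    for ua, hits in find_self_identifying("access.log").most_common(20):
        print(f"{hits:8d}  {ua}")
```

Any user agent surfaced this way can then be checked against the operator's published documentation, and against the IP-based techniques described next, before it is trusted.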
IP-based Identification

Scrapers can also be identified via their IP addresses. Whois lookups can reveal the organization behind a scraper's IP address or its registered ASN; while this does not always reveal the identity of the actual entity, it can be useful in certain cases. Geolocation information can also be used to identify automated scraping activity. Reverse DNS lookups help confirm a scraper's identity by using the Domain Name System (DNS) to find the domain name associated with an IP address, and free online reverse DNS lookup services make this easy (a short verification sketch appears before the Conclusion below). Since IP address spoofing is non-trivial, identifying and allowlisting scrapers by IP address is more secure than relying on user agents alone.

Artificial Intelligence (AI) Scrapers

Artificial intelligence companies are increasingly scraping the internet to train models, causing a surge in data scraping. This data is often used for for-profit AI services, which sometimes compete with the scraping victims. Several lawsuits are currently underway against these companies; for example, a California class-action lawsuit has been filed by 16 claimants against OpenAI, alleging copyright infringement due to the scraping and use of their data for model training. Because of the sensitivity around AI companies scraping data from the internet, a few things have happened:

Growing scrutiny has forced these companies to start publishing details of their scraping activity, including ways to identify their AI scrapers and ways to opt your applications out of being scraped.
AI companies have seen an increase in opt-outs from AI scraping, cutting off access to the data needed to power their apps.
Some less ethical AI companies have since set up alternative "dark scrapers" which do not self-identify and instead secretly continue to scrape the data needed to power their AI services.

Unidentified Scrapers

Most scrapers do not identify themselves or request explicit permission, leaving application, network, and security teams unaware of their activities on web, mobile, and API applications. Identifying these scrapers can be challenging, but below are two techniques we have used in the past that can help identify the organization or actors behind them. To view additional techniques, along with an in-depth explanation of each, head over to our blog post on F5 Labs.

1. Requests for Obscure or Non-Existent Resources

Website scrapers crawl obscure or low-volume pages, requesting resources like flight availability and pricing. They construct requests manually and send them directly to airline origin servers. Figure 2 in the full article shows an example of a scraper that was scraping an airline's flights and requesting flights to and from a train station.

2. IP Infrastructure Analysis: Use of Hosting Infrastructure or Corporate IP Ranges (Geolocation Matching)

Scrapers distribute traffic via proxy networks or botnets to avoid IP-based rate limits, but the infrastructure they use can itself give them away. Telltale tactics include:

Round-robin IP or UA usage
Use of hosting IPs
Use of low-reputation IPs
Use of international IPs that do not match expected user locations

The following are additional things to keep in mind when trying to identify scrapers; we provide an in-depth overview of each in our full article on F5 Labs:

Conversion or look-to-book analysis
Not downloading or fetching images and dependencies, but just data
Behavior/session analysis
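Tying the IP-based techniques above together, here is a minimal verification sketch, an illustrative example rather than an F5 tool, for checking whether a request claiming to be Googlebot really is one: resolve the client IP to a hostname, check the hostname's domain, then forward-resolve it to confirm it maps back to the same IP. The sample IP is hypothetical, and the expected domain suffixes for other bots should be taken from each operator's own verification documentation.

```python
# Minimal sketch: verify a claimed crawler with a reverse + forward DNS check.
# A request whose User-Agent says "Googlebot" but whose IP fails this check
# is an impersonator.
import socket


def verify_crawler_ip(ip: str, allowed_suffixes: tuple[str, ...]) -> bool:
    try:
        # Reverse lookup: IP -> hostname (e.g. crawl-66-249-66-1.googlebot.com).
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith(allowed_suffixes):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise the PTR record itself could be forged.
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return ip in addresses
    except socket.herror:    # no PTR record for this IP
        return False
    except socket.gaierror:  # forward lookup of the hostname failed
        return False


if __name__ == "__main__":
    # Hypothetical client IP taken from a request claiming to be Googlebot.
    ip = "66.249.66.1"
    ok = verify_crawler_ip(ip, (".googlebot.com", ".google.com"))
    print(f"{ip} verified as Googlebot: {ok}")
```

The same pattern works for other self-identifying crawlers such as Bingbot, substituting the domain suffixes listed in each operator's verification documentation.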
Conclusion

We discussed two methods above that can be helpful in identifying a scraper. Keep in mind, however, that correctly identifying a scraper requires taking into account the type of scraper and the sort of data it is targeting. To read the full article on identifying scrapers, which includes more identification methods, head on over to our post on F5 Labs. Otherwise, continue on to part two, where we outline tactics to help manage scrapers.