f5 labs
What Are Scrapers and Why Should You Care?
Introduction
Scrapers are automated tools designed to extract data from websites and APIs for a variety of purposes, and they pose significant threats to organizations of all sizes. They can lead to intellectual property theft, erosion of competitive advantage, degraded website and API performance, and legal liability. Scraping is one of the automated threats cataloged by OWASP, defined as using automation to collect application content and/or other data for use elsewhere. It impacts businesses across many industries, and its legal status varies by geographic and legal jurisdiction.

What is Scraping?
Scraping involves requesting web pages, loading them, and parsing the HTML to extract the desired data and content. Examples of heavily scraped items include:
Flights
Hotel rooms
Retail product prices
Insurance rates
Credit and mortgage interest rates
Contact lists
Store locations
User profiles
Scrapers use automation to make many smaller requests and assemble the data piece by piece, often across tens of thousands or even millions of individual requests. In the 2024 Bad Bots Review by F5 Labs, scraping bots were responsible for high levels of automation on two of the three most targeted flows, Search and Quotes, throughout 2023 across the entire F5 Bot Defense network (see Figure 1). In addition, for organizations without advanced bot defense solutions, up to 70% of all search traffic originates from scrapers. This percentage is based on the numerous proof-of-concept analyses done for enterprises with no advanced bot controls in place.
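To make the request-and-parse loop described above concrete, here is a minimal sketch of what a simple scraper does. It is an illustration only, not a reference implementation: the URL and CSS selectors are hypothetical placeholders, and it assumes the third-party requests and BeautifulSoup libraries.

# Minimal sketch of a scraper: fetch a page, parse the HTML, extract fields.
# The URL and the CSS selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_prices(url):
    # Request the page much as a browser would, then parse the returned HTML.
    response = requests.get(url, headers={"User-Agent": "example-scraper/1.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    results = []
    # Pull the product name and price out of each listing on the page.
    for item in soup.select("div.product"):
        name = item.select_one("span.name")
        price = item.select_one("span.price")
        if name and price:
            results.append({"name": name.get_text(strip=True),
                            "price": price.get_text(strip=True)})
    return results

if __name__ == "__main__":
    for row in scrape_prices("https://shop.example.com/products?page=1"):
        print(row)

Looping a script like this over thousands of pages is what produces the request volumes described above.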
Scraper versus Crawler or Spider
Scrapers differ from crawlers or spiders in that they are designed primarily to extract data and content from a website or API, whereas crawlers and spiders are used to index websites for search engines. Scrapers extract and exfiltrate data and content, which can then be reused, resold, or otherwise repurposed as the scraper intends. Scraping typically violates the terms and conditions of most websites and APIs, although its legal standing continues to evolve, with some cases overturning previous rulings. Most scrapers target information on the web, but activity against APIs is on the rise.

Business Models for Scraping
Many different parties are active in the scraping business, each with its own business model and incentives for scraping content and data. Figure 2 provides an overview of the various sources of scraping activity, which include search engines, competitors, researchers and investment firms, intellectual property owners, data aggregators, news aggregators, AI companies, and criminal organizations.
Search engine companies, such as Google, Bing, Facebook, Amazon, and Baidu, index content from websites to help users find things on the internet. Their business model is selling ads placed alongside search results.
Competitors scrape content and data from each other to win customers, market share, and revenue. They scrape the pricing and availability of competing products to win increased market share. Network scraping involves scraping the names, addresses, and contact details of a company's network partners, such as repair shops, doctors, hospitals, clinics, insurance agents, and brokers. Inventory scraping involves stealing valuable content and data from a competing site for use on their own site.
Researchers and investment firms use scraping to gather data for their research and generate revenue by publishing and selling the results of their market research. Intellectual property owners use scraping to identify possible trademark or copyright infringements and to ensure compliance with pricing and discounting guidelines. Data aggregators collect and aggregate data from various sources and sell it to interested parties; some specialize in specific industries, while news aggregators use scrapers to pull news feeds, blogs, articles, and press releases from various websites and APIs. Artificial intelligence (AI) companies scrape data across many industries, often without identifying themselves, and as the AI space continues to grow, scraping traffic is expected to increase.
Criminal organizations scrape websites and applications for various malicious purposes, including phishing, vulnerability scanning, identity theft, and intermediation. Criminals use scrapers to create replicas of a victim's website or app that require users to provide personally identifiable information (PII). They also use scrapers to probe for vulnerabilities in the website or application, such as flaws that let them access discounted rates or back-end systems.

Costs of Scraping
Direct costs of scraping include infrastructure costs, degraded server performance and outages, loss of revenue and market share, and intermediation. Companies prefer direct relationships with their customers for selling and marketing, customer retention, cross-selling and upselling, and the customer experience; when intermediaries insert themselves, companies can lose control over the end-to-end customer experience, leading to dissatisfied customers. Indirect costs include loss of investment, loss of intellectual property, reputational damage, legal liability, and exposure to questionable practices. Together, these costs can mean lost revenue, profits, market share, and customer satisfaction.

Conclusion
Scraping is a significant issue that affects enterprises worldwide across many industries. F5 Labs research shows that almost 1 in 5 search and quote transactions are generated by scrapers, and the scraping is done by a wide range of entities, including search engines, competitors, AI companies, and malicious third parties. Its costs show up as lost revenue, profits, market share, and customer satisfaction. For a deeper dive into the impact of scraping on enterprises and effective mitigation strategies, read the full article on F5 Labs.

This Month In Security for October, 2022
This Month In Security is a partnership between F5 Security Incident Response Team's AaronJB (Aaron Brailsford), F5 Labs' David Warburton and Tafara Muwandi, and F5 DevCentral's AubreyKingF5. This month's news includes some supply chain security, guidance from CISA, and a worrisome UEFI bootkit.

F5 Labs Publishes October Update to Sensor Intel Series
F5 Labs just launched the October installment in our growing Sensor Intel Series. The sensors in question come from our data partner Efflux and allow us to get a sense of what kinds of vulnerabilities attackers are targeting from month to month. In September, the top-targeted vulnerability was CVE-2018-13379, a credential disclosure vulnerability in various versions of two Fortinet SSL VPNs. While nobody likes to see security tools with vulnerabilities, it is a change from the PHP remote code execution and IoT vulnerabilities that have made up the bulk of the targeting traffic over the last several months. We've also debuted a new visualization type for all 41 tracked vulnerabilities, making it a little easier to identify vulnerabilities with dramatic changes in targeting volume. At various times in the last nine months, CVE-2017-18368, CVE-2022-22947, and the vulnerabilities CVE-2021-22986 and CVE-2022-1388 (which are indistinguishable without examining headers in the HTTP request) have all shown growth rates at or near three orders of magnitude over a period of six to eight weeks, making them the fastest-growing vulnerabilities since we started this project. Stay tuned for the publication of the October SIS in early November. We are always looking for new CVEs to add and new ways to visualize the attack data.

Supplement To The 2021 App Protect Report
We frequently get requests to break down threats in a specific vertical. So, as a follow-up to the F5 Labs 2021 Application Protection Report (APR), we analyzed and visualized the attack chains of more than 700 data breaches, looking for relationships between sectors or industries and the tactics and techniques attackers use against them. This effort produced the F5 Labs 2021 APR Supplement: Sectors and Vectors, where we found that while there are some attack patterns that correspond with sectors, the relationships appear indirect and partial, and counterexamples abound. The overall conclusion is that sectors can be useful for predicting an attack vector, but only in the absence of more precise information such as vulnerabilities or published exploits. This is because the types of data and vulnerabilities in the target environment, which determine an attacker's approach, are no longer tightly correlated with the nature of the business. Look for more details about your sector (Finance, Education, Health Care, Scientific, Retail, etc.) in the F5 Labs 2021 APR Supplement: Of Sectors and Vectors.

How to Identify and Manage Scrapers (Pt. 1)
Introduction
The latest addition in our scraper series focuses on how to identify and manage scrapers, split across two parts. Part one outlines ways to identify and detect scrapers, while part two covers tactics to help manage them.

How to Identify Scraping Traffic
The first step in identifying scraping traffic is recognizing that detection methods vary with the scraper's motivations and approach. Some scrapers, like benign search bots, self-identify in order to be granted network and security permission. Others, like AI companies, competitors, and malicious scrapers, hide themselves, making detection difficult and requiring more sophisticated approaches.

Self-Identifying Scrapers
Several kinds of scrapers announce themselves and are very easy to identify. These bots self-identify using the HTTP user-agent string, either because they have explicit permission or because they believe they provide a valuable service. They can be classified into three categories:
Search engine bots/crawlers
Performance or security monitoring
Archiving
Several scraper operators publish detailed documentation on their scrapers, including how to identify them, the IP addresses they use, and how to opt out. It is crucial to review these documents for scrapers of interest, because unscrupulous scrapers often impersonate known ones, and the operators' websites often provide tools to verify whether a scraper is real or an imposter. Links to this documentation, along with screenshots, are provided in our full blog on F5 Labs.
Many scrapers identify themselves via the user-agent string, typically appending text that contains some of the following:
The name of the company, service, or tool that is doing the scraping
A website address for the company, service, or tool that is doing the scraping
A contact email for the administrator of the entity doing the scraping
Other text explaining what the scraper is doing or who they are
A key way to identify self-identifying scrapers is therefore to search the user-agent field in your server logs for specific strings. Table 1 below outlines common strings you can look for, and a short sketch of this kind of log search follows the examples below.

Table 1: Search strings to find self-identifying scrapers (* is a wildcard)
Self-identification method: Search string
Name of the tool or service: *Bot* or *bot*
Website address: *www* or *.com*
Contact email: *@*

Examples of User Agent Strings
OpenAI searchbot user agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Bing search bot user agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/
These scrapers include both the name of the tool or service and a website address in the user-agent string, so they can be identified using two of the methods highlighted in Table 1 above.
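The following is a minimal sketch of that kind of log search, assuming a combined/extended log format in which the user-agent string is the final quoted field; the log file path is a placeholder, and you would adjust the parsing to match your own log format.

import re
from collections import Counter

# Patterns corresponding to the Table 1 search strings:
# a bot name, a website address, or a contact email inside the user agent.
SELF_ID_PATTERNS = re.compile(r"(bot|www\.|\.com|@)", re.IGNORECASE)
# In a combined/extended log format the user agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')

def self_identifying_agents(log_path):
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = USER_AGENT_FIELD.search(line)
            if match and SELF_ID_PATTERNS.search(match.group(1)):
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    # "access.log" is a placeholder path for your web server's access log.
    for agent, hits in self_identifying_agents("access.log").most_common(20):
        print(f"{hits:8d}  {agent}")

The matches will include legitimate crawlers as well as imposters pretending to be them, so treat the output as a starting point for review rather than an allowlist or blocklist.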
Impersonation
Because user agents are self-reported, they are easily spoofed. Any scraper can pretend to be a known entity like Googlebot by simply presenting the Googlebot user-agent string. We have observed countless examples of fake bots impersonating large, well-known scrapers such as Google, Bing, and Facebook. As one example, Figure 1 shows the traffic overview of a fake Google scraper bot. This scraper was responsible for almost a hundred thousand requests per day against a large US hotel chain's room search endpoints. The bot used the following user-agent string, which is identical to the one used by the real Google bot:
Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) (765362)

IP-based Identification
Scrapers can also be identified via their IP addresses. Whois lookups on a scraper's IP address can reveal its owning organization or registered ASNs; while this does not always reveal the actual entity behind the traffic, it can be useful in certain cases. Geolocation information can also be used to identify automated scraping activity. Reverse DNS lookups use the Domain Name System (DNS) to find the domain name associated with an IP address, which helps confirm a scraper's identity, and free online reverse DNS lookup services make this easy to do. Since IP address spoofing is non-trivial, identifying and allowlisting scrapers by IP address is more secure than relying on user agents alone. A minimal example of this kind of reverse DNS check follows below.
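As a rough illustration of the reverse DNS technique, the sketch below performs a forward-confirmed reverse DNS check using Python's standard socket module. The trusted suffixes and the IP address are examples only, not an authoritative list; production checks should also handle timeouts and cache results.

import socket

# Verified crawler operators publish the DNS suffixes their bots resolve to;
# the entries below are examples, not an exhaustive or authoritative list.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def verify_crawler_ip(ip):
    """Forward-confirmed reverse DNS: rDNS lookup, suffix check, forward lookup."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)     # reverse DNS lookup
    except socket.herror:
        return False                                  # no PTR record at all
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False                                  # claims a bot but resolves elsewhere
    try:
        forward_ip = socket.gethostbyname(hostname)   # forward-confirm the hostname
    except socket.gaierror:
        return False
    return forward_ip == ip

if __name__ == "__main__":
    # The address below is used purely as an example input to check.
    print(verify_crawler_ip("66.249.66.1"))

Because the user-agent string alone can be spoofed, pairing it with a check like this helps separate a real search engine bot from the kind of imposter shown above.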
Artificial Intelligence (AI) Scrapers
Artificial intelligence companies are increasingly scraping the internet to train models, causing a surge in data scraping. This data is often used to build for-profit AI services, which sometimes compete with the scraping victims, and several lawsuits against these companies are currently underway. A California class-action lawsuit has been filed by 16 claimants against OpenAI, alleging copyright infringement due to the scraping and use of their data for model training. Due to the sensitivity around AI companies scraping data from the internet, a few things have happened. Growing scrutiny has forced these companies to start publishing details of their scraping activity, including ways to identify their AI scrapers and ways to opt out of having your applications scraped. AI companies have since seen an increase in opt-outs from AI scraping, leaving them unable to access some of the data needed to power their apps. In response, some less ethical AI companies have set up alternative "dark scrapers" which do not self-identify and instead secretly continue to scrape the data needed to power their AI services.

Unidentified Scrapers
Most scrapers don't identify themselves or request explicit permission, leaving application, network, and security teams unaware of their activities on web, mobile, and API applications. Identifying these scrapers can be challenging, but below are two techniques we have used in the past that can help identify the organization or actors behind them. To view additional techniques along with an in-depth explanation of each, head over to our blog post on F5 Labs.
1. Requests for Obscure or Non-Existent Resources
Website scrapers crawl obscure or low-volume pages, requesting resources like flight availability and pricing. Because they construct requests manually and send them directly to the origin servers, they sometimes request combinations that make no sense. Figure 2 shows an example of a scraper that was scraping an airline's flights and requesting flights to and from a train station.
2. IP Infrastructure Analysis: Use of Hosting Infrastructure or Corporate IP Ranges (Geolocation Matching)
Scrapers distribute traffic across proxy networks or botnets to avoid IP-based rate limits, but the infrastructure they use can make them easier to identify. Telltale tactics include:
Round-robin IP or user-agent usage
Use of hosting IPs
Use of low-reputation IPs
Use of international IPs that do not match expected user locations
The following are additional signals to keep in mind when trying to identify scrapers; we provide an in-depth overview of each in our full article on F5 Labs:
Conversion or look-to-book analysis
Not downloading or fetching images and dependencies, but just data
Behavior/session analysis

Conclusion
We discussed two methods above that can be helpful in identifying a scraper. Keep in mind, however, that it is crucial to consider the type of scraper and the sort of data it is targeting in order to identify it correctly. To read the full article on identifying scrapers, which includes more identification methods, head on over to our post on F5 Labs. Otherwise, continue on to part two, where we outline tactics to help manage scrapers.

SIS March 2024: TP-Link Archer AX21 Wifi Router targeting, plus a handful of new CVEs!
The March 2024 Sensor Intel Series report highlights a significant surge in scanning activity for the vulnerability CVE-2023-1389 and notes that most of the scanning traffic originates from two ASNs, suggesting a concentrated effort from specific sources.

F5 Labs 2019 TLS Telemetry Report Summary
Encryption standards are constantly evolving, so it is important to stay up to date with best practices. The 2019 F5 Labs TLS Telemetry Summary Report, by David Warburton with additional contributions from Remi Cohen and Debbie Walkowski, expands the scope of our research to bring you deeper insights into how encryption on the web is constantly evolving. We look into which ciphers and SSL/TLS versions are being used to secure the Internet's top websites and, for the first time, examine the use of digital certificates on the web and look at supporting protocols (such as DNS) and application layer headers. On average, almost 86% of all page loads over the web are now encrypted with HTTPS. This is a win for consumer privacy and security, but it also poses a problem for those scanning web traffic. In our research we found that 71% of phishing sites in July 2019 were using secure HTTPS connections with valid digital certificates. This means we have to stop training users to "look for the HTTPS at the start of the address," since attackers are using deceptive URLs to emulate secure connections for their phishing and malware sites. Read our report for details and recommendations on how to bolster your HTTPS connections.

The F5 Labs 2019 Application Protection Report
For the past several years, F5 Labs has produced the Application Protection Research Series, first as individual reports and then as a series of episodes released during the year. We have just released the final edition of the 2019 report, which places years of security trends and patterns into a single long-term picture, getting away from news cycles and hype that focus only on new threats or vulnerabilities that may not even be applicable. This perspective also allows us to see linkages between the different subdomains and foci that make up the complex and porous field we call information security. This comprehensive report pulls together the various threats, data sources, and patterns from the previous episodes into a unified line of inquiry that began in early 2019, picking up where the 2018 Application Protection Report left off, and concluded in early 2020 with updated data on 2019 breaches and architectural risk. One of the underlying themes of the 2019 series has been that changes in the ways we design, build, and deploy applications have been drivers for risk. From third-party services driving the rise of an injection attack known as formjacking, to a growing list of seemingly avoidable API breaches, to the prevalence of platforms running on languages with old and documented flaws, there has been a good deal of goalpost movement for defenders. The implication is that many of the people making decisions with significant ramifications for security (system owners, application architects, DevOps teams) are generally placing other priorities ahead of security. Based on the acceleration of trends in 2019 that we identified from 2018, it seems that this tension will characterize the next few years of the security arms race. Our top conclusions in this report include:
Access attacks were predominant in every sector except retail
Retail breaches were increasingly dominated by formjacking
Breach modes were driven more by application architecture than by traditional sector
Get the full report here: https://www.f5.com/labs/articles/threat-intelligence/2019-application-protection-report
Executive summary: https://www.f5.com/labs/articles/threat-intelligence/application-protection-research-series-executive-summary

F5 Labs Report: Cybersecurity Compliance Failures in Financial Services
One important piece of the 2021 Application Protection Report revealed that, of all breaches studied in 2020, the financial sector had the dubious honor of the highest percentage: 17 percent. With breaches comes increased regulatory attention. In 2017, New York's Department of Financial Services (NYDFS) enacted the 23 NYCRR Part 500 regulations, calling out explicit cybersecurity requirements for financial services firms. Since then, three financial services organizations that were breached have faced sobering consequences for failing to meet the NYDFS law. This in-depth article looks at each of those breaches in greater detail. Check out Cybersecurity Compliance Failures in Financial Services on F5 Labs.