ai

82 Topics

F5 Distributed Cloud WAF AI/ML Model to Suppress False Positives
Introduction: Web Application Firewall (WAF) has evolved to protect web applications from attack. A signature-based WAF responds to threats through the implementation of application-specific detection rules which block malicious traffic. These managed rules work extremely well for patterns of established attack vectors, as they have been extensively tested to minimize both false negatives and false positives. Most of the Web Applications development is concentrated to deliver services seamlessly rather than integrating security services to tackle recent or every security attack. Some applications might have a logic or an operation that looks suspicious and might trigger a WAF rule. But that is how applications are built and made to behave depending on their purpose. Under these circumstances WAF considers requests to these areas as attack, which is truly not, and the respective attack signature is invoked which is called as False Positive. Though the requests are legitimate WAF blocks these requests. It is tedious to update the signature rule set which requires greater human effort. AI/ML helps to solve this problem so that the real user requests are not blocked by WAF. This article aims to provide configuration of WAF along with Automatic attack signature tuning to suppress false positives using AI/ML model. A More Intelligent Solution: F5 Distributed Cloud (F5 XC) AI/ML model uses self-learning probabilistic machine learning model that suppresses false positives triggered by Signature Engine. AI/ML is a tool that identifies the false positives triggered by signature engine and acts as an additional layer of intelligence, which automatically suppresses false positives based on a Machine learning model without human intervention. This model minimizes false positives and helps to determine the probability that triggered the particular signature is evidence of an attack or just an error or a change in how users interact with the application. This model is trained using vast amount of benign and an attack traffic of real time customer log. AI/ML model does not rely on human involvement to understand operational patterns and user interactions with Web Application. Hence it saves a lot of human effort. Step by step procedure to enable attack signature tuning to supress false positives These are the steps to enable attack signatures and its accuracy Create a firewall by enabling Automatic attack signatures Assign the firewall to Load Balancer Step 1: Create an App Firewall Navigate to F5 XC Console Home > Load Balancers > Security > App Firewall and click on Add App Firewall Enter valid name for Firewall and Navigate to Detection Settings Select Security Policy as “Custom” with in the Detection settings and select Automatic Attack Signatures Tuning “Enable” as shown below, Select Signature Selection by Accuracy as “High and Medium” from the dropdown. Scroll down to the bottom and click on “Save and Exit” button. Steps 2: Assigning the Firewall to the Load Balancer From the F5 XC Console homepage, Navigate to Load Balancers > Manage > Load Balancers > HTTP load balancer Select the load balancer to which above created Firewall to be assigned. Click on menu in Actions column of app Load Balancer and click on Manage Configurations as shown below to display load balancer configs. Once Load Balancer configurations are displayed click on Edit configuration button on the top right of the page. Navigate to Security Configuration settings and choose Enable in dropdown of Web Application Firewall (WAF) Assign the Firewall to the Load Balancer which is created in step 1 by selecting the name from the Enable dropdown as shown below, Scroll down to the bottom and click on “Save and Exit” button, with this Firewall is assigned to Load Balancer. Step 3: Verify the auto supressed signatures for false positives From the F5 XC Console homepage, Navigate to Web App and API Protection > Apps & APIs > Security and select the Load Balancer Select Security Events and click on Add filter Enter the key word Signatures.states and select Auto Supressed. Displayed logs shows the Signatures that are auto supressed by AI/ML Model. "Nature is a mutable cloud, which is always and never the same." - Ralph Waldo Emerson We might not wax that philosophically around here, but our heads are in the cloud nonetheless! Join the F5 Distributed Cloud user group today and learn more with your peers and other F5 experts. Conclusion: With the additional layer of intelligence to the signature engine F5 XC's AI/ML model can automatically suppresses false positives without human intervention. Customer can be less concerned about their activities of application that look suspicious which in turns to be actual behaviour and hence the legitimate requests are not blocked by this model. Decisions are based on enormous amount of real data fed to the system to understand application and user’s behaviour which makes this model more intelligent.
chaithanya_dileep
Sep 04, 2022 Place Technical Articles
6.6KViews
7likes
9Comments
SSL Orchestrator Advanced Use Cases: Detecting Generative AI
Introduction Quick, take a look at the following list and answer this question: "What do these movies have in common?" 2001: A Space Odyssey Westworld Tron WarGames Electric Dreams The Terminator The Matrix Eagle Eye Ex Machina Avengers: Age of Ultron M3GAN If you answered, "They're all about artificial intelligence", yes, but... If you answered, "They're all about artificial intelligence that went terribly, sometimes horribly wrong", you'd be absolutely correct. The simple fact is...artificial intelligence (AI) can be scary. Proponents for, and opponents against will disagree on many aspects, but they can all at least acknowledge there's a handful of ways to do AI correctly...and a million ways to do it badly. Not to be an alarmist, but while SkyNet was fictional, semi-autonomous guns on robot dogs is not... But then why am I talking about this on a technical forum you may ask? Well, when most of the above films were made, AI was largely still science fiction. That's clearly not the case anymore, and tools like ChatGPT are just the tip of the coming AI frontier. To be fair, I don't make the claim that all AI is bad, and many have indeed lauded ChatGPT and other generative AI tools as the next great evolution in technology. But it's also fair to say that generative AI tools, like ChatGPT, have a very real potential to cause harm. At the very least, these tools can be convincing, even when they're wrong. And worse, they could lead to sensitive information disclosures. One only has to do a cursory search to find a few examples of questionable behavior: Lawyers File Motion Written by AI, Face Sanctions and Possible Disbarment Higher Ed Beware: 10 Dangers of ChatGPT Schools Need to Know ChatGPT and AI in the Workplace: Should Employers Be Concerned? OpenAI's New Chatbot Will Tell You How to Shoplift and Make Explosives Giant Bank JP Morgan Bans ChatGPT Use Among Employees Samsung Bans ChatGPT Among Employees After Sensitive Code Leak But again...what does this have to do with a technical forum? And more important, what does this have to do with you? Simply stated, if you are in an organization where generative AI tools could be abused, understanding, and optionally controlling how and when these tools are accessed, could help to prevent the next big exploit or disclosure. If you search beyond the above links, you'll find an abundance of information on both the benefits, and security concerns of AI technologies. And ultimately you'll still be left to decide if these AI tools are safe for your organization. It may simply be worthwhile to understand WHAT tools are being used. And in some cases, it may be important to disable access to these. Given the general depth and diversity of AI functions within arms-reach today, and growing, it'd be irresponsible to claim "complete awareness". The bulk of these functions are delivered over standard HTTPS, so the best course of action will be to categorize on known assets, and adjust as new ones come along. As of the publishing of this article, the industry has yet to define a standard set of categories for AI, and specifically, generative AI. So in this article, we're going to build one and attach that to F5 BIG-IP SSL Orchestrator to enable proactive detection and optional control of Internet-based AI tool access in your organization. Let's get started! BIG-IP SSL Orchestrator Use Case: Detecting Generative AI The real beauty of this solution is that it can be implemented faster than it probably took to read the above introduction. Essentially, you're going to create a custom URL category on F5 BIG-IP, populate that with known generative AI URLs, and employ that custom category in a BIG-IP SSL Orchestrator security policy rule. Within that policy rule, you can elect to dynamically decrypt and send the traffic to the set of inspection products in your security enclave. Step 1: Create the custom URL category and populate with known AI URLs - Access the BIG-IP command shell and run the following command. This will initiate a script that creates and populates the URL category: curl -s https://raw.githubusercontent.com/f5devcentral/sslo-script-tools/main/sslo-generative-ai-categories/sslo-create-ai-category.sh |bash Step 2: Create a BIG-IP SSL Orchestrator policy rule to use this data - The above script creates/re-populates a custom URL category named SSLO_GENERATIVE_AI_CHAT, and in that category is a set of known generative AI URLs. To use, navigate to the BIG-IP SSL Orchestrator UI and edit a Security Policy. Click add to create a new rule, use the "Category Lookup (All)" policy condition, then add the above URL category. Set the Action to "Allow", SSL Proxy Action to "Intercept", and Service Chain to whatever service chain you've already created. With Summary Logging enabled in the BIG-IP SSL Orchestrator topology configuration, you'll also get Syslog reporting for each AI resource match - who made the request, to what, and when. The URL category is employed here to identify known AI tools. In this instance, BIG-IP SSL Orchestrator is used to make that assessment and act on it (i.e. allow, TLS intercept, service chain, log). Should you want even more granular control over conditions and actions of the decrypted AI tool traffic, you can also deploy an F5 Secure Web Gateway Services policy inside the SSL Orchestrator service chain. With SWG, you can expand beyond simple detection and blocking, and build more complex rules to decide who can access, when, and how. It should be said that beyond logging, allowing, or denying access to generative AI tools, SSL Orchestrator is also going to provide decryption and the opportunity to dynamically steer the decrypted AI traffic to any set of security products best suited to protect against any potential malware. Summary As previously alluded, this is not an exhaustive list of AI tool URLs. Not even close. But it contains the most common you'll see in the wild. The above script populates with an initial list of URLs that you are free to update as you become aware of new one. And of course we invite you to recommend additional AI tools to add to this list. References: https://github.com/f5devcentral/sslo-script-tools/tree/main/sslo-generative-ai-categories
Kevin_Stewart
Aug 13, 2023 Place Technical Articles
2.5KViews
5likes
0Comments
Using ChatGPT for security and introduction of AI security
TL;DR There are many security services that uses ChatGPT. Methods to attack against AI are, for example, input noises, poisoning data, or reverse engineering. Introduction If you hear about "AI and security", 2 things can be considered. First, using AI for cyber security. Second, attack against AI. In this article, I am going to discuss these topics. - Using AI for security: Introducing some security application that uses ChatGPT. - Attack against AI: What it is. Using AI (ChatGPT) for security (purpose) Since the announcement of GPT-3 in September 2020 and the release of many image-generating AIs in 2022, using AI become commonplace. Especially, after the release of ChatGPT in November 2022, it immediately got popular because of its ability to generate quite natural sentences for human. ChatGPT is also used to code from human's natural languages, and also can be used to explain the meaning of the codes, memory dumps, or logs in a way that is easy for human to understand. Finding an unusual pattern from a large amount of data is what AI is good at. Hence, there is a service to use AI for Incident Response. - Microsoft Security Copilot : Security Incident response adviser. This research uses ChatGPT to detect phishing sites and marked 98.3% of accuracy. - Detecting Phishing Sites Using ChatGPT Of course, the ChatGPT can be used for penetration testing. - PentestGPT　 However no one is willing to share sensitive information with Microsoft or other vendors. Then it is possible to run ChatGPT-Like LLM on Your PC Offline by some opensource LLM application, for example gpt4all. gpt4all needs GPU and large memory (128G+) to work. - gpt4all ChatGPT will be kept used for both offensive and defensive security. Attack against AI Before we discuss about attack against AI, let's briefly review how AI works. Research on AI has long history. However, generally people uses AI as a Machine Learning model or Deep Learning algorithms, and some of them uses Neural Network. In this article, we discuss about Deep Neural Network (DNN). DNN DNNs works as follows. At first, there are several nodes and one set of those are called nodes. Each nodes has it layer and the layer are connected each other. (Please see the pic below). The data from Input layer is going to propagate to multiple (hidden) layers and then finally reached to the Output layer, which performs classification or regression analysis. For example, input many pictures of animals to let the DNN learn, and then perform to identify (categorize) which animal is in the pictures. What kind of attacks are possible against AI? Threat of cyber security is to compromise the system's CIA (Confidentiality, Integrity, Availability). The attack to AI is to force wrong decisions (lose Integrity), make the AI unavailable (lose availability), or the decision model is theft (lose confidentiality). Among these attacking, the most well-known attack methodology is to input a noise in the input layer and force wrong decision - it is named as an Adversarial Example attack. Adversarial Example attack The Adversarial Example is illustrated in this paper in 2014: - Explaining and Harnessing Adversarial Examples The panda in the picture on the left side is the original data and be input to DNN - normally, the DNN will categorize this picture as panda obviously. However, if the attacker add a noise (middle picture), the DNN misjudge it as a gibbon. In other words, the attack on the AI is to make the AI make a wrong decision, without noticed by humans. The example above is attack to the image classifier. Another attack example is ShapeShifter, which attack to object detector. This makes a self-driving car with AI cause an accident without being noticed by humans, by makes stop signs undetectable. - ShapeShifter: Robust Physical Adversarial Attack on Faster R-CNN Object Detector Usually, a stop sign image is captured through a optical sensor in a self-driving car, and its object detector would recognise it as a stop sign and follow the instructions on the sign to stop. However, this attack would cause the car to fail to recognise the stop sign. You might think even if the DNN model on a self driving car is classified so the attacker can't get info to attack to the specific DNN model. However, the paper below discuss that an adversarial example designed for one model can transfer to other models as well (transferability). - Transferability Ranking of Adversarial Examples That means, even if an attacker is unable to examine the target DNN model, they can still experiment and attack by other DNN models. Data poisoning attack In an adversarial example attack, the data itself is not changed, instead, added noise to the data. The attack that poisoning the training data also exists. Data poisoning is to access to the training data which is used to learn/train the DNN model, and input incorrect data to make DNN model produce results which is profitable for the attacker, or reducing the accuracy of the learning. Inputting a backdoor is also possible. - Transferable Clean-Label Poisoning Attacks on Deep Neural Nets Reverse engineering attack Vulnerabilities in cryptography include a vulnerability that the attacker can learn the encryption model by analyzing the input/output strings which are easy to obtain. Similarly, in AI models, there is a possibility of reverse engineering of DNN models or copy the models by analysing the input (training data) and output (decision results). These papers discuss about that. - Reverse-Engineering Deep Neural Networks Using Floating-Point Timing Side-Channels - Copycat CNN　 Finally, there's one last thing I'd like to say. This article was not generated by ChatGPT.
Koichi
Jul 17, 2023 Place Technical Articles
2.4KViews
6likes
5Comments
Securing Generative AI: Defending the Future of Innovation and Creativity
Protect your organization's generative AI investments by mitigating security risks effectively. This comprehensive guide examines the assets of AI systems, analyzes potential threats, and offers actionable recommendations to strengthen security and maintain the integrity of your AI-powered applications.
Jordan_Zebor
Jul 19, 2023 Place Technical Articles
2.3KViews
7likes
2Comments
Securing the LLM User Experience with an AI Firewall
As artificial intelligence (AI) seeps into the core day-to-day operations of enterprises, a need exists to exert control over the intersection point of AI-infused applications and the actual large language models (LLMs) that answer the generated prompts. This control point should serve to impose security rules to automatically prevent issues such as personally identifiable information (PII) inadvertently exposed to LLMs. The solution must also counteract motivated, intentional misuse such as jailbreak attempts, where the LLM can be manipulated to provide often ridiculous answers with the ensuing screenshotting attempting to discredit the service. Beyond the security aspect and the overwhelming concern of regulated industries, other drivers include basic fiscal prudence 101, ensuring the token consumption of each offered LLM model is not out of hand. This entire discussion around observability and policy enforcement for LLM consumption has given rise to a class of solutions most frequently referred to as AI Firewalls or AI Gateways (AI GW). An AI FW might be leveraged by a browser plugin, or perhaps applying a software development kit (SDK) during the coding process for AI applications. Arguably, the most scalable and most easily deployed approach to inserting AI FW functionality into live traffic to LLMs is to use a reverse proxy. A modern approach includes the F5 Distributed Cloud service, coupled with an AI FW/GW service, cloud-based or self-hosted, that can inspect traffic intended for LLMs like those of OpenAI, Azure OpenAI, or privately operated LLMs like those downloaded from Hugging Face. A key value offered by this topology, a reverse proxy handing off LLM traffic to an AI FW, which in turn can allow traffic to reach target LLMs, stems from the fact that traffic is seen, and thus controllable, in both directions. Should an issue be present in a user’s submitted prompt, also known as an “inference”, it can be flagged: PII (Personally Identifiable Information) leakage is a frequent concern at this point. In addition, any LLM responses to prompts are also seen in the reverse path: consider a corrupted LLM providing toxicity in its generated replies. Not good. To achieve a highly performant reverse proxy approach to secured LLM access, a solution that can span a global set of users, F5 worked with Prompt Security to deploy an end-to-end AI security layer. This article will explore the efficacy and performance of the live solution. Impose LLM Guardrails with the AI Firewall and Distributed Cloud An AI firewall such as the Prompt Security offering can get in-line with AI LLM flows through multiple means. API calls from Curl or Postman can be modified to transmit to Prompt Security when trying to reach targets such as OpenAI or Azure OpenAI Service. Simple firewall rules can prevent employee direct access to these well-known API endpoints, thus making the Prompt Security route the sanctioned method of engaging with LLMs. A number of other methods could be considered but have concerns. Browser plug-ins have the advantage of working outside the encryption of the TLS layer, in a manner similar to how users can use a browser’s developer tools to clearly see targets and HTTP headers of HTTPS transactions encrypted on the wire. Prompt Security supports plugins. A downside, however, of browser plug-ins is the manageability issue, how to enforce and maintain across-the-board usage, simply consider the headache non-corporate assets used in the work environment. Another approach, interesting for non-browser, thick applications on desktops, think of an IDE like VSCode, might be an agent approach, whereby outbound traffic is handled by an on-board local proxy. Again, Prompt can fit in this model however the complexity of enforcement of the agent, like the browser approach, may not always be easy and aligned with complete A-to-Z security of all endpoints. One of the simplest approaches is to ingest LLM traffic through a network-centric approach. An F5 Distributed Cloud HTTPS load balancer, for instance, can ingest LLM-bound traffic, and thoroughly secure the traffic at the API layer, things like WAF policy and DDoS mitigations, as examples. HTTP-based control plane security is the focus here, as opposed to the encapsulated requests a user is sending to an LLM. The HTTPS load balancer can in turn hand off traffic intended for the likes of OpenAI to the AI gateway for prompt-aware inspections. F5 Distributed Cloud (XC) is a good architectural fit for inserting a third-party AI firewall service in-line with an organization’s inferencing requests. Simply project a FQDN for the consumption of AI services; in this article we used the domain name “llmsec.busdevF5.net” into the global DNS, advertising one single IP address mapping to the name. This DNS advertisement can be done with XC. The IP address, through BGP-4 support for anycast, will direct any traffic to this address to the closest of 27 international points of presence of the XC global fabric. Traffic from a user in Asia may be attracted to Singapore or Mumbai F5 sites, whereas a user in Western Europe might enter the F5 network in Paris or Frankfurt. As depicted, a distributed HTTPS load balancer can be configured – “distributed” reflects the fact traffic ingressing in any of the global sites can be intercepted by the load balancer. Normally, the server name indicator (SNI) value in the TLS Client Hello can be easily used to pick the correct load balancer to process this traffic. The first step in AI security is traditional reverse proxy core security features, all imposed by the XC load balancer. These features, to name just a few, might include geo-IP service policies to preclude traffic from regions, automatic malicious user detection, and API rate limiting; there are many capabilities bundled together. Clean traffic can then be selected for forwarding to an origin pool member, which is the standard operation of any load balancer. In this case, the Prompt Security service is the exclusive member of our origin pool. For this article, it is a cloud instantiated service - options exist to forward to Prompt implemented on a Kubernetes cluster or running on a Distributed Cloud AppStack Customer Edge (CE) node. Block Sensitive Data with Prompt Security In-Line AI inferences, upon reaching Prompt’s security service, are subjected to a wide breadth of security inspections. Some of the more important categories would include: Sensitive data leakage, although potentially contained in LLM responses, intuitively the larger proportion of risk is within the requesting prompt, with user perhaps inadvertently disclosing data which should not reach an LLM Source code fragments within submissions to LLMs, various programming languages may be scanned for and blocked, and the code may be enterprise intellectual property OWASP LLM top 10 high risk violations, such as LLM jailbreaking where the intent is to make the LLM behave and generate content that is not aligned with the service intentions; the goal may be embarrassing “screenshots”, such as having a chatbot for automobile vendor A actually recommend a vehicle from vendor B OWASP Prompt Injection detection, considered one of the most dangerous threats as the intention is for rogue users to exfiltrate valuable data from sources the LLM may have privileged access to, such as backend databases Token layer attacks, such as unauthorized and excessive use of tokens for LLM tasks, the so-called “Denial of Wallet” threat Content moderation, ensuring a safe interaction with LLMs devoid of toxicity, racial and gender discriminatory language and overall curated AI experience aligned with those productivity gains that LLMs promise To demonstrate sensitive data leakage protection, a Prompt Security policy was active which blocked LLM requests with, among many PII fields, a mailing address exposed. To reach OpenAI GPT3.5-Turbo, one of the most popular and cost-effective models in the OpenAI model lineup, prompts were sent to an F5 XC HTTPS load balancer at address llmsec.busdevf5.net. Traffic not violating the comprehensive F5 WAF security rules were proxied to the Prompt Security SaaS offering. The prompt below clearly involves a mailing address in the data portion. The ensuing prompt is intercepted by both the F5 and Prompt Security solutions. The first interception, the distributed HTTPS load balancer offered by F5 offers rich details on the transaction, and since no WAF rules or other security policies are violated, the transaction is forwarded to Prompt Security. The following demonstrates some of the interesting details surrounding the transaction, when completed (double-click to enlarge). As highlighted, the transaction was successful at the HTTP layer, producing a 200 Okay outcome. The traffic originated in the municipality of Ashton, in Canada, and was received into Distributed Cloud in F5’s Toronto (tr2-tor) RE site. The full details around the targeted URL path, such as the OpenAI /v1/chat/completions target and the user-agent involved, vscode-restclient, are both provided. Although the HTTP transaction was successful, the actual AI prompt was rejected, as hoped for, by Prompt Security. Drilling into the Activity Monitor in the Prompt UI, one can get a detailed verdict on the transaction (double-click). Following the yellow highlights above, the prompt was blocked, and the violation is “Sensitive Data”. The specific offending content, the New York City street address, is flagged as a precluded entity type of “mailing address”. Other fields that might be potentially blocking candidates with Prompt’s solution include various international passports or driver’s license formats, credit card numbers, emails, and IP addresses, to name but a few. A nice, time saving feature offered by the Prompt Security user interface is to simply choose an individual security framework of interest, such as GDPR or PCI, and the solution will automatically invoke related sensitive data types to detect. An important idea to grasp: The solution from Prompt is much more nuanced and advanced than simple REGEX; it invokes the power of AI itself to secure customer journeys into safe AI usage. Machine learning models, often transformer-based, have been fine-tuned and orchestrated to interpret the overall tone and tenor of prompts, gaining a real semantic understanding of what is being conveyed in the prompt to counteract simple obfuscation attempts. For instance, using printed numbers, such as one, two, three to circumvent Regex rules predicated on numerals being present - this will not succeed. This AI infused ability to interpret context and intent allows for preset industry guidelines for safe LLM enforcement. For instance, simply indicating the business sector is financial will allow the Prompt Security solution to pass judgement, and block if desired, financial reports, investment strategy documents and revenue audits, to name just a few. Similar awareness for sectors such as healthcare or insurance is simply a pull-down menu item away with the policy builder. Source Code Detection A common use case for LLM security solutions is identification and, potentially, blocking submissions of enterprise source code to LLM services. In this scenario, this small snippet of Python is delivered to the Prompt service: def trial(): return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500 sum(trial() for i in range(10_000)) / 10_000 A policy is in place for Python and JavaScript detection and was invoked as hoped for. curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3BlbkFJ5RFOI***********' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "def trial():\n return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500\n\nsum(trial() for i in range(10_000)) / 10_000"}]}' Content Moderation for Interactions with LLMs One common manner of preventing LLM responses from veering into undesirable territory is for the service provider to implement a detailed system prompt, a set of guidelines that the LLM should be governed by when responding to user prompts. For instance, the system prompt might instruct the LLM to serve as polite, helpful and succinct assistant for customers purchasing shoes in an online e-commerce portal. A request for help involving the trafficking of narcotics should, intuitively, be denied. Defense in depth has traditionally meant no single point of failure. In the above scenario, screening both the user prompt and ensuring LLM response for a wide range of topics leads to a more ironclad security outcome. The following demonstrates some of the topics Prompt Security can intelligently seek out; in this simple example, the topic of “News & Politics” has been singled out to block as a demonstration. Testing can be performed with this easy Curl command, asking for a prediction on a possible election result in Canadian politics: curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3Blbk*************' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "Who will win the upcoming Canadian federal election expected in 2025"}],"max_tokens": 250,"temperature": 0.7}' The response, available in the Prompt Security console, is also presented to the user. In this case, a Curl user leveraging the VSCode IDE. The response has been largely truncated for brevity, fields that are of interest is an HTTP “X-header” indicating the transaction utilized the F5 site in Toronto, and the number of tokens consumed in the request and response are also included. Advanced LLM Security Features Many of the AI security concerns are given prominence by the OWASP Top Ten for LLMs, an evolving and curated list of potential concerns around LLM usage from subject matter experts. Among these are prompt injection attacks and malicious instructions often perceived as benign by the LLM. Prompt Security uses a layered approach to thwart prompt injection. For instance, during the uptick in interest in ChatGPT, DAN (Do Anything Now) prompt injection was widespread and a very disruptive force, as discussed here. User prompts will be closely analyzed for the presence of the various DAN templates that have evolved over the past 18 months. More significantly, the use of AI itself allows the Prompt solution to recognize zero-day bespoke prompts attempting to conduct mischief. The interpretative powers of fine-tuned, purpose-built security inspection models are likely the only way to stay one step ahead of bad actors. Another chief concern is protection of the system prompt, the guidelines that reel in unwanted behavior of the offered LLM service, what instructed our LLM earlier in its role as a shoe sales assistant. The system prompt, if somehow manipulated, would be a significant breach in AI security, havoc could be created with an LLM directed astray. As such, Prompt Security offers a policy to compare the user provided prompt, the configured system prompt in the API call, and the response generated by the LLM. In the event that a similarity threshold with the system prompt is exceeded in the other fields, the transaction can be immediately blocked. An interesting advanced safeguard is the support for a “canary” word - a specific value that a well behaved LLM should never present in any response, ever. The detection of the canary word by the Prompt solution will raise an immediate alert. One particularly broad and powerful feature in the AI firewall is the ability to find secrets, meaning tokens or passwords, frequently for cloud-hosted services, that are revealed within user prompts. Prompt Security offers the ability to scour LLM traffic for in excess of 200 meaningful values. Just as a small representative sample of the industry’s breadth of secrets, these can all be detected and acted upon: Azure Storage Keys Detector Artifactory Detector Databricks API tokens GitLab credentials NYTimes Access Tokens Atlassian API Tokens Besides simple blocking, a useful redaction option can be chosen. Rather than risk compromise of credentials and obfuscated value will instead be seen at the LLM. F5 Positive Security Models for AI Endpoints The AI traffic delivered and received from Prompt Security’s AI firewall is both discovered and subjected to API layer policies by the F5 load balancer. Consider the token awareness features of the AI firewall, excessive token consumption can trigger an alert and even transaction blocking. This behavior, a boon when LLMs like the OpenAI premium GPT-4 models may have substantial costs, allows organizations to automatically shut down a malicious actor who illegitimately got hold of an OPENAI_API key value and bombarded the LLM with prompts. This is often referred to as a “Denial of Wallet” situation. F5 Distributed Cloud, with its focus upon the API layer, has congruent safeguards. Each unique user of an API service is tracked to monitor transactional consumption. By setting safeguards for API rate limiting, an excessive load placed upon the API endpoint will result in HTTP 429 “Too Many Request” in response to abusive behavior. A key feature of F5 API Security is the fact that it is actionable in both directions, and also an in-line offering, unlike some API solutions which reside out of band and consume proxy logs for reporting and threat detection. With the automatic discovery of API endpoints, as seen in the following screenshot, the F5 administrator can see the full URL path which in this case exercises the familiar OpenAI /v1/chat/completions endpoint. As highlighted by the arrow, the schema of traffic to API endpoints is fully downloadable as an OpenAPI Specification (OAS), formerly known as a Swagger file. This layer of security means fields in API headers and bodies can be validated for syntax, such that a field whose schema expects a floating-point number can see any different encoding, such as a string, blocked in real-time in either direction. A possible and valuable use case: allow an initial unfettered access to a service such as OpenAI, by means of Prompt Security’s AI firewall service, for a matter of perhaps 48 hours. After a baseline of API endpoints has been observed, the API definition can be loaded from any saved Swagger files at the end of this “observation” period. The loaded version can be fully pruned of undesirable or disallowed endpoints, all future traffic must conform or be dropped. This is an example of a “positive security model”, considered a gold standard by many risk-adverse organizations. Simply put, a positive security model allows what has been agreed upon through and rejects everything else. This ability to learn and review your own traffic, and then only present Prompt Security with LLM endpoints that an organization wants exposed is an interesting example of complementing an AI security solution with rich API layer features. Summary The world of AI and LLMs is rapidly seeing investment, in time and money, from virtually all economic sectors; the promise of rapid dividends in the knowledge economy is hard to resist. As with any rapid deployment of new technology, safe consumption is not guaranteed, and it is not built in. Although LLMs often suggest guardrails are baked into offerings, a 30-second search of the Internet will expose firsthand experiences where unexpected outcomes when invoking AI are real. Brand reputation is at stake and false information can be hallucinated or coerced out of LLMs by determined parties. By combining the ability to ingest globally at high-speed dispersed users and apply a first level of security protections, F5 Distributed Cloud can be leveraged as an onboarding for LLM workloads. As depicted in this article, Prompt Security can in turn handle traffic egressing F5’s distributed HTTPS load balancers and provide state-of-the-art AI safeguards, including sensitive data detection, content moderation and other OWASP-aligned mechanisms like jailbreak and prompt injection mitigation. Other deployment models exist, including deploying Prompt Security’s solution on-premises, self-hosted in cloud tenants, and running the solution on Distributed Cloud CE nodes themselves is supported.
Steve_Gorman
Jun 20, 2024 Place Technical Articles
2KViews
4likes
0Comments
Secure RAG for Safe AI Deployments Using F5 Distributed Cloud and NetApp ONTAP
Retrieval Augmented Generation (RAG) is one of the most discussed techniques to empower Large Language Models (LLM) to deliver niche, hyper-focused responses pertaining to specialized, sometimes proprietary, bodies of knowledge documents. Two simple examples might include highly detailed company-specific information distilled from years of financial internal reporting from financial controllers or helpdesk type queries with the LLM harvesting only relevant knowledge base (KB) articles, releases notes, and private engineering documents not normally exposed in their entirety. RAG is highly bantered about in numerous good articles; the two principal values are: LLM responses to prompts (queries) based upon specific, niche knowledge as opposed to the general, vast pre-training generic LLMs are taught with; in fact, it is common to instruct LLMs not to answer specifically with any pre-trained knowledge. Only the content “augmenting” the prompt. Attribution is a key deliverable with RAG. Generally LLM pre-trained knowledge inquiries are difficult to traceback to a root source of truth. Prompts augmented with specific assistive knowledge normally solicit responses that clearly call out the source of the answers provided. Why is the Security of RAG Source Content Particularly Important? To maximize the efficacy of LLM solutions in the realm of artificial intelligence (AI) an often-repeated adage is “garbage in, garbage out” which succinctly states an obvious fact with RAG: valuable and actionable items must be entered into the model to expect valuable, tactical outcomes. This means exposing key forms of data, examples being data which might include patented knowledge, intellectual property not to be exposed in raw form to competitors. Actual trade secrets, which will infuse the LLM but need to remain confidential in their native form. In one example around trade secrets, the Government of Canada spells out a series of items courts will look at in determining compensation for misuse (theft) of intellectual property. It is notable that the first item listed is not the cost associated with creation of the secret material (“the cost in money or time of creating or developing the information”) but rather the very first item is instead how much effort was made to keep the content secure (“the measures taken to maintain secrecy”). With RAG, incoming queries are augmented with rich, semantically similar enterprise content. The content has already been populated into a vector database by converting documents, they might be pdf or docx as examples, into raw text form and converting chunks of text into vectors. The vectors are long sequences of numbers with similar mathematical attributes for similar content. As a trivial example, one-word chunks such as glass, cup, bucket, jar might be semantically related, meaning similarities can be construed by both human minds and LLMs. On the other hand, empathy, joy, and thoughtfulness maintain similarities of their own. This semantic approach means a phrase/sentence/paragraph (chunk) using bow to mean “to bend in respect” will be highly distinct from chunks referring to the “front end of a ship" or “something to tie one’s hair back with”, even a tool every violinist would need. The list goes on; all semantic meanings of bow are very different in these chunks and would have distinctive embeddings within a vector database. The word embedding is likely derived from “fixing” or “planting” an object. In this case, words are “embedded” into a contextual understanding. The typical length of the number sequence describing the meaning of items has typically been more than 700, but this number of “dimensions” applied is always a matter of research, and the entire vector database is arrived at with an embedding LLM, distinct from the main LLM that will produce generative AI responses to our queries. Incoming queries destined for the main generative AI LLM can, in turn, be converted to vectors themselves by the very same text-embedding “helper” LLM and through retrieval (the “R” in RAG) similar textual content can buttress the prompt presented to the main LLM (double click to expand). Since a critical cog in the wheel of the RAG architecture is the ingestion of valuable and sensitive source documents into the vector database, using the embedding LLM, it is not just prudent but critical that this source content be brought securely over networks to the embedding engine. F5 Distributed Cloud Secure Multicloud Networking and NetApp ONTAP For many practical, time-to-market reasons, modern LLMs, both the main and embedding instances, may not be collocated with the data vaults of modern enterprises. LLMs benefit from cloud compute and GPU access, something often in short supply for on-premises production roll outs. A typical approach assisted by the economies of scale might be to harvest public cloud providers, such as Azure, AWS, and Google Cloud Platform for the compute side of AI projects. Azure, as one example, can turn up virtual machines with GPUs from NVIDIA like A100, A2, and Tesla T4 to name a few. The documents needed to feed an effective RAG solution may well be on-premises, and this is unlikely to change for reasons including governance, regulatory, and the weight of decades of sound security practice. One of the leading on-premises storage solutions of the last 25 years is the NetApp ONTAP storage appliance family, and reflected in this quote from NVIDIA: "Nearly half of the files in the world are stored on-prem on NetApp." — Jensen Huang, CEO of NVIDIA A key deliverable of F5 Distributed Cloud is providing encrypted interconnectivity of disparate physical sites and heterogeneous cloud instances such as Azure VNETs or AWS VPCs. As such, there are two immediate, concurrent F5 features that come to mind: Secure interconnectivity of on-premises NetApp volumes (NAS) or LUNs (Block) containing critical documents for ingestion into RAG. Utilize encrypted L3 connectivity between the enterprise location and the cloud instance where the LLM/RAG are instantiated. TCP load balancers are another alternative for volume sharing NAS protocols like NFS or SMB/CIFS. Secure access to the LLM web interface or RESTful API end points, with HTTPS load balancers including key features like WAF, anti-bot mechanisms, and API automatic rate limiting for abusive prompt sources. The following diagram presents the topology this article set out to create, REs are “regional edge” sites maintained internationally by F5 and harness private RE to RE, high-speed global communication links. DNS names, such as the target name of an LLM service, will leverage mappings to anycast IP addresses, thus users entering the RE network from southeast Asian might, for example, enter the Singapore RE while users in Switzerland might enter via a Paris or Frankfurt RE. Complementing the REs are Customer Edge (CE) nodes. These are virtual or physical appliances which act as security demarcation points. For instance, a CE placed in an Azure VNET can protect access to the server supporting the LLM, removing any need for Internet access to the server, which is now entirely accessible only through a private RFC-1918 type of private address. External access to the LLM for just employees or, maybe employees and contractors, or potentially access for the Internet community is enabled by a distributed HTTPS load balancer. In the example depicted above, oriented towards full Internet access, the FQDN of the LLM is projected by the load balancer into the global DNS and consumers of the service resolve the name to one IP address and are attracted to the closest RE by BGP-4’s support for anycast. As the name “distributed” load balancer suggests, the origin pool can be in an entirely different site than the incoming RE, in this case the origin pool is the LLM behind the CE in the Azure VNET. The LLM requests travel from RE to CE via a highspeed networking underlay. The portion of the solution that securely ties the LLM to the source content required for RAG to embed vectors is, in this case, utilizing layer 3 multicloud networking (MCN). The solution is turnkey, routing table are automatically connected to members of the L3 MCN, in this case the inside interfaces of the Azure CE and Redmond, Washington on-premises CE and traffic flows over an encrypted underlay network. As such, the NetApp ONTAP cluster can securely expose volumes with key file ware via a protocol like Network File System (NFS), no risk of data exposure to third-party prying eyes exists. The following diagram drills into the RE and CE and NetApp interplay (double click to expand). F5 Distributed Cloud App Connect and LLM Setup This article speaks to hands-on experience with web-driven LLM inferencing with augmented prompts derived from a RAG implementation. The AI compute was instantiated on an Azure-hosted Ubuntu 20.04 virtual machine with 4 virtual cores. Installed software included Python 3.10, and libraries such as Langchain, Pypdf (for converting pdf documents to text), FAISS (for similarity searching via a vector database), and other libraries. The actual open source LLM utilized for the generative AI is found here on huggingface.co. The binary, which exceeds 4 GB, is considered effective for CPU-based deployments. The embedding LLM model, critical to seed the vector database with entries derived from secured enterprise documentation, and then used again per incoming query for RAG similarity searches to build augmented prompts, was from Hugging Face: sentence-transformers/all-MiniLM-L6-v2 and can be found here. The AI RAG solution was implemented in Python3, and as such the Azure Ubuntu can be accessed both by SSH or via Jupyter Notebooks. The latter was utilized as this is the preferred final delivery mechanism for standard users, not a web chatbot design or the requirement to use API commands through solutions like Postman or Curl. This design choice, to steer the user experience towards Jupyter Notebook consumption, is in keeping with the fact that it has become a standard in AI LLM usage where the LLM is tactical and vital to an enterprise's lines of business (LOBs). Jupyter Notebooks are web-accessed with a browser like Chrome or Edge and as such, F5’s WAF, anti-bot, and L7 DDoS, all part of the F5 WAAP offering, can easily be laid upon an HTTP load balancer with a few mouse clicks in XC to provide premium security to the user experience. NetApp and F5 Distributed Cloud Secure Multicloud Networking The secure access to files for ingestion into the vector database, for similarity searches when user queries are received, makes use of an encrypted L3 Multicloud Network relationship between the Azure VNET and the LAN on prem in Redmond, Washington hosting the NetApp ONTAP cluster. The specific protocol chosen was NFS and the simplicity is demonstrated by the use of just one Linux command to present key, high-valued documents for the AI to populate the database: #mount -t nfs <IP Address of NetAPP LIF interface on-prem>:/Secure_docs_for_RAG /home/ubuntu_restriced_user/rag_project/docs/Secure_docs_for_RAG. This address is available nowhere else in the world except behind this F5 CE in the Azure VNET. After the pdf files are converted to text, chunked to reasonable sizes with some overlap suggested between the end of one chunk and the start of the next chunk, the embedding LLM will populate the vector database. The files are always only accessed remotely by NFS through the mounted volume, and this mount may be terminated until new documents are ready to be added to the solution. The Objective RAG Implementation - Described In order to have a reasonable facsimile of the real-word use cases this solution will empower today, but not having any sensitive documents to be injected, it was decided to use some seminal “Internet Boom”-era IETF Requests for Comment (RFCs) as source content. With the rise of multi-port routing and switching devices, it became apparent the industry badly needed specific and highly precise definitions around network device (router and switch) performance benchmarking to allow purchasers “apples-to-apples” comparisons. These documents recommend testing parameters, such as what frame or packet sizes to test with, test iteration time lengths, when to use FIFO vs LIFO vs LILO definitions of latency, etc. RFC-1242 (Request for Comment, terminology) and RFC-2544 (methodologies), chaired by Scott Bradner of Harvard University, and the later RFC 2285 (LAN switching terminologies), chaired by Bob Mandeville then of European Network Laboratories are three prominent examples, to which test and measurement solutions aspired to be compliant. Detailed LLM answers for quality assurance engineers in the network equipment manufacturing (NEM) space is the intended use case of the design, answers that must be distilled specifically by generative AI considering queries augmented by RAG and specifically only based upon these industry-approved documents. These documents are, of course, not containing trade secrets or patented engineering designs. They are in fact publicly available from the IETF, however they are nicely representative of the value offered in sensitive environments. Validating RAG – Watching the Context Provided to the LLM To ensure RAG was working, the content being augmented in the prompt was displayed to screen, we would expect to see relevant clauses and sentences from the RFCs being provided to the generative AI LLM. Also, if we were to start by asking questions that were outside the purview of this testing/benchmarking topic, we should see the LLM struggle to provide users a meaningful answer. To achieve this, rather than, say, asking what 802.3/Ethernetv2 frame sizes should be used in throughput measurements, and what precisely is the industry standard definition of the term “throughput” was, the question instead pertained to a recent Netflix release, featuring Lindsay Lohan. Due to the recency of the film, even if the LLM leaned upon its pretrained model, it will come up with nothing meaningful. “Question: Important, only use information provided as context in the prompt, do not use other trained knowledge. Please identify who played Heather in the March 2024 Lindsay Lohan Netflix movie titled Irish Wish?” As seen in the following Jupyter screenshot, the RAG solution can only provide augmented prompts from the database, in this case it has some test and measurement clauses and some rules pertaining to the winter ice sport of curling (double click to enlarge). The supportive context data augmenting the prompt was, as expected, not going to help in this off-the-wall query, as only the RFCs and some sports rules had been introduced to the vector database. What is especially noteworthy is the LLM did not simply admit it could not possibly answer this question but hallucinated a very articulate response that indicated “Heather played for her team against Lindsay Lohan’s Team Ireland”. To validate the RAG architecture for a working scenario, actually pertinent to the movie, a film review in pdf format was added to the ONTAP cluster-housed volume. The supportive documentation was consumed (using Pypdf from the Jupyter Notebook it became text). It was automatically chunked and introduced to the vector database through the Hugging Face embedding LLM. With the RAG vector database populated by the embedding LLM, the responses then became meaningful. The following exchange shows the value of RAG in producing meaningful, thought-out answers: Question: Important, only use information provided as context in the prompt, do not use other pre-trained knowledge. Who does Maddie edit novels for in the Lindsay Lohan March 2024 Netflix movie titled Irish Wish? Answer: Let's think step by step. We know that Maddie Kelly plays a book editor who harbors a big secret, and we have been given information about her job as an editor for Paul Kennedy (played by Alexander Vlahos), which is also part of the question context provided above. Therefore it seems logical to infer that she edits novels in general but specifically those written or co-written by author Paul Kennedy - who happens to be a bestselling romance novelist as stated earlier on, so we can safely conclude Maddie works for Paul himself (as opposed to any other book publishing company). This time we see relevant, meaningful context provided by RAG, with relevant portions of the pdf file review document augmenting the prompt. The augmentations are normally not displayed unless requested by the user or programmer, unlike humans, LLMs do not require superfluous text formatting (advantage: LLMs) and thus the content is packed tightly and efficiently. Fewer characters also mean fewer tokens get used by the targeted AI model and can allow more data before a token context limit is reached. In pay-for-use LLM approaches fewer tokens also help the enterprise’s bottom line financially. Also, note that the answer will likely not always be identical with subsequent asks of the same question as per LLM normal behavior. Features like “temperature setting” can also allow more “creative” ideas in responses, injecting humor and even outlandishness if desired. The RAG workflow is now validated, but the LLMs in question (embedding and main generative LLM) can still be made better with these suggestions: Increase “chunk” sizes so ideas are not lost when excessive breaks make for short chunks. Increase “overlap” so an idea/concept is not lost at the demarcation point of two chunks. Most importantly, provide more context from the vector database as context lengths (maximum tokens in a request/response) are generally increasing in size. Llama2, for instance, typically has a 4,096 context length but can now be used with larger values, such as 32,768. This article used only 3 augmentations to the user query, better results could be attained by increasing this value at a potential cost of more CPU cycles. Using Secure RAG – F5 L3 MCN, HTTPS Load Balancers and NetApp ONTAP Together With the RAG architecture validated to be working, the solution was used to assist the target user entering queries to the Azure server by means of Jupyter Notebooks, with RAG documents ingested over encrypted, private networking to the on-premises ONTAP cluster NFS volumes. The questions posed, which are answerable by reading and understanding key portions spread throughout the Scott Bradner RFCs, was: “Important, only use information provided as context in the prompt, do not use other pre-trained knowledge. Please explain the specific definition of throughput? What 802.3 frame sizes should be used for benchmarking? How long should each test iteration last? If you cannot answer the questions exclusively with the details included in the prompt, simply say you are unable to answer the question accurately. Thank you." The Jupyter Notebook representation of this query, which is made in the Python language and issued from the user’s local browser anywhere in the world and directly against the Azure-hosted LLM, looks like the following (click to expand image): The next screenshot demonstrates the result, based upon the provided secure documents (double click to expand). The response is decent, however, the fact that it is clearly using the provided augmentations to the prompt, that is the key objective of this article. The accuracy of the response can be questionable in some areas, the Bradner RFCs highlighted the importance of 64-byte 802.3/Ethernetv2 frame sizes in testing, as line rate forwarding with this minimum size produces the highest theoretically possible frame per second load. In the era of software driven forwarding in switches and routers this was very demanding. Sixty-four byte frames result in 14,881 fps (frames per second) for 10BaseT, 148,809 fps for 100BaseT, 1.48 million fps for Gigabit Ethernet. These values were frequently more aspirational in earlier times and also a frequent metric used in network equipment purchasing cycles. Suspiciously, the LLM response calls out 64kB in 802.3 testing, not 64B, something which seems to be an error. Again, with this architecture, the actual LLM providing the generative AI responses is increasingly viewed as a commodity, alternative LLMs can be plugged quickly and easily into the RAG approach of this Jupyter Notebook. The end user, and thus the enterprise itself, is empowered to utilize different LLMs, purchased or open-source from sites like Hugging Face, to determine optimal results. The other key change that can affect the overall accuracy of results is to experiment with different embedding models. In fact, there are on-line “leader” boards strictly for embedding LLMs so one can quickly swap in and out various popular embedding LLMs to see the impact on results. Summary and Conclusions on F5 and NetApp as Enablers for Secure RAG This article demonstrated an approach to AI usage that leveraged the compute and GPU availability that can be found today within cloud providers such as Azure. To safely access such an AI platform for a production-grade enterprise requirement, F5 Distributed Cloud (XC) provided HTTPS load balancers to connect worker browsers to a Jupyter Notebook service on the AI platform, this service applies advanced security upon the traffic within the XC, from WAF to anti-bot to L3/L7 DDOS protections. Utilizing secure Multicloud Networking (MCN), F5 provided a private L3 connectivity service between the inside interface on an Azure VNET-based CE (customer edge) node and the inside interface of an on-premises CE node in a building in Redmond, Washington. This secure network facilitated an NFS remote volume, content on spindles/flash in on-premises NetApp ONTAP to be remotely mounted on the Azure server. This secure file access provided peace of mind to exposing potentially critical and private materials from NetApp ONTAP volumes to the AI offering. RAG was configured and files were ingested, populating a vector database within the Azure server, that allowed details, ideas, and recommendations to be harnessed by a generative AI LLM by augmenting user prompts with text gleaned from the vector database. Simple examples were used to first demonstrate that RAG was working by posing queries that should not have been addressed by the loaded secure content; such a query was not suitably answered as expected. The feeding of meaningful content from ONTAP was then demonstrated to unleash the potential of AI to address queries based upon meaningful .pdf files. Opportunities to improve results by swapping in and out the main generative AI model, as well as the embedding model, were also considered.
Steve_Gorman
Apr 01, 2024 Place Technical Articles
1.8KViews
3likes
0Comments
Protect multi-cloud and Edge Generative AI applications with F5 Distributed Cloud
F5 Distributed Cloud capabilities allows customers to use a single platform for connectivity, application delivery and security of GenAI applications in any cloud location and at the Edge, with a consistent and simplified operational model, a game changer for streamlined operational experience for DevOps, NetOps and SecOps.
Valentin_Tobi
Mar 11, 2024 Place Technical Articles
1.5KViews
3likes
0Comments
F5 BIG-IP and NetApp StorageGRID - Providing Fast and Scalable S3 API for AI apps
F5 BIG-IP, an industry-leading ADC solution, can provide load balancing services for HTTPS servers, with full security applied in-flight and performance levels to meet any enterprise’s capacity targets. Specific to the S3 API, the object storage and retrieval protocol that rides upon HTTPS, an aligned partnering solution exists from NetApp, which allows a large-scale set of S3 API targets to ingest and provide objects. Automatic backend synchronization allows any node to be offered up as a target by a server load balancer like BIG-IP. This allows overall storage node utilization to be optimized across the node set, and scaled performance to reach the highest S3 API bandwidth levels, all while offering high availability to S3 API consumers. If one node fails or is undergoing maintenance, the overall service continues. S3 compatible storage is becoming popular for AI applications due to its superior performance over traditional protocols such as NFS or CIFS, as well as enabling repatriation of data from the cloud to on-prem. These are scenarios where the amount of data faced is large, this drives the requirement for new levels of scalability and performance; S3 compatible object storages such as NetApp StorageGRID are purpose-built to reach such levels. Sample BIG-IP and StorageGRID Configuration This document is based upon tests and measurements using the following lab configuration. All devices in the lab were virtual machine-based offerings. The S3 service to be projected to the outside world, depicted in the above diagram and delivered to the client via the external network, will use a BIG-IP virtual server (VS) which is tied to an origin pool of three large-capacity StorageGRID nodes. The BIG-IP maintains the integrity of the NetApp nodes by frequent HTTP-based health checks. Should an unhealthy node be detected, it will be dropped from the list of active pool members. When content is written via the S3 protocol to any node in the pool, the other members are synchronized to serve up content should they be selected by BIG-IP for future read requests. The key recommendations and observations in building the lab include: Setup a local certificate authority such that all nodes can be trusted by the BIG-IP. Typically the local CA-signed certificate will incorporate every node’s FQDN and IP address within the listed subject alternate names (SAN) to make the backend solution streamlined with one single certificate. Different F5 profiles, such as FastL4 or FastHTTP, can be selected to reach the right tradeoff between the absolute capacity of stateful traffic load-balanced versus rich layer 7 functions like iRules or authentication. Modern techniques such as multi-part uploads or using HTTP Ranges for downloads can take large objects, and concurrently move smaller pieces across the load balancer, lowering total transaction times, and spreading work over more CPU cores. The S3 protocol, at its core, is a set of REST API calls. To facilitate testing, the widely used S3Browser (www.s3browser.com) was used to quickly and intuitively create S3 buckets on the NetApp offering and send/retrieve objects (files) through the BIG-IP load balancer. Setup the BIG-IP and StorageGrid Systems The StorageGrid solution is an array of storage nodes, provisioned with the help of an administrative host, the “Grid Manager”. For interactive users, no thick client is required as on-board web services allow a streamlined experience all through an Internet browser. The following is an example of Grid Manager, taken from a Chrome browser; one sees the three Storage Nodes setup have been successfully added. The load balancer, in our case the BIG-IP, is set up with a virtual server to support HTTPS traffic and distributed that traffic, which is S3 object storage traffic, to the three StorageGRID nodes. The following screenshot demonstrates that the BIG-IP is setup in a standard HA (active-passive pair) configuration and the three pool members are healthy (green, health checks are fine) and receiving/sending S3 traffic, as the byte counts are seen in the image to be non-zero. On the internal side of the BIG-IP, TCP port 18082 is being used for S3 traffic. To do testing of the solution, including features such as multi-part uploads and downloads, a popular S3 tool, S3Browser, was downloaded and used. The following shows the entirety of the S3Browser setup. Simply create an account (StorageGRID-Account-01 in our example) and point the REST API endpoint at the BIG-IP Virtual Server that is acting as the secure front door for our pool of NetApp nodes. The S3 Access Key ID and Secret values are generated at turn-up time of the NetApp appliances. All S3 traffic will, of course, be SSL/TLS encrypted. BIG-IP will intercept the SSL traffic (high-speed decrypt) and then re-encrypt when proxying the traffic to a selected origin pool member. Other valid load balancer setups exist; one might include an “off load” approach to SSL, whereby the S3 nodes safely co-located in a data center may prefer to receive non-SSL HTTP S3 traffic. This may see an overall performance improvement in terms of peak bandwidth per storage node, but this comes at the tradeoff of security considerations. Experimenting with S3 Protocol and Load Balancing With all the elements in place to start understanding the behavior of S3 and spreading traffic across NetApp nodes, a quick test involved creating a S3 bucket and placing some objects in that new bucket. Buckets are logical collections of objects, conceptually not that different from folders or directories in file systems. In fact, a S3 bucket could even be mounted as a folder in an operating system such as Linux. In their simplest form, most commonly, buckets can simply serve as high-capacity, performant storage and retrieval targets for similarly themed structured or unstructured data. In the first test, we created a new bucket (“audio-clip-bucket”) and uploaded four sample files to the new bucket using S3Browser. We then zeroed the statistics for each pool member on the BIG-IP, to see if even this small upload would spread S3 traffic across more than a single NetApp device. Immediately after the upload, the counters reflect that two StorageGRID nodes were selected to receive S3 transactions. Richly detailed, per-transaction visibility can be obtained by leveraging the F5 SSL Orchestrator (SSLO) feature on the BIG-IP, whereby copies of the bi-directional S3 traffic decrypted within the load balancer can be sent to packet loggers, analytics tools, or even protocol analyzers like Wireshark. The BIG-IP also has an onboard analytics tool, Application Visibility and Reporting (AVR) which can provide some details on the nuances of the S3 traffic being proxied. AVR demonstrates the following characteristics of the above traffic, a simple bucket creation and upload of 4 objects. With AVR, one can see the URL values used by S3, which include the bucket name itself as well as transactions incorporating the object names as URLs. Also, the HTTP methods used included both GETS and PUTS. The use of HTTP PUT is expected when creating a new bucket. S3 is not governed by a typical standards body document, such as an IETF Request for Comment (RFC), but rather has evolved out of AWS and their use of S3 since 2006. For details around S3 API characteristics and nomenclature, this site can be referenced. For example, the expected syntax for creating a bucket is provided, including the fact that it should be an HTTP PUT to the root (/) URL target, with the bucket configuration parameters including name provided within the HTTP transaction body. Achieving High Performance S3 with BIG-IP and StorageGRID A common concern with protocols, such as HTTP, is head-of-line blocking, where one large, lengthy transaction blocks subsequent desired, and now queued transactions. This is one of the reasons for parallelism in HTTP 1.X, where loading 30 or more objects to paint a web page will often utilize two, four, or even more concurrent TCP sessions. Another performance issue when dealing with very large transactions is, without parallelism, even those most performant networks will see an established TCP session reach a maximum congestion window (CWND) where no more segments may be in put inflight until new TCP ACKs arrive back. Advanced TCP options like TCP exponential windowing or TCP SACK can help, but regardless of this, the achievable bandwidth of any one TCP session is bounded and may also frequently task only one core in multi-core CPUs. With the BIG-IP serving as the intermediary, large S3 transactions may default to “multi-part” uploads and downloads. The larger objects become a series of smaller objects that conveniently can be load-balanced by BIG-IP across the entire cluster of NetApp nodes. As displayed in the following diagram, we are asking for multi-part uploads to kick in for objects larger than 5 megabytes. After uploading a 20-megabyte file (technically, 20,000,000 bytes) the BIG-IP shows the traffic distributed across multiple NetApp nodes to the tune of 160.9 million bits. The incoming bits, incoming from the perspective of the origin pool members, confirm the delivery of the object with a small amount of protocol overhead (bits divided by eight to reach bytes). The value of load balancing manageable chunks of very large objects will pay dividends over time with faster overall transaction completion times due to the spreading of traffic across NetApp nodes, more TCP sessions reaching high congestion window values, and no single-core bottle necks in multicore equipment. Tuning BIG-IP for High Performance S3 Service Delivery The F5 BIG-IP offers a set of different profiles it can run its Local Traffic Manager (LTM) module in accordance with; LTM is the heart of the server load balancing function. The most performant profile in terms of attainable traffic load is the “FastL4” profile. This, and other profiles such as “OneConnect” or “FastHTTP”, can be tied to a virtual server, and details around each profile can be found here within the BIG-IP GUI: The FastL4 profile can increase virtual server performance and throughput for supported platforms by using the embedded Packet Velocity Acceleration (ePVA) chip to accelerate traffic. The ePVA chip is a hardware acceleration field programmable gate array (FPGA) that delivers high-performance L4 throughput by offloading traffic processing to the hardware acceleration chip. The BIG-IP makes flow acceleration decisions in software and then offloads eligible flows to the ePVA chip for that acceleration. For platforms that do not contain the ePVA chip, the system performs acceleration actions in software. Software-only solutions can increase performance in direct relationship to the hardware offered by the underlying host. As examples of BIG-IP virtual edition (VE) software running on mid-grade hardware platforms, results with Dell can be found here and similar experiences with HPE Proliant platforms are here. One thing to note about FastL4 as the profile to underpin a performance mode BIG-IP virtual server is that it is layer 4 oriented. For certain features that involve layer 7 HTTP related fields, such as using iRules to swap HTTP headers or perform HTTP authentication, a different profile might be more suitable. A bonus of FastL4 are some interesting specific performance features catering to it. In the BIG-IP version 17 release train, there is a feature to quickly tear down, with no delay, TCP sessions no longer required. Most TCP stacks implement TCP “2MSL” rules, where upon receiving and sending TCP FIN messages, the socket enters a lengthy TCP “TIME_WAIT” state, often minutes long. This stems back to historically bad packet loss environments of the very early Internet. A concern was high latency and packet loss might see incoming packets arrive at a target very late, and the TCP state machine would be confused if no record of the socket still existed. As such, the lengthy TIME_WAIT period was adopted even though this is consuming on-board resources to maintain the state. With FastL4, the “fast” close with TCP reset option now exists, such that any incoming TCP FIN message observed by BIG-IP will result in TCP RESETS being sent to both endpoints, normally bypassing TIME_WAIT penalties. OneConnect and FastHTTP Profiles As mentioned, other traffic profiles on BIG-IP are directed towards Layer 7 and HTTP features. One interesting profile is F5’s “OneConnect”. The OneConnect feature set works with HTTP Keep-Alives, which allows the BIG-IP system to minimize the number of server-side TCP connections by making existing connections available for reuse by other clients. This reduces, among other things, excessive TCP 3-way handshakes (Syn, Syn-Ack, Ack) and mitigates the small TCP congestion windows that new TCP sessions start with and only increases with successful traffic delivery. Persistent server-side TCP connections ameliorate this. When a new connection is initiated to the virtual server, if an existing server-side flow to the pool member is idle, the BIG-IP system applies the OneConnect source mask to the IP address in the request to determine whether it is eligible to reuse the existing idle connection. If it is eligible, the BIG-IP system marks the connection as non-idle and sends a client request over it. If the request is not eligible for reuse, or an idle server-side flow is not found, the BIG-IP system creates a new server-side TCP connection and sends client requests over it. The last profile considered is the “Fast HTTP” profile. The Fast HTTP profile is designed to speed up certain types of HTTP connections and again strives to reduce the number of connections opened to the back-end HTTP servers. This is accomplished by combining features from the TCP, HTTP, and OneConnect profiles into a single profile that is optimized for network performance. A resulting high performance HTTP virtual server processes connections on a packet-by-packet basis and buffers only enough data to parse packet headers. The performance HTTP virtual server TCP behavior operates as follows: the BIG-IP system establishes server-side flows by opening TCP connections to pool members. When a client makes a connection to the performance HTTP virtual server, if an existing server-side flow to the pool member is idle, the BIG-IP LTM system marks the connection as non-idle and sends a client request over the connection. Summary The NetApp StorageGRID multi-node S3 compatible object storage solution fits well with a high-performance server load balancer, thus making the F5 BIG-IP a good fit. S3 protocol can itself be adjusted to improve transaction response times, such as through the use of multi-part uploads and downloads, amplifying the default load balancing to now spread even more traffic chunks over many NetApp nodes. BIG-IP has numerous approaches to configuring virtual servers, from highest performance L4-focused profiles to similar offerings that retain L7 HTTP awareness. Lab testing was accomplished using the S3Browser utility and results of traffic flows were confirmed with both the standard BIG-IP GUI and the additional AVR analytics module, which provides additional protocol insight.
Steve_Gorman
Sep 16, 2024 Place Technical Articles
1.5KViews
3likes
0Comments
Run AI LLMs Centrally and Protect AI Inferencing with F5 Distributed Cloud API Security
This
Steve_Gorman
Mar 05, 2024 Place Technical Articles
1.2KViews
2likes
0Comments
Scalable AI Deployment: Harnessing OpenVINO and NGINX Plus for Efficient Inference
Introduction In the realm of artificial intelligence (AI) and machine learning (ML), the need for scalable and efficient AI inference solutions is paramount. As organizations deploy increasingly complex AI models to solve real-world problems, ensuring that these models can handle high volumes of inference requests becomes critical. NGINX Plus serves as a powerful ally in managing incoming traffic efficiently. As a high-performance web server and reverse proxy server, NGINX Plus is adept at load balancing and routing incoming HTTP and TCP traffic across multiple instances of AI model serving environments. The OpenVINO Model Server, powered by Intel's OpenVINO toolkit, is a versatile inference server supporting various deep learning frameworks and hardware acceleration technologies. It allows developers to deploy and serve AI models efficiently, optimizing performance and resource utilization. When combined with NGINX Plus capabilities, developers can create resilient and scalable AI inference solutions capable of handling high loads and ensuring high availability. Health checks allow NGINX Plus to continuously monitor the health of the upstream OVMS instances. If an OVMS instance becomes unhealthy or unresponsive, NGINX Plus can automatically route traffic away from it, ensuring that inference requests are processed only by healthy OVMS instances. Health checks provide real-time insights into the health status of OVMS instances. Administrators can monitor key metrics such as response time, error rate, and availability, allowing them to identify and address issues proactively before they impact service performance. In this article, we'll delve into the symbiotic relationship between the OpenVINO Model Server, and NGINX Plus to construct a robust and scalable AI inference solution. We'll explore setting up the environment, configuring the model server, harnessing NGINX Plus for load balancing, and conducting testing. By the end, readers will gain insights into how to leverage Docker, the OpenVINO Model Server, and NGINX Plus to build scalable AI inference systems tailored to their specific needs. Flow explanation: Now, let's walk through the flow of a typical inference request. When a user submits an image of a zebra for inference, the request first hits the NGINX load balancer. The load balancer then forwards the request to one of the available OpenVINO Model Server containers, distributing the workload evenly across multiple containers. The selected container processes the image using the optimized deep-learning model and returns the inference results to the user. In this case, the object is named zebra. OpenVINO™ Model Server is a scalable, high-performance solution for serving machine learning models optimized for Intel® architectures. The server provides an inference service via gRPC, REST API, or C API -- making it easy to deploy new algorithms and AI experiments. You can visit https://hub.docker.com/u/openvino for reference. Setting up: We'll begin by deploying model servers within containers. For this use case, I'm deploying the model server on a virtual machine (VM). Let's outline the steps to accomplish this: Get the docker image for OpenVINO ONNX run time docker pull openvino/onnxruntime_ep_ubuntu20 You can also visit https://docs.openvino.ai/nightly/ovms_docs_deploying_server.html for OpenVINO model server deployment in a container environment. Begin by creating a docker-compose file following the structure below: https://raw.githubusercontent.com/f5businessdevelopment/F5openVino/main/docker-compose.yml version: '3' services: resnet1: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9001 volumes: - ./models:/models ports: - "9001:9001" resnet2: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9002 volumes: - ./models:/models ports: - "9002:9002" # Add more services for additional containers resnet3: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9003 volumes: - ./models:/models ports: - "9003:9003" resnet4: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9004 volumes: - ./models:/models ports: - "9004:9004" resnet5: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9005 volumes: - ./models:/models ports: - "9005:9005" resnet6: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9006 volumes: - ./models:/models ports: - "9006:9006" resnet7: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9007 volumes: - ./models:/models ports: - "9007:9007" resnet8: image: openvino/model_server:latest command: > --model_name=resnet --model_path=/models/resnet50 --layout=NHWC:NCHW --port=9008 volumes: - ./models:/models ports: - "9008:9008" Make sure you have Docker and Docker Compose installed on your system. Place your model files in the `./models/resnet50` directory on your local machine. Save the provided Docker Compose configuration to a file named `docker-compose.yml`. Run the following command in the directory containing the `docker-compose.yml` file to start the services: docker-compose up -d You can now access the OpenVINO Model Server instances using the specified ports (e.g., `http://localhost:9001` for `resnet1` and `http://localhost:9002` for `resnet2`). - Ensure that the model files are correctly placed in the `./models/resnet50` directory before starting the services. Set up an NGINX Plus proxy server. You can refer to https://docs.nginx.com/nginx/admin-guide/installing-nginx/installing-nginx-plus/ for NGINX Plus installation also You have the option to configure VMs with NGINX Plus on AWS by either: Utilizing the link provided below, which guides you through setting up NGINX Plus on AWS via the AWS Marketplace: NGINX Plus on AWS Marketplace or Following the instructions available on GitHub at the provided repository link. This repository facilitates spinning up VMs using Terraform on AWS and deploying VMs with NGINX Plus under the GitHub repository - F5 OpenVINO The NGINX Plus proxy server functions as a proxy for upstream model servers. Within the upstream block, backend servers (model_servers) are defined along with their respective IP addresses and ports. In the server block, NGINX listens on port 80 to handle incoming HTTP/2 requests targeting the specified server name or IP address. Requests directed to the root location (/) are then forwarded to the upstream model servers utilizing the gRPC protocol. The proxy_set_header directives are employed to maintain client information integrity while passing requests to the backend servers. Ensure to adjust the IP addresses, ports, and server names according to your specific setup. Here is an example configuration that is also available at GitHub https://github.com/f5businessdevelopment/F5openVino upstream model_servers { server 172.17.0.1:9001; server 172.17.0.1:9002; server 172.17.0.1:9003; server 172.17.0.1:9004; server 172.17.0.1:9005; server 172.17.0.1:9006; server 172.17.0.1:9007; server 172.17.0.1:9008; zone model_servers 64k; } server { listen 80 http2; server_name 10.0.0.19; # Replace with your domain or public IP location / { grpc_pass grpc://model_servers; health_check type=grpc grpc_status=12; # 12=unimplemented proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } } If you are using gRPC with SSL please refer to the detailed configuration at NGINX Plus SSL Configuration Here is the explanation: upstream model_servers { server 172.17.0.1:9001; # Docker bridge network IP and port for your container server 172.17.0.1:9002; # Docker bridge network IP and port for your container .... .... } This section defines an upstream block named model_servers, which represents a group of backend servers. In this case, there are two backend servers defined, each with its IP address and port. These servers are typically the endpoints that NGINX will proxy requests to. server { listen 80 http2; server_name 10.1.1.7; # Replace with your domain or public IP This part starts with the main server block. It specifies that NGINX should listen for incoming connections on port 80 using the HTTP/2 protocol (http2), and it binds the server to the IP address 10.1.1.7. Replace this IP address with your domain name or public IP address. location / { grpc_pass grpc://model_servers; health_check type=grpc grpc_status=12; # 12=unimplemented proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } Within the location/block, NGINX defines how to handle requests to the root location. In this case, it's using gRPC (grpc_pass grpc://model_servers;) to pass the requests to the upstream servers defined in the model_servers block. The proxy_set_header directives are used to set headers that preserve client information when passing requests to the backend servers. These headers include Host, X-Real-IP, and X-Forwarded-For. Health checks with type=grpc enable granular monitoring of individual gRPC services and endpoints. You can verify the health of specific gRPC methods or functionalities, ensuring each service component is functioning correctly. In summary, this NGINX configuration sets up a reverse proxy server that listens for HTTP/2 requests on port 80 and forwards them to backend servers (model_servers) using the gRPC protocol. It's commonly used for load balancing or routing requests to multiple backend servers. Inference Testing: This is how you can conduct testing. On the client side, we utilize a script named predict.py. Below is the script for reference # Import necessary libraries import numpy as np from classes import imagenet_classes from ovmsclient import make_grpc_client # Create a gRPC client to communicate with the server # Replace "10.1.1.7:80" with the IP address and port of your server client = make_grpc_client("10.1.1.7:80") # Open the image file "zebra.jpeg" in binary read mode with open("zebra.jpeg", "rb") as f: img = f.read() # Send the image data to the server for prediction using the "resnet" model output = client.predict({"0": img}, "resnet") # Extract the index of the predicted class with the highest probability result_index = np.argmax(output[0]) # Print the predicted class label using the imagenet_classes dictionary print(imagenet_classes[result_index]) This script imports necessary libraries, establishes a connection to the server at the specified IP address and port, reads an image file named "zebra.jpeg," sends the image data to the server for prediction using the "resnet" model, retrieves the predicted class index with the highest probability, and prints the corresponding class label. Results: Execute the following command from the client machine. Here, we are transmitting this image of Zebra to the model server. python3 predict.py zebra.jpg #run the Inference traffic zebra. The prediction output is 'zebra'. Let's now examine the NGINX Plus logs cat /var/log/nginx/access.log 10.1.1.7 - - [13/Apr/2024:00:18:52 +0000] "POST /tensorflow.serving.PredictionService/Predict HTTP/2.0" 200 4033 "-" "grpc-python/1.62.1 grpc-c/39.0.0 (linux; chttp2)" This log entry shows that a POST request was made to the NGINX server at the specified timestamp, and the server responded with a success status code (200). The request was made using gRPC, as indicated by the user agent string. Conclusion : Using NGINX Plus, organizations can achieve a scalable and efficient AI inference solution. NGINX Plus can address disruptions caused by connection timeouts/errors, sudden spikes in request rates, or changes in network topology. OpenVINO Model Server optimizes model performance and inference speed, utilizing Intel hardware acceleration for enhanced efficiency. NGINX Plus acts as a high-performance load balancer, distributing incoming requests across multiple model server instances for improved scalability and reliability. Together, this enables seamless scaling of AI inference workloads, ensuring optimal performance and resource utilization. You can look at this video for reference: https://youtu.be/Sd99woO9FmQ References: https://hub.docker.com/u/openvino https://docs.nginx.com/nginx/deployment-guides/amazon-web-services/high-availability-keepalived/ https://www.nginx.com/blog/nginx-1-13-10-grpc/ https://github.com/f5businessdevelopment/F5openVino.git https://docs.openvino.ai/nightly/ovms_docs_deploying_server.html
Sanjay_Shitole
Apr 24, 2024 Place Technical Articles
1.1KViews
1like
0Comments