AI
24 TopicsEnhance your GenAI chatbot with the power of Agentic RAG and F5 platform
Agentic RAG (Retrieval-Augmented Generation) enhances the capabilities of a GenAI chatbot by integrating dynamic knowledge retrieval into its conversational abilities, making it more context-aware and accurate. In this demo, I will demonstrate an autonomous decision-making GenAI chatbot utilizing Agentic RAG. I will explore what Agentic RAG is and why it's crucial in today's AI landscape. I will also discuss how organizations can leverage GPUaaS (GPU as a Service) or AI Factory providers to accelerate their AI strategy. F5 platform provides robust security features that protect sensitive data while ensuring high availability and performance. They optimize the chatbot by streamlining traffic management and reducing latency, ensuring smooth interactions even during high demand. This integration ensures the GenAI chatbot is not only smart but also reliable and secure for enterprise use.341Views1like0CommentsMitigate OWASP LLM Security Risk: Sensitive Information Disclosure Using F5 NGINX App Protect
This short WAF security article covered the critical security gaps present in current generative AI applications, emphasizing the urgent need for robust protection measures in LLM design deployments. Finally we also demonstrated how F5 Nginx App Protect v5 offers an effective solution to mitigate the OWASP LLM Top 10 risks.244Views2likes0CommentsHow to Prepare Your Network Infrastructure to Add HPC Clusters for AI to Your Data Center
HPC AI clusters are getting deployed as highly-engineered 'lego blocks' which are opaque to established data center operations and standards. By taking advantage of established Kubernetes based networking solutions that provide high-speed intelligent networking, you can save yourself from expensive cost overruns, data center re-auditing, and delays. By using Kubernetes based solutions which take advantage of the high-speed networking solutions already required by HP AI deployments, you further optimize your investment in AI.188Views4likes0CommentsF5 BIG-IP and NetApp StorageGRID - Providing Fast and Scalable S3 API for AI apps
F5 BIG-IP, an industry-leading ADC solution, can provide load balancing services for HTTPS servers, with full security applied in-flight and performance levels to meet any enterprise’s capacity targets. Specific to S3 API, the object storage and retrieval protocol that rides upon HTTPS, an aligned partnering solution exists from NetApp, which allows a large-scale set of S3 API targets to ingest and provide objects. Automatic backend synchronization allows any node to be offered up at a target by a server load balancer like BIG-IP. This allows overall storage node utilization to be optimized across the node set, and scaled performance to reach the highest S3 API bandwidth levels, all while offering high availability to S3 API consumers. S3 compatible storage is becoming popular for AI applications due to its superior performance over traditional protocols such as NFS or CIFS, as well as enabling repatriation of data from the cloud to on-prem. These are scenarios where the amount of data faced is large, this drives the requirement for new levels of scalability and performance; S3 compatible object storages such as NetApp StorageGRID are purpose-built to reach such levels. Sample BIG-IP and StorageGRID Configuration This document is based upon tests and measurements using the following lab configuration. All devices in the lab were virtual machine-based offerings. The S3 service to be projected to the outside world, depicted in the above diagram and delivered to the client via the external network, will use a BIG-IP virtual server (VS) which is tied to an origin pool of three large-capacity StorageGRID nodes. The BIG-IP maintains the integrity of the NetApp nodes by frequent HTTP-based health checks. Should an unhealthy node be detected, it will be dropped from the list of active pool members. When content is written via the S3 protocol to any node in the pool, the other members are synchronized to serve up content should they be selected by BIG-IP for future read requests. The key recommendations and observations in building the lab include: Setup a local certificate authority such that all nodes can be trusted by the BIG-IP. Typically the local CA-signed certificate will incorporate every node’s FQDN and IP address within the listed subject alternate names (SAN) to make the backend solution streamlined with one single certificate. Different F5 profiles, such as FastL4 or FastHTTP, can be selected to reach the right tradeoff between the absolute capacity of stateful traffic load-balanced versus rich layer 7 functions like iRules or authentication. Modern techniques such as multi-part uploads or using HTTP Ranges for downloads can take large objects, and concurrently move smaller pieces across the load balancer, lowering total transaction times, and spreading work over more CPU cores. The S3 protocol, at its core, is a set of REST API calls. To facilitate testing, the widely used S3Browser (www.s3browser.com) was used to quickly and intuitively create S3 buckets on the NetApp offering and send/retrieve objects (files) through the BIG-IP load balancer. Setup the BIG-IP and StorageGrid Systems The StorageGrid solution is an array of storage nodes, provisioned with the help of an administrative host, the “Grid Manager”. For interactive users, no thick client is required as on-board web services allow a streamlined experience all through an Internet browser. The following is an example of Grid Manager, taken from a Chrome browser; one sees the three Storage Nodes setup have been successfully added. The load balancer, in our case, the BIG-IP, is set up with a virtual server to support HTTPS traffic and distributed that traffic, which is S3 object storage traffic, to the three StorageGRID nodes. The following screenshot demonstrates that the BIG-IP is setup in a standard HA (active-passive pair) configuration and the three pool members are healthy (green, health checks are fine) and receiving/sending S3 traffic, as the byte counts are seen in the image to be non-zero. On the internal side of the BIG-IP, TCP port 18082 is being used for S3 traffic. To do testing of the solution, including features such as multi-part uploads and downloads, a popular S3 tool, S3Browser, was downloaded and used. The following shows the entirety of the S3Browser setup. Simply create an account (StorageGRID-Account-01 in our example) and point the REST API endpoint at the BIG-IP Virtual Server that is acting as the secure front door for our pool of NetApp nodes. The S3 Access Key ID and Secret values are generated at turn-up time of the NetApp appliances. All S3 traffic will, of course, be SSL/TLS encrypted. BIG-IP will intercept the SSL traffic (high-speed decrypt) and then re-encrypt when proxying the traffic to a selected origin pool member. Other valid load balancer setups exist; one might include an “off load” approach to SSL, whereby the S3 nodes safely co-located in a data center may prefer to receive non-SSL HTTP S3 traffic. This may see an overall performance improvement in terms of peak bandwidth per storage node, but this comes at the tradeoff of security considerations. Experimenting with S3 Protocol and Load Balancing With all the elements in place to start understanding the behavior of S3 and spreading traffic across NetApp nodes, a quick test involved creating a S3 bucket and placing some objects in that new bucket. Buckets are logical collections of objects, conceptually not that different from folders or directories in file systems. In fact, a S3 bucket could even be mounted as a folder in an operating system such as Linux. In their simplest form, most commonly, buckets can simply serve as high-capacity, performant storage and retrieval targets for similarly themed structured or unstructured data. In the first test, we created a new bucket (“audio-clip-bucket”) and uploaded four sample files to the new bucket using S3Browser. We then zeroed the statistics for each pool member on the BIG-IP, to see if even this small upload would spread S3 traffic across more than a single NetApp device. Immediately after the upload, the counters reflect that two StorageGRID nodes were selected to receive S3 transactions. Richly detailed, per-transaction visibility can be obtained by leveraging the F5 SSL Orchestrator (SSLO) feature on the BIG-IP, whereby copies of the bi-directional S3 traffic decrypted within the load balancer can be sent to packet loggers, analytics tools, or even protocol analyzers like Wireshark. The BIG-IP also has an onboard analytics tool, Application Visibility and Reporting (AVR) which can provide some details on the nuances of the S3 traffic being proxied. AVR demonstrates the following characteristics of the above traffic, a simple bucket creation and upload of 4 objects. With AVR, one can see the URL values used by S3, which include the bucket name itself as well as transactions incorporating the object names at URLs. Also, the HTTP methods used included both GETS and PUTS. The use of HTTP PUT is expected when creating a new bucket. S3 is not governed by a typical standards body document, such as an IETF Request for Comment (RFC), but rather has evolved out of AWS and their use of S3 since 2006. For details around S3 API characteristics and nomenclature, this site can be referenced. For example, the expected syntax for creating a bucket is provided, including the fact that it should be an HTTP PUT to the root (/) URL target, with the bucket configuration parameters including name provided within the HTTP transaction body. Achieving High Performance S3 with BIG-IP and StorageGRID A common concern with protocols, such as HTTP, is head-of-line blocking, where one large, lengthy transaction blocks subsequent desired, queued transactions. This is one of the reasons for parallelism in HTTP, where loading 30 or more objects to paint a web page will often utilize two, four, or even more concurrent TCP sessions. Another performance issue when dealing with very large transactions is, without parallelism, even those most performant networks will see an established TCP session reach a maximum congestion window (CWND) where no more segments may be in flight until new TCP ACKs arrive back. Advanced TCP options like TCP exponential windowing or TCP SACK can help, but regardless of this, the achievable bandwidth of any one TCP session is bounded and may also frequently task only one core in multi-core CPUs. With the BIG-IP serving as the intermediary, large S3 transactions may default to “multi-part” uploads and downloads. The larger objects become a series of smaller objects that conveniently can be load-balanced by BIG-IP across the entire cluster of NetApp nodes. As displayed in the following diagram, we are asking for multi-part uploads to kick in for objects larger than 5 megabytes. After uploading a 20-megabyte file (technically, 20,000,000 bytes) the BIG-IP shows the traffic distributed across multiple NetApp nodes to the tune of 160.9 million bits. The incoming bits, incoming from the perspective of the origin pool members, confirm the delivery of the object with a small amount of protocol overhead (bits divided by eight to reach bytes). The value of load balancing manageable chunks of very large objects will pay dividends over time with faster overall transaction completion times due to the spreading of traffic across NetApp nodes, more TCP sessions reaching high congestion window values, and no single-core bottle necks in multicore equipment. Tuning BIG-IP for High Performance S3 Service Delivery The F5 BIG-IP offers a set of different profiles it can run via its Local Traffic Manager (LTM) module in accordance with; LTM is the heart of the server load balancing function. The most performant profile in terms of attainable traffic load is the “FastL4” profile. This, and other profiles such as “OneConnect” or “FastHTTP”, can be tied to a virtual server, and details around each profile can be found here within the BIG-IP GUI: The FastL4 profile can increase virtual server performance and throughput for supported platforms by using the embedded Packet Velocity Acceleration (ePVA) chip to accelerate traffic. The ePVA chip is a hardware acceleration field programmable gate array (FPGA) that delivers high-performance L4 throughput by offloading traffic processing to the hardware acceleration chip. The BIG-IP makes flow acceleration decisions in software and then offloads eligible flows to the ePVA chip for that acceleration. For platforms that do not contain the ePVA chip, the system performs acceleration actions in software. Software-only solutions can increase performance in direct relationship to the hardware offered by the underlying host. As examples of BIG-IP virtual edition (VE) software running on mid-grade hardware platforms, results with Dell can be found here and similar experiences with HPE Proliant platforms are here. One thing to note about FastL4 as the profile to underpin a performance mode BIG-IP virtual server is that it is layer 4 oriented. For certain features that involve layer 7 HTTP related fields, such as using iRules to swap HTTP headers or perform HTTP authentication, a different profile might be more suitable. A bonus of FastL4 are some interesting specific performance features catering to it. In the BIG-IP version 17 release train, there is a feature to quickly tear down, with no delay, TCP sessions no longer required. Most TCP stacks implement TCP “2MSL” rules, where upon receiving and sending TCP FIN messages, the socket enters a lengthy TCP “TIME_WAIT” state, often minutes long. This stems back to historically bad packet loss environments of the very early Internet. A concern was high latency and packet loss might see incoming packets arrive at a target very late, and the TCP state machine would be confused if no record of the socket still existed. As such, the lengthy TIME_WAIT period was adopted even though this is consuming on-board resources to maintain the state. With FastL4, the “fast” close with TCP reset option now exists, such that any incoming TCP FIN message observed by BIG-IP will result in TCP RESETS being sent to both endpoints, normally bypassing TIME_WAIT penalties. OneConnect and FastHTTP Profiles As mentioned, other traffic profiles on BIG-IP are directed towards Layer 7 and HTTP features. One interesting profile is F5’s “OneConnect”. The OneConnect feature set works with HTTP Keep-Alives, which allows the BIG-IP system to minimize the number of server-side TCP connections by making existing connections available for reuse by other clients. This reduces, among other things, excessive TCP 3-way handshakes (Syn, Syn-Ack, Ack) and mitigates the small TCP congestion windows that new TCP sessions start with and only increases with successful traffic delivery. Persistent server-side TCP connections ameliorate this. When a new connection is initiated to the virtual server, if an existing server-side flow to the pool member is idle, the BIG-IP system applies the OneConnect source mask to the IP address in the request to determine whether it is eligible to reuse the existing idle connection. If it is eligible, the BIG-IP system marks the connection as non-idle and sends a client request over it. If the request is not eligible for reuse, or an idle server-side flow is not found, the BIG-IP system creates a new server-side TCP connection and sends client requests over it. The last profile considered is the “Fast HTTP” profile. The Fast HTTP profile is designed to speed up certain types of HTTP connections and again strives to reduce the number of connections opened to the back-end HTTP servers. This is accomplished by combining features from the TCP, HTTP, and OneConnect profiles into a single profile that is optimized for network performance. A resulting high performance HTTP virtual server processes connections on a packet-by-packet basis and buffers only enough data to parse packet headers. The performance HTTP virtual server TCP behavior operates as follows: the BIG-IP system establishes server-side flows by opening TCP connections to pool members. When a client makes a connection to the performance HTTP virtual server, if an existing server-side flow to the pool member is idle, the BIG-IP LTM system marks the connection as non-idle and sends a client request over the connection. Summary The NetApp StorageGRID multi-node S3 compatible object storage solution fits well with a high-performance server load balancer, thus making the F5 BIG-IP a good fit. S3 protocol can itself be adjusted to improve transaction response times, such as through the use of multi-part uploads and downloads, amplifying the default load balancing to now spread even more traffic chunks over many NetApp nodes. BIG-IP has numerous approaches to configuring virtual servers, from highest performance L4-focused profiles to similar offerings that retain L7 HTTP awareness. Lab testing was accomplished using the S3Browser utility and results of traffic flows were confirmed with both the standard BIG-IP GUI and the additional AVR analytics module, which provides additional protocol insight.300Views3likes0CommentsSSL Orchestrator Advanced Use Cases: Detecting Generative AI
Introduction Quick, take a look at the following list and answer this question: "What do these movies have in common?" 2001: A Space Odyssey Westworld Tron WarGames Electric Dreams The Terminator The Matrix Eagle Eye Ex Machina Avengers: Age of Ultron M3GAN If you answered, "They're all about artificial intelligence", yes, but... If you answered, "They're all about artificial intelligence that went terribly, sometimes horribly wrong", you'd be absolutely correct. The simple fact is...artificial intelligence (AI) can be scary. Proponents for, and opponents against will disagree on many aspects, but they can all at least acknowledge there's a handful of ways to do AI correctly...and a million ways to do it badly. Not to be an alarmist, but whileSkyNet was fictional, semi-autonomousguns on robot dogs is not... But then why am I talking about this on a technical forum you may ask? Well, when most of the above films were made, AI was largely still science fiction. That's clearly not the case anymore, and tools like ChatGPT are just the tip of the coming AI frontier. To be fair, I don't make the claim that all AI is bad, and many have indeed lauded ChatGPT and other generative AI tools as the next great evolution in technology. But it's also fair to say that generative AI tools, like ChatGPT, have a very real potential to cause harm. At the very least, these tools can be convincing, even when they're wrong. And worse, they could lead to sensitive information disclosures. One only has to do a cursory search to find a few examples of questionable behavior: Lawyers File Motion Written by AI, Face Sanctions and Possible Disbarment Higher Ed Beware: 10 Dangers of ChatGPT Schools Need to Know ChatGPT and AI in the Workplace: Should Employers Be Concerned? OpenAI's New Chatbot Will Tell You How to Shoplift and Make Explosives Giant Bank JP Morgan Bans ChatGPT Use Among Employees Samsung Bans ChatGPT Among Employees After Sensitive Code Leak But again...what does this have to do with a technical forum? And more important, what does this have to do with you? Simply stated, if you are in an organization where generative AI toolscould be abused, understanding, and optionally controlling how and when these tools are accessed, could help to prevent the next big exploit or disclosure. If you search beyond the above links, you'll find an abundance of information on both the benefits, and security concerns of AI technologies. And ultimately you'll still be left to decide if these AI tools are safe for your organization. It may simply be worthwhile to understand WHAT tools are being used. And in some cases, it may be important to disable access to these. Given the general depth and diversity of AI functions within arms-reach today, and growing, it'd be irresponsible to claim "complete awareness". The bulk of these functions are delivered over standard HTTPS, so the best course of action will be to categorize on known assets, and adjust as new ones come along. As of the publishing of this article, the industry has yet to define a standard set of categories for AI, and specifically, generative AI. So in this article, we're going to build one and attach that to F5 BIG-IP SSL Orchestrator to enable proactive detection and optional control of Internet-based AI tool access in your organization. Let's get started! BIG-IP SSL Orchestrator Use Case: Detecting Generative AI The real beauty of this solution is that it can be implemented faster than it probably took to read the above introduction. Essentially, you're going to create a custom URL category on F5 BIG-IP, populate that with known generative AI URLs, and employ that custom category in a BIG-IP SSL Orchestrator security policy rule. Within that policy rule, you can elect to dynamically decrypt and send the traffic to the set of inspection products in your security enclave. Step 1: Create the custom URL category and populate with known AI URLs - Access the BIG-IP command shell and run the following command. This will initiate a script that creates and populates the URL category: curl -s https://raw.githubusercontent.com/f5devcentral/sslo-script-tools/main/sslo-generative-ai-categories/sslo-create-ai-category.sh |bash Step 2: Create a BIG-IP SSL Orchestrator policy rule to use this data - The above script creates/re-populates a custom URL category named SSLO_GENERATIVE_AI_CHAT, and inthat category is a set of known generative AI URLs. To use, navigate to the BIG-IP SSL Orchestrator UI and edit a Security Policy. Click add to create a new rule, use the "Category Lookup (All)" policy condition, then add the above URL category. Set the Action to "Allow", SSL Proxy Action to "Intercept", and Service Chain to whatever service chain you've already created. With Summary Logging enabled in the BIG-IP SSL Orchestrator topology configuration, you'll also get Syslog reporting for each AI resource match - who made the request, to what, and when. The URL category is employed here to identify known AI tools. In this instance, BIG-IP SSL Orchestrator is used to make that assessment and act on it (i.e. allow, TLS intercept, service chain, log). Should you want even more granular control over conditions and actions of the decrypted AI tool traffic, you can also deploy an F5 Secure Web Gateway Services policy inside the SSL Orchestrator service chain. With SWG, you can expand beyond simple detection and blocking, and build more complex rules to decide who can access, when, and how. It should be said that beyond logging, allowing, or denying access to generative AI tools, SSL Orchestrator is also going to provide decryption and the opportunity to dynamically steer the decrypted AI traffic to any set of security products best suited to protect against any potential malware. Summary As previously alluded, this is not an exhaustive list of AI tool URLs. Not even close. But it contains the most common you'll see in the wild. The above script populates with an initial list of URLs that you are free to update as you become aware of new one. And of course we invite you to recommend additional AI tools to add to this list. References:https://github.com/f5devcentral/sslo-script-tools/tree/main/sslo-generative-ai-categories1.9KViews4likes1CommentSecuring the LLM User Experience with an AI Firewall
As artificial intelligence (AI) seeps into the core day-to-day operations of enterprises, a need exists to exert control over the intersection point of AI-infused applications and the actual large language models (LLMs) that answer the generated prompts. This control point should serve to impose security rules to automatically prevent issues such as personally identifiable information (PII) inadvertently exposed to LLMs. The solution must also counteract motivated, intentional misuse such as jailbreak attempts, where the LLM can be manipulated to provide often ridiculous answers with the ensuing screenshotting attempting to discredit the service. Beyond the security aspect and the overwhelming concern of regulated industries, other drivers include basic fiscal prudence 101, ensuring the token consumption of each offered LLM model is not out of hand. This entire discussion around observability and policy enforcement for LLM consumption has given rise to a class of solutions most frequently referred to as AI Firewalls or AI Gateways (AI GW). An AI FW might be leveraged by a browser plugin, or perhaps applying a software development kit (SDK) during the coding process for AI applications. Arguably, the most scalable and most easily deployed approach to inserting AI FW functionality into live traffic to LLMs is to use a reverse proxy. A modern approach includes the F5 Distributed Cloud service, coupled with an AI FW/GW service, cloud-based or self-hosted, that can inspect traffic intended for LLMs like those of OpenAI, Azure OpenAI, or privately operated LLMs like those downloaded from Hugging Face. A key value offered by this topology, a reverse proxy handing off LLM traffic to an AI FW, which in turn can allow traffic to reach target LLMs, stems from the fact that traffic is seen, and thus controllable, in both directions. Should an issue be present in a user’s submitted prompt, also known as an “inference”, it can be flagged: PII (Personally Identifiable Information) leakage is a frequent concern at this point. In addition, any LLM responses to prompts are also seen in the reverse path: consider a corrupted LLM providing toxicity in its generated replies. Not good. To achieve a highly performant reverse proxy approach to secured LLM access, a solution that can span a global set of users, F5 worked with Prompt Security to deploy an end-to-end AI security layer. This article will explore the efficacy and performance of the live solution. Impose LLM Guardrails with the AI Firewall and Distributed Cloud An AI firewall such as the Prompt Security offering can get in-line with AI LLM flows through multiple means. API calls from Curl or Postman can be modified to transmit to Prompt Security when trying to reach targets such as OpenAI or Azure OpenAI Service. Simple firewall rules can prevent employee direct access to these well-known API endpoints, thus making the Prompt Security route the sanctioned method of engaging with LLMs. A number of other methods could be considered but have concerns. Browser plug-ins have the advantage of working outside the encryption of the TLS layer, in a manner similar to how users can use a browser’s developer tools to clearly see targets and HTTP headers of HTTPS transactions encrypted on the wire. Prompt Security supports plugins. A downside, however, of browser plug-ins is the manageability issue, how to enforce and maintain across-the-board usage, simply consider the headache non-corporate assets used in the work environment. Another approach, interesting for non-browser, thick applications on desktops, think of an IDE like VSCode, might be an agent approach, whereby outbound traffic is handled by an on-board local proxy. Again, Prompt can fit in this model however the complexity of enforcement of the agent, like the browser approach, may not always be easy and aligned with complete A-to-Z security of all endpoints. One of the simplest approaches is to ingest LLM traffic through a network-centric approach. An F5 Distributed Cloud HTTPS load balancer, for instance, can ingest LLM-bound traffic, and thoroughly secure the traffic at the API layer, things like WAF policy and DDoS mitigations, as examples. HTTP-based control plane security is the focus here, as opposed to the encapsulated requests a user is sending to an LLM. The HTTPS load balancer can in turn hand off traffic intended for the likes of OpenAI to the AI gateway for prompt-aware inspections. F5 Distributed Cloud (XC) is a good architectural fit for inserting a third-party AI firewall service in-line with an organization’s inferencing requests. Simply project a FQDN for the consumption of AI services; in this article we used the domain name “llmsec.busdevF5.net” into the global DNS, advertising one single IP address mapping to the name. This DNS advertisement can be done with XC. The IP address, through BGP-4 support for anycast, will direct any traffic to this address to the closest of 27 international points of presence of the XC global fabric. Traffic from a user in Asia may be attracted to Singapore or Mumbai F5 sites, whereas a user in Western Europe might enter the F5 network in Paris or Frankfurt. As depicted, a distributed HTTPS load balancer can be configured – “distributed” reflects the fact traffic ingressing in any of the global sites can be intercepted by the load balancer. Normally, the server name indicator (SNI) value in the TLS Client Hello can be easily used to pick the correct load balancer to process this traffic. The first step in AI security is traditional reverse proxy core security features, all imposed by the XC load balancer. These features, to name just a few, might include geo-IP service policies to preclude traffic from regions, automatic malicious user detection, and API rate limiting; there are many capabilities bundled together. Clean traffic can then be selected for forwarding to an origin pool member, which is the standard operation of any load balancer. In this case, the Prompt Security service is the exclusive member of our origin pool. For this article, it is a cloud instantiated service - options exist to forward to Prompt implemented on a Kubernetes cluster or running on a Distributed Cloud AppStack Customer Edge (CE) node. Block Sensitive Data with Prompt Security In-Line AI inferences, upon reaching Prompt’s security service, are subjected to a wide breadth of security inspections. Some of the more important categories would include: Sensitive data leakage, although potentially contained in LLM responses, intuitively the larger proportion of risk is within the requesting prompt, with user perhaps inadvertently disclosing data which should not reach an LLM Source code fragments within submissions to LLMs, various programming languages may be scanned for and blocked, and the code may be enterprise intellectual property OWASP LLM top 10 high risk violations, such as LLM jailbreaking where the intent is to make the LLM behave and generate content that is not aligned with the service intentions; the goal may be embarrassing “screenshots”, such as having a chatbot for automobile vendor A actually recommend a vehicle from vendor B OWASP Prompt Injection detection, considered one of the most dangerous threats as the intention is for rogue users to exfiltrate valuable data from sources the LLM may have privileged access to, such as backend databases Token layer attacks, such as unauthorized and excessive use of tokens for LLM tasks, the so-called “Denial of Wallet” threat Content moderation, ensuring a safe interaction with LLMs devoid of toxicity, racial and gender discriminatory language and overall curated AI experience aligned with those productivity gains that LLMs promise To demonstrate sensitive data leakage protection, a Prompt Security policy was active which blocked LLM requests with, among many PII fields, a mailing address exposed. To reach OpenAI GPT3.5-Turbo, one of the most popular and cost-effective models in the OpenAI model lineup, prompts were sent to an F5 XC HTTPS load balancer at address llmsec.busdevf5.net. Traffic not violating the comprehensive F5 WAF security rules were proxied to the Prompt Security SaaS offering. The prompt below clearly involves a mailing address in the data portion. The ensuing prompt is intercepted by both the F5 and Prompt Security solutions. The first interception, the distributed HTTPS load balancer offered by F5 offers rich details on the transaction, and since no WAF rules or other security policies are violated, the transaction is forwarded to Prompt Security. The following demonstrates some of the interesting details surrounding the transaction, when completed (double-click to enlarge). As highlighted, the transaction was successful at the HTTP layer, producing a 200 Okay outcome. The traffic originated in the municipality of Ashton, in Canada, and was received into Distributed Cloud in F5’s Toronto (tr2-tor) RE site. The full details around the targeted URL path, such as the OpenAI /v1/chat/completions target and the user-agent involved, vscode-restclient, are both provided. Although the HTTP transaction was successful, the actual AI prompt was rejected, as hoped for, by Prompt Security. Drilling into the Activity Monitor in the Prompt UI, one can get a detailed verdict on the transaction (double-click). Following the yellow highlights above, the prompt was blocked, and the violation is “Sensitive Data”. The specific offending content, the New York City street address, is flagged as a precluded entity type of “mailing address”. Other fields that might be potentially blocking candidates with Prompt’s solution include various international passports or driver’s license formats, credit card numbers, emails, and IP addresses, to name but a few. A nice, time saving feature offered by the Prompt Security user interface is to simply choose an individual security framework of interest, such as GDPR or PCI, and the solution will automatically invoke related sensitive data types to detect. An important idea to grasp: The solution from Prompt is much more nuanced and advanced than simple REGEX; it invokes the power of AI itself to secure customer journeys into safe AI usage. Machine learning models, often transformer-based, have been fine-tuned and orchestrated to interpret the overall tone and tenor of prompts, gaining a real semantic understanding of what is being conveyed in the prompt to counteract simple obfuscation attempts. For instance, using printed numbers, such as one, two, three to circumvent Regex rules predicated on numerals being present - this will not succeed. This AI infused ability to interpret context and intent allows for preset industry guidelines for safe LLM enforcement. For instance, simply indicating the business sector is financial will allow the Prompt Security solution to pass judgement, and block if desired, financial reports, investment strategy documents and revenue audits, to name just a few. Similar awareness for sectors such as healthcare or insurance is simply a pull-down menu item away with the policy builder. Source Code Detection A common use case for LLM security solutions is identification and, potentially, blocking submissions of enterprise source code to LLM services. In this scenario, this small snippet of Python is delivered to the Prompt service: def trial(): return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500 sum(trial() for i in range(10_000)) / 10_000 A policy is in place for Python and JavaScript detection and was invoked as hoped for. curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3BlbkFJ5RFOI***********' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "def trial():\n return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500\n\nsum(trial() for i in range(10_000)) / 10_000"}]}' Content Moderation for Interactions with LLMs One common manner of preventing LLM responses from veering into undesirable territory is for the service provider to implement a detailed system prompt, a set of guidelines that the LLM should be governed by when responding to user prompts. For instance, the system prompt might instruct the LLM to serve as polite, helpful and succinct assistant for customers purchasing shoes in an online e-commerce portal. A request for help involving the trafficking of narcotics should, intuitively, be denied. Defense in depth has traditionally meant no single point of failure. In the above scenario, screening both the user prompt and ensuring LLM response for a wide range of topics leads to a more ironclad security outcome. The following demonstrates some of the topics Prompt Security can intelligently seek out; in this simple example, the topic of “News & Politics” has been singled out to block as a demonstration. Testing can be performed with this easy Curl command, asking for a prediction on a possible election result in Canadian politics: curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3Blbk*************' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "Who will win the upcoming Canadian federal election expected in 2025"}],"max_tokens": 250,"temperature": 0.7}' The response, available in the Prompt Security console, is also presented to the user. In this case, a Curl user leveraging the VSCode IDE. The response has been largely truncated for brevity, fields that are of interest is an HTTP “X-header” indicating the transaction utilized the F5 site in Toronto, and the number of tokens consumed in the request and response are also included. Advanced LLM Security Features Many of the AI security concerns are given prominence by the OWASP Top Ten for LLMs, an evolving and curated list of potential concerns around LLM usage from subject matter experts. Among these are prompt injection attacks and malicious instructions often perceived as benign by the LLM. Prompt Security uses a layered approach to thwart prompt injection. For instance, during the uptick in interest in ChatGPT, DAN (Do Anything Now) prompt injection was widespread and a very disruptive force, as discussed here. User prompts will be closely analyzed for the presence of the various DAN templates that have evolved over the past 18 months. More significantly, the use of AI itself allows the Prompt solution to recognize zero-day bespoke prompts attempting to conduct mischief. The interpretative powers of fine-tuned, purpose-built security inspection models are likely the only way to stay one step ahead of bad actors. Another chief concern is protection of the system prompt, the guidelines that reel in unwanted behavior of the offered LLM service, what instructed our LLM earlier in its role as a shoe sales assistant. The system prompt, if somehow manipulated, would be a significant breach in AI security, havoc could be created with an LLM directed astray. As such, Prompt Security offers a policy to compare the user provided prompt, the configured system prompt in the API call, and the response generated by the LLM. In the event that a similarity threshold with the system prompt is exceeded in the other fields, the transaction can be immediately blocked. An interesting advanced safeguard is the support for a “canary” word - a specific value that a well behaved LLM should never present in any response, ever. The detection of the canary word by the Prompt solution will raise an immediate alert. One particularly broad and powerful feature in the AI firewall is the ability to find secrets, meaning tokens or passwords, frequently for cloud-hosted services, that are revealed within user prompts. Prompt Security offers the ability to scour LLM traffic for in excess of 200 meaningful values. Just as a small representative sample of the industry’s breadth of secrets, these can all be detected and acted upon: Azure Storage Keys Detector Artifactory Detector Databricks API tokens GitLab credentials NYTimes Access Tokens Atlassian API Tokens Besides simple blocking, a useful redaction option can be chosen. Rather than risk compromise of credentials and obfuscated value will instead be seen at the LLM. F5 Positive Security Models for AI Endpoints The AI traffic delivered and received from Prompt Security’s AI firewall is both discovered and subjected to API layer policies by the F5 load balancer. Consider the token awareness features of the AI firewall, excessive token consumption can trigger an alert and even transaction blocking. This behavior, a boon when LLMs like the OpenAI premium GPT-4 models may have substantial costs, allows organizations to automatically shut down a malicious actor who illegitimately got hold of an OPENAI_API key value and bombarded the LLM with prompts. This is often referred to as a “Denial of Wallet” situation. F5 Distributed Cloud, with its focus upon the API layer, has congruent safeguards. Each unique user of an API service is tracked to monitor transactional consumption. By setting safeguards for API rate limiting, an excessive load placed upon the API endpoint will result in HTTP 429 “Too Many Request” in response to abusive behavior. A key feature of F5 API Security is the fact that it is actionable in both directions, and also an in-line offering, unlike some API solutions which reside out of band and consume proxy logs for reporting and threat detection. With the automatic discovery of API endpoints, as seen in the following screenshot, the F5 administrator can see the full URL path which in this case exercises the familiar OpenAI /v1/chat/completions endpoint. As highlighted by the arrow, the schema of traffic to API endpoints is fully downloadable as an OpenAPI Specification (OAS), formerly known as a Swagger file. This layer of security means fields in API headers and bodies can be validated for syntax, such that a field whose schema expects a floating-point number can see any different encoding, such as a string, blocked in real-time in either direction. A possible and valuable use case: allow an initial unfettered access to a service such as OpenAI, by means of Prompt Security’s AI firewall service, for a matter of perhaps 48 hours. After a baseline of API endpoints has been observed, the API definition can be loaded from any saved Swagger files at the end of this “observation” period. The loaded version can be fully pruned of undesirable or disallowed endpoints, all future traffic must conform or be dropped. This is an example of a “positive security model”, considered a gold standard by many risk-adverse organizations. Simply put, a positive security model allows what has been agreed upon through and rejects everything else. This ability to learn and review your own traffic, and then only present Prompt Security with LLM endpoints that an organization wants exposed is an interesting example of complementing an AI security solution with rich API layer features. Summary The world of AI and LLMs is rapidly seeing investment, in time and money, from virtually all economic sectors; the promise of rapid dividends in the knowledge economy is hard to resist. As with any rapid deployment of new technology, safe consumption is not guaranteed, and it is not built in. Although LLMs often suggest guardrails are baked into offerings, a 30-second search of the Internet will expose firsthand experiences where unexpected outcomes when invoking AI are real. Brand reputation is at stake and false information can be hallucinated or coerced out of LLMs by determined parties. By combining the ability to ingest globally at high-speed dispersed users and apply a first level of security protections, F5 Distributed Cloud can be leveraged as an onboarding for LLM workloads. As depicted in this article, Prompt Security can in turn handle traffic egressing F5’s distributed HTTPS load balancers and provide state-of-the-art AI safeguards, including sensitive data detection, content moderation and other OWASP-aligned mechanisms like jailbreak and prompt injection mitigation. Other deployment models exist, including deploying Prompt Security’s solution on-premises, self-hosted in cloud tenants, and running the solution on Distributed Cloud CE nodes themselves is supported.952Views4likes1CommentSecurely Scale RAG - Azure OpenAI Service, F5 Distributed Cloud and NetApp
Arguably, the easiest and most massively scalable approach to harnessing Large Language Models (LLMs) is to consume leading services like OpenAI endpoints, the most well-known of cloud-based offering delivered to enterprises over the general Internet. Access to hardware, such as GPUs, and the significant skillset to run LLMs on your own become non-issues, consumption is simply an API call away. One concern, and a serious one, is that sensitive inferencing (AI prompts, both the requests and responses) travels "in the wild" to these LLMs found through DNS at public endpoints. Retrieval Augmented Generation (RAG) adds potentially very sensitive corporate data to prompts, to leverage AI for internal use cases, thus ratcheting up even further the uneasiness with using the general Internet as a conduit to reach LLMs. RAG is a popular method to greatly increase the accuracy and relevancy of generative AI for a company’s unique set of problems. Finally, to leverage sensitive data with RAG, the source documents must be vectorized with similarly remote “embedding” LLMS; once again sensitive, potentially proprietary sensitive data will leave the corporate premises to leverage the large AI solutions like OpenAI or Azure OpenAI. Unlike purveyors of locally executed models, say a repository like Huggingface.com, which allow downloading of binaries to be harnessed on local compute, industry leading solutions like OpenAI and Azure OpenAI Service are founded on the paradigm of remote compute. Beyond the complexity and resources of quickly and correctly setting up performant on-prem models one time, the choice to consume remote endpoints allows hassle-free management like models perpetually updated to latest revisions and full white-glove support available to enterprise customers consuming SaaS AI models. In this article, an approach will be presented where, using F5 Distributed Cloud (XC) and NetApp, Azure OpenAI Service can be leveraged with privacy, where prompts are carried over secured, encrypted tunnels over XC between on-premises enterprise locations and that enterprise’s Azure VNET. The Azure OpenAI models are then exclusively exposed as private endpoints within that VNET, nowhere else in the world. This means both the embedding LLM activity to vectorize sensitive corporate data, and the actual generative AI prompts to harness value from that data are encrypted in flight. All source data and resultant vector databases remain on-premises in well-known solutions like a NetApp ONTAP storage appliance. Why is the Azure OpenAI Service a Practical Enabler of AI Projects? Some of the items that distinguish Azure OpenAI Service include the following: Prompts sent to Azure OpenAI are not forwarded to OpenAI, the service exists within Microsoft Azure, benefiting from the performance of Microsoft’s enormous cloud computing platform Customer prompts are never used for training data to build new or refine existing models Simplified billing, think of the Azure OpenAI Service as analogous to an “all you can eat buffet”, simply harness the AI service and settle the charge incurred on a regular monthly billing cycle With OpenAI, models are exposed at universal endpoints shared by a global audience, added HTTP headers such as the OPENAI_API_KEY value distinguish users and allow billing to occur in accordance with consumption. Azure OpenAI Service is slightly different. No models actually exist to be used until they are setup under an Azure subscription. At this point, beyond receiving an API key to identify the source user, the other major difference is unique API "base" URL (endpoint) is setup for accessing LLMs an organization wishes to use. Examples would be a truly unique enterprise endpoint for GPT-3.5-Turbo, GPT4 or perhaps an embedding LLM used in vectorization, such as the popular text-embedding-ada-002 LLM. This second feature of Azure OpenAI Service presents a powerful opportunity to F5 Distributed Cloud (XC) customers. This stems from the fact that unlike traditional OpenAI, this per-organization, unique base URL for API communications does not have to be projected into the global DNS, reachable from anywhere on the Internet. Instead, Microsoft Azure allows the OpenAI service to be constrained to a private endpoint, accessible only from where the customer chooses. Leveraging F5 XC Multicloud Networking offers a way to secure and encrypt communications between on-premises locations and Azure subnets only available from within the organization. What does this add up to for the enterprise with generative AI projects? It means huge scalability for AI services and consuming the very much leading-edge modern OpenAI models, all in a simple manner an enterprise can now consume today with limited technical onus on corporate technology services. The sense of certainty that sensitive data is not cavalierly exposed on the Internet is a critical cog in the wheel of good data governance. Tap Into Secure Data from NetApp ONTAP Clusters for Fortified Access to OpenAI Models The F5 Distributed Cloud global fabric consists of points of presence in 26+ metropolitan markets worldwide, such as Paris, New York, Singapore, that are interconnected with high-speed links aggregating to more than 14 Tbps of bandwidth in total, it is growing quarterly. With the F5 multicloud networking (MCN) solution, customers can easily set up dual-active encrypted tunnels (IPSec or SSL) to two points on the global fabric. The instances connected to are referred to as RE’s (Regional Edge nodes) and the customer-side sites are made up of CE’s (Customer Edge nodes, scalable from one to a full cluster). The service is a SaaS solution and setup is turn-key based upon menu click-ops or Terraform. The customer sites, beyond being in bricks-and-mortar customer data centers and office locations, can also exist within cloud locations such as Microsoft Azure Resource Groups or AWS VPCs, among others. Enterprise customers with existing bandwidth solutions may choose to directly interconnect sites as opposed to leveraging the high-speed F5 global fabric. The net result of an F5 XC Layer 3 multicloud network is high-speed, encrypted communications between customer sites. By disabling the default network access provided by Azure OpenAI Service, and only allowing private endpoint access, one can instantiate a private approach to running workloads with well-known OpenAI models. With this deployment in place, customers may tap into years of data acquired and stored on trusted on-premises NetApp storage appliances to inject value into AI use cases, customized and enhanced inference results using well-regarded, industry-leading OpenAI models. A perennial industry leader in storage is ONTAP from NetApp, a solution that can safely expose volumes to file systems, through protocols such as NFS and SMB/CIFS. The ability to also expose LUNs, meaning block-level data that constitutes remote disks, is also available using protocols like iSCSI. In the preceding diagram, one can leverage AI through a standard Python approach, in the case shown harnessing an Ubuntu Linux server, and volumes provided by ONTAP. AI jobs, rather than calling out to an Internet-routed Azure OpenAI public endpoint can instead interact with a private endpoint, one which resolves through private DNS to an address on a subnet behind a customer Azure CE node. This endpoint cannot be reached from the Internet, it is restricted to only communicating with customer subnets (routes) located in the L3 multicloud deployment. In use cases that leverage one’s own data, a leading approach is Retrieval Augmented Generation (RAG) in order to empower Large Language Models (LLMs) to deliver niche, hyper-focused responses pertaining to specialized, sometimes proprietary, documents representing the corporate body of knowledge. Simple examples might include highly detailed, potentially confidential, company-specific information distilled from years of financial internal reporting. Another prominent early use case of RAG is to backstop frontline, customer helpdesk employees. With customers sensitive to delays in handling support requests, and pressure to reduce support staff research delays, the OpenAI LLM can harvest only relevant knowledge base (KB) articles, releases notes, and private engineering documents not normally exposed in their entirety. The net result is a much more effective helpdesk experience, with precise, relevant help provided to the support desk employee in seconds. RAG Using Microsoft Azure OpenAI, F5 and NetApp in Nutshell In the sample deployment, one of the more important items to recognize is that two OpenAI models will be harnessed, an embedding LLM and a generative transformer based GPT family LLM. A simple depiction of RAG would be as follows: Using OpenAI Embedding LLMs The OpenAI embedding modeltext-embedding-ada-002 is used first to vectorize data sourced from the on-premises ONTAP system, via NFS volumes mounted to the server hosting Python. The embedding model consumes “chunks” of text from each sourced document and converts the text to numbers, specifically long sequences of numbers, typically in the range of 700 to 1,500 values. These are known as vectors. The vectors returned in the private OpenAI calls are then stored in a vector database, in this case ChromaDB was used. It is important to note, the ChromaDB itself was directed to install itself within a volume supported by the on-premises ONTAP cluster, as such the content at rest is governed by the same security governance as the source content in its native format. Other common industry solutions for vector storage and searches include Milvus and for those looking to cloud-hosted vectors Pinecone. Vector databases are purpose-built to manage vector embeddings. Conventional databases can, in fact, store vectors but the art of doing a semantic search, finding similarities between vectors, would then require vector indices solutions. One of the best known in FAISS (Facebook AI Similarity Search) which is a library that allows developers to quickly search for embeddings of multimedia documents. These semantic searches would otherwise be inefficient or impossible with standard database engines (SQL). When a prompt is first generated by a client, the text in the prompt is vectorized by the very same OpenAI embedding model, producing a vector on the fly. The key to RAG, the “retriever” function, then compares the newly arrived query with semantically similar text chunks in the database. The actual semantic similarity of the query and previously stored chunks is arrived at through a nearest neighbor search of the vectors, in other words, phrases and sentences that might augment the original prompt can be provided to the OpenAI GPT model. The art of finding semantic similarities relies upon comparing the lengthy vectors. The objective, for instance, to find supportive text around the user query “how to nurture shrub growth” might reasonably align more closely with a previously vectorized paragraph that included “gardening tips for the North American spring of 2024” and less so with vectorized content stemming from a user guide for the departmental photocopy machine. The suspected closeness of vectors, are text samples actually similar topic wise, is a feature of semantic similarity search algorithms, many exist in themarketplace and two approaches commonly leveraged are cosine similarity and Euclidean distance; a brief description for those interested can be found here. The source text chunks corresponding to vectors are retained in the database and it is this source text that augments the prompt after the closest neighbor vectors are calculated. Using OpenAI GPT LLMs Generative Pre-trained Transformer (GPT) refers to a family of LLMs created by OpenAI that are built on atransformer architecture. The specific OpenAI model used in this model is not necessarily the latest, premium model, GPT-4o and GPT-4 Turbo are more recent, however the utilized gpt-35-turbo model is a good intersection of price versus performance and has been used extensively in deployed projects. With the retriever function helping to build an augmented prompt, the default use case documented included three text chunks to buttress the original query. The OpenAI prompt response will not only be infused with the provided content extracted from the customer but unlike normal GPT responses, RAG will have specific attributions to which documents and specific paragraphs led to the response. Brief Overview of Microsoft OpenAI Service Setup Microsoft Azure has a long history of adding innovative new functions as subscribed “opt in” service resources, the Azure OpenAI Service is no different. A thorough, step-by-step guide to setting up the OpenAI service can be found here. This screenshot demonstrates the rich variety of OpenAI models available within Azure, specifically showing the Azure OpenAI Studio interface, highlighting models such as gpt-4, gpt-4o and dall-e-3. In this article, two models are added, one embedding and the other GPT. The following OpenAI Service Resource screen shows the necessary information to actually use our two models. This information consists of the keys (use either KEY1 and KEY2, both can be seen and copied with the Show Keys button) and the unique, per customer endpoint path, frequently referred to as the base URL by OpenAI users. Perhaps the key Azure feature that empowers this article is the ability to disable network access to the configured OpenAI model, as seen below. With traditional network access disabled, we can then enable private endpoint access and set the access point to a network interface on the private subnet connected to the inside interface of our F5 Distributed Cloud CE node. The following re-visits the earlier topology diagram, with focus upon where the Azure OpenAI service interacts with our F5 Distributed Cloud multicloud network. The steps involved in setting up an Azure site in F5 Distributed Cloud are found here. The corresponding steps for configuring an on-premises Distributed Cloud site are found in this location. Many options exist, such as using KVM or a bare metal server, the link provided highlights the VMware ESXi approach to on-premises site creation. Demonstrating RAG in Action using OpenAI Models with a Secure Private Endpoint The RAG setup, in lieu of vectorizing actual private and sensitive documents, utilized the OpenAI embedding LLM to process chunks taken from the classic H.G. Wells 1895 science fiction novel “The Time Machine” in text or markdown format. The novel is one of many in the public domain through the Gutenberg Project. Two NFS folders supported by the NetApp ONTAP appliance in a Redmond, Washington office were used: one for source content and one for supporting the ChromaDB vector database. The NFS mounts are seen below, with the Megabytes consumed and remaining available seen per volume, the ONTAP address can be seen as 10.50.0.220. (Linux Host) #df -h 10.50.0.220:/RAG_Source_Documents_2024 1.9M 511M 1% /mnt/rag_source_files 10.50.0.220:/Vectors 17M 803M 3% /home/sgorman/langchain-rag-tutorial-main/chroma2 The creation of the vector database was handled by one Python script and the actual AI prompts generated against the OpenAI gpt-35-turbo model were housed in another script. This may often make sense, as the vector database creation may be an infrequently run script, only executed when new source content is introduced (/mnt/rag_source_files) whereas the generative AI tasks targeting gpt-3.5-turbo are likely run continuously for imperative business needs like helpdesk or code creations, as example purposes. Creating the vector database first entails preparing the source text, typically remove extraneous formatting or less than valuable text fields, think of boilerplate statements such as repetitive footnotes or perhaps copyright/privacy statements that might be found on every single page of some corporate documents. The next step is to create text chunks for embedding, the tradeoff of using too short chunks will be lack of semantic meaning in any one chunk and a growth in the vector count. Using overly long chunks, on the other hand, could lead to lengthy augmented prompts sent to gpt-35-turbo that significantly grow the token count for requests, although many models now support very large token counts a common value remains a total, for requests and responses, of 4,096 tokens. Token counts are the foundation for most billing formulae of endpoint-based AI models. Finally, it is important to have some degree of overlap of generated chunks such that meanings and themes within documents are not lost; if an idea is fragmented at the demarcation point of adjacent chunks the model may not pickup on its importance. The vectorization script for “The Time Machine” resulted in 978 chunks being created from the source text, with character counts per chunk not to exceed 300 characters. The text splitting function is loaded from LangChain and the pertinent code lines include: from langchain.text_splitter import RecursiveCharacterTextSplitter text_splitter = RecursiveCharacterTextSplitter( chunk_size=300, chunk_overlap=100, } The values of 100 characters of overlaps suggests each chunk will incorporate 200 characters of new text within the total of 300 grabbed. It is important to remember all characters, even white space, count towards totals. As per the following screenshot, the source novel, when split into increments of 200 new characters per chunk does indicate 978 chunks were indeed a correct total (double click to expand). With the source data vectorized and secure on the NetApp appliance, the actual use of the gpt-35-turbo OpenAI model could commence. The following shows an example, where the model is instructed in the system prompt to only respond via information it can glean from the RAG augmented prompt text, the response portions shown in red font. python3 create99.py “What is the palace of green porcelain?” <response highlights below, the response also included the full text chunks RAG suggested would potentially support the LLM in answering the posed question> Answer the question based on the above context: What is the palace of green porcelain? Response: content='The Palace of Green Porcelain is a deserted and ruined structure with remaining glass fragments in its windows and corroded metallic framework.' response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 175, 'total_tokens': 200}, 'model_name': 'gpt-35-turbo', The response of gpt-35-turbo is correct, and we see that the token consumption is heavily slanted towards the request (the “prompt”), with 175 tokens used whereas the response required only 25 tokens. The key takeaways are that the prompt and its response did not travel hop-by-hop over the Internet to a public endpoint, all traffic traveled with VPN-like security from the on-premises server and ONTAP to a private Azure subnet using F5 Distributed Cloud. The OpenAI model was utilized as a private endpoint, corresponding a network interface available only on that private subnet and not found within the global DNS, only the private corporate DNS or /etc/hosts files. Adding Laser Precision to RAG Using the default chunking strategy did lead to sub-optimal results, when ideas, themes and events were lost across chunk boundaries, even when including some degree of overlap. The following is one example: A key moment in the H.G. Wells book involves the protagonist meeting a character Weena, who provides strange white flowers which are pocketed. Upon returning to the present time, the time traveler relies upon the exotic and foreign look of the white flowers to attempt to prove to friends the veracity of his tale. # python3 query99.py “What did Weena give the Time Traveler?” As captured in the response below, the chunks provided by RAG do not provide all the details, only that something of note was pocketed, but gpt-35-turbo can therefore not return a sufficient answer as the full details are not provided in the augmented prompt. The screenshot shows first the three chunks and at the end the best answer the LLM could provide (double click to expand). The takeaway is that some effort will be required to adjust the vectorization process to pick optimally large chunk sizes, and sufficient numbers to properly empower the OpenAI model. In this demonstration, based upon vectors and their corresponding text, only three text chunks were harnessed to augment the user prompt. By increasing this number to 5 or 10, and increasing each of the chunk sizes, all of course at the expense of token consumption, one would expect more accurate results from the LLM. Summary This article demonstrated a more secure approach to using OpenAI models as a programmatic endpoint service in which proprietary company information can be kept secure by not using the general purpose, insecure Internet to provide prompts for vectorization and general AI inquiries. Instead, an approach was followed where the Azure OpenAI service was deployed as a private endpoint, exclusively available at an address on a private subnet within an enterprise’s Azure subscription, a subnet with no external access. By utilizing F5 Distributed Cloud Multicloud Networking, existing corporate locations and data centers can be connected to that enterprise’s Azure resource groups and private, encrypted communications can take place between these networks, the necessary routing and tunneling technologies are deployed in a turn-key manner without requiring advanced network skillsets. When leveraging NetApp ONTAP as the continued enterprise storage solution, RAG deployments based upon Azure OpenAI service can continue to be managed and secured with well-developed storage administration skills. In this example, ONTAP housed both the source, sensitive enterprise content and the actual vector database resulting from interactions with the Azure OpenAI embedding LLM. Subsequent to a discussion on vectors and optimal chunking strategies, RAG was utilized to answer questions on private documents using the well-known OpenAI chat-35-turbo model.353Views2likes1CommentSecure, Deliver and Optimize Your Modern Generative AI Apps with F5
In this demo, Foo-Bang Chan explores how F5's solutions can help you implement, secure, and optimize your chatbots and other AI applications. This will ensure they perform at their best while protecting sensitive data. One of the AI frameworks showed is Enterprise Retrieval-Augmented Generation (RAG). This demo leverages F5 Distributed Cloud (XC) AppStack, Distributed Cloud WAAP, NGINX Plus as API Gateway, API-Discovery, API-Protection, LangChain, Vector databases, and Flowise AI.428Views6likes2CommentsSecuring and Scaling Hybrid Apps with F5 NGINX Part 4
In previous parts of our series, we learned that NGINX is superior to cloud load balancers for two reasons: Breaking free from vendor lock-in;NGINX is a solution applicable to any infrastructure/environment Cloud providers offer basic load balancers that route and encrypt traffic to endpoints. They lack in: Visibility; Logs, traces, and statistics Functionality; Advanced traffic management and security use cases (see part 2 and 3) Functionality is especially important when scaling the environment to multiple cluster groups. The bulk of this section will be addressing best practices in scaling the architecture in part 2 and 3. Below I will depict a reference architecture that replicates my Kubernetes cluster with an NGINX Ingress Controller deployment and NGINX Load Balancer with HA (High Availability). If you recall from the part 2 and 3 of our series, I configured many ZT use cases on my NGINX Plus external load balancer. I replicated my NGINX Plus external load balancer to an active-active HA setup with NGINX Plus based on keepalived and VRRP. The method of fully rolling out HA in production will vary slightly depending on my environment. Public Cloud If I am scaling the architecture in a public cloud environment, I can replicate the NGINX Plus load balancers with cloud auto-scaling groups and front them with F5 BIG-IP. I can also enable health monitors on my BIG-IP so that unresponsive connections will fail-over to healthy NGINX Plus load balancers. On-Premises If I am scaling my architecture on-prem, I can replicate NGINX Plus load balancers with additional bare metal machines or use a virtualization software of my choosing. The HA solution with NGINX Plus on-prem can be setup in three different modes: Active-Passive; One instance is active, and the other is redundancy. the VIP (Virtual IP) will switch over to the redundancy node when the master node fails. Active-Active; Both instances are active and serving traffic. Two VIPs are required, where each VIP is assigned to an instance. If one instance becomes unavailable, the assigned VIP will switch over to the other instance and vice versa. Active-Active-Passive; Adding an additional redundancy node to the Active-Active HA pair resulting in a three node cluster group. The redundancy node will switch on when both active nodes are down. Choosing between these modes will depend on my current priorities. Going with the active-passive model, I compromise efficiency for lower cost. The active node is prone to overloading while the redundant node is idle and mostly not serving traffic. Going with the active-active or active-active-passive model, I compromise cost for better efficiency. However, I will need two VIPs and a DNS load balancer (F5 GTM) fronting my NGINX HA cluster. The table below depicts the three models measured by cost, efficiency and scale. Cost Efficiency Scale Active-Passive Low Low Low Active-Active Medium High Medium Active-Active-Passive High Medium High If I have the money to spend and choose both efficiency and scale, then active-active or active-active-passive is the right choice. Synchronizing data across NGINX Plus Cluster Nodes Recalling the part 2 and 3, we went through several ZT use cases with the NGINX Plus load balancer. Many of these ZT use cases require shared memory zones to store data and authenticate/authorize users. When scaling out the Zero Trust Architecture with HA, the key-value shared memory zone should be synchronized between NGINX Plus instances to ensure consistency. Take for example a popular ZT use case; OIDC authentication. Tokens are stored in the Key-Value storage to examine users attempting access to protected back-end applications. We can extend our configuration and enable Key-Val zone sync with two additional steps: Open a TCP medium where key-value data is exchanged. Now that you can also enable SSL on the TCP medium for extra security Append the optional sync directive to enable synchronization of key-value shared memory zones defined in openid_connect_configuration.conf from part 2. Testing the Synchronization You can test/validate the synchronization by leveraging the NGINX Plus API to pull data from individual cluster nodes and comparing them. The data pulled from each cluster node should be identical. You can connect to NGINX cluster nodes via SSH and enter: $ curl http://localhost:8010/api/7/http/keyvals/oidc_access_tokens The response data from each NGINX cluster node will match when zone sync is enabled. Data Telemetry with NGINX Management Interfaces As my IT organization grows, so will my NGINX cluster groups. Inevitably, I will need a solution that manage complexities arising from expanding NGINX cluster groups to alternative regions and cloud environments. With NGINX management tools, you can: Track your NGINX inventory for common CVEs and expired certificates Stage/Push config templates to NGINX cluster groups Collect and aggregate metrics from NGINX cluster groups Use our Flexible Consumption Model (FCP) model Installation and Deployment Methods There are two ways to get started with NGINX management capabilities: F5 Distributed Cloud (XC);No installation or deployment is required. Simply log into XC and access the NGINX One SaaS console. Get started with NGINX One Early Access. Self-Managed Installation; You can deploy NGINX Instance Manager to comply with policies and regulations that make the use of a SaaS console not feasible, for example, for air gapped environments inaccessible from the public internet. You caninstall and manage your own NGINX Instance Manager deployments by following our documentation. Once signed into my NGINX SaaS console, I can install agents on my NGINX HA clusters pair to discover them. $ curl https://agent.connect.nginx.com/nginx-agent/install | DATA_PLANE_KEY="z" sh -s -- -y $ sudo systemctl start nginx-agent I can track my overall NGINX usage and telemetry from either the UI console or APIs. Under FCPs (Flexible Consumption Plans), consumption is measured yearly based on the number of managed instances. This model is becoming increasingly popular as customers increasingly opt for flexible pays-as-go licensing models. Setting up F5 BIG-IP I touched on 2 options to configure F5 BIG-IP depending on my cloud environment. In on-prem environments with active-active HA targets, I need to configure F5 DNS load balancing. I will now step through how to configure DNS load balancing on BIG-IP. The first step is to create a VS (Virtual Server) listening on UDP with service port 53 for DNS. Then I create a wide IP with name 'nginxdemo.f5lab.com'. This will be the domain I will use to connect to my BIG-IP DNS load balancer. If I click on the 'Pools' tab, I can see my gslbPool members, where each member will correspond to VIPs assigned to my NGINX Plus HA cluster nodes. I also need to create a datacenter with a Server List on the NGINX HA cluster nodes and the BIG-IP DNS system. Under DNS>>GSLB : Servers : Server List, I can start adding my NGINX members and BIG-IP DNS system. Note:In a public cloud environment, I typically will not need to configure GSLB on the BIG-IP. I can simply create a VirtualServer with HTTPS and service port 443 and NGINX Plus pool members with a health monitor for redundancy fail over. Conclusion As we progressed in our series, I expanded my architecture to address scalability concerns that inevitably surface to any business. However, the IT architecture of a business needs to be flexible and agile if it wants to thrive, especially in this modern competitive landscape. The solutions I presented in these series are fundamental building blocks that can technically be implemented anywhere. They enable organizations to quickly maneuver and seek out alternative options when current ones are no longer viable, which brings me to the topic of AI. How will enterprises adopt AI in the present and future? Ultimately it will come down to extending reference architectures (like the ones discussed in this series) with AI components (LLMs, Vector DBs, RAG, etc...). These components plug into the overall architecture to improve automation and efficiency in the overall business model. In the next series, we will discuss AI reference architectures with F5/NGINX.245Views0likes0Comments