llm

9 Topics

F5 and MinIO: AI Data Delivery for the hybrid enterprise
Introduction Modern application architectures demand solutions that not only handle exponential data growth but also enable innovation and drive business results. As AI/ML workloads take center stage in industries ranging from healthcare to finance, application designers are increasingly turning to S3-compliant object storage because of its ability to provide scalable management of unstructured data. Whether it’s for ingesting massive datasets, running iterative training models, or delivering high-throughput predictions, S3-compatible storage systems play a foundational role in supporting these advanced data pipelines. MinIO has emerged as a leader in this space, offering high-performance, S3-compatible object storage built for modern-scale applications. MinIO is designed to easily work with AI/ML workflows. It is lightweight and cloud-based, so it is a good choice for businesses that are building infrastructure to support innovation. From storing petabyte-scale datasets to providing the performance needed for real-time AI pipelines, MinIO delivers the reliability and speed required for data-intensive work. While S3-compliant storage like MinIO forms the backbone of data workflows, robust traffic management and application delivery capabilities are essential for ensuring continuous availability, secure pipelines, and performance optimization. F5 BIG-IP, with its advanced suite of traffic routing, load balancing, and security tools, complements MinIO by enabling organizations to address these challenges. Together, F5 and MinIO create a resilient, scalable architecture where applications and AI/ML systems can thrive. This solution empowers businesses to: Build secure and highly-available storage pipelines for demanding workloads. Ensure fast and reliable delivery of data, even at exascale. Simplify and optimize their infrastructure to drive innovation faster. In this article, we’ll explore how to leverage F5 BIG-IP and MinIO AIStor clusters to enable results-driven application design. Starting with an architecture overview, we’ll cover practical steps to set up BIG-IP to enhance MinIO’s functionality. Along the way, we’ll highlight how this combination supports modern AI/ML workflows and other business-critical applications. Architecture Overview To validate the solution of F5 BIG-IP and MinIO AIStor effectively, this setup incorporates a functional testing environment that simulates real-world behaviors while remaining controlled and repeatable. MinIO’s warp benchmarking tool is used for orchestrating and running tests across the architecture. The addition of benchmarking tools ensures that the functional properties of the stack (traffic management, application-layer security, and object storage performance) are thoroughly evaluated in a way that is reproducible and credible. The environment consists of: A F5 VELOS chassis with BX110 blades, running BIG-IP instances configured using F5’s AS3 extension, for traffic management and security policies using LTM (Local Traffic Manager) and ASM (Application Security Manager). A MinIO AIStor cluster consisting of four bare-metal nodes equipped with high-performance NVMe drives, bringing the environment close to real-world customer deployments. Three benchmarking nodes for orchestrating and running tests: One orchestration node directs the worker nodes with benchmark test configuration and aggregates test results. Two worker nodes run warp in client mode to simulate workloads against the MinIO cluster. Warp Benchmarking Tool The warp benchmarking tool (https://github.com/minio/warp) from MinIO is designed to simulate real-world S3 workloads while also generating measurable metrics about the testing environment. In this architecture: A central orchestration node is used to coordinate the benchmarking process. This ensures that each test is consistent and runs under comparable conditions. Two worker nodes running warp in client mode send simulated traffic to the F5 BIG-IP virtual server. These nodes act as workload generators, allowing for the simulation of read-heavy, write-heavy, or mixed object storage exercises. Warp’s distributed design enables the scaling of workload generation, ensuring that the MinIO backend is tested under real-world-like conditions. This three-node configuration ensures that benchmarking tests are distributed effectively. It also provides insights into object storage behavior, traffic management, and the impact of security enforcement in the environment. Traffic Management and Security with BIG-IP At the center of this setup is the F5 VELOS chassis, running BIG-IP instances configured to handle both traffic management (LTM) and application-layer security (ASM). The addition of ASM (Application Security Manager) ensures that the MinIO cluster is protected from malicious or malformed requests while maintaining uninterrupted service for legitimate traffic. Key functions of BIG-IP in this architecture include: Load Balancing: Avoid overloading specific MinIO nodes by using adaptive algorithms, ensuring even traffic distribution, and preventing bottlenecks and hotspots. Apply advanced load balancing methods like least connections, dynamic ratio, and least response time. These methods intelligently account for backend load and performance in real time, ensuring reliable and efficient resource utilization. SSL/TLS Termination: Terminate SSL/TLS traffic to offload encryption workloads from backend clients. Re-encryption is optionally enabled for secure communication to MinIO nodes, depending on performance and security requirements. Health Monitoring: Intelligently performs continuous monitoring of the availability and health of the backend MinIO nodes. It reroutes traffic away from unhealthy nodes as necessary and restores traffic as service is restored. Application-Layer Security: Protect the environment via Web Application Firewall (WAF) policies that block malicious traffic, including injection attacks, malformed API calls, and DDoS-style app-layer threats. BIG-IP acts as the gateway for all requests coming from S3 clients, ensuring that security, health checks, and traffic policies are all applied before requests reach the MinIO nodes. Traffic Flow Through the Full Architecture The test traffic flows through several components in this architecture, with BIG-IP and warp playing vital roles in managing and generating requests, respectively: Benchmark Orchestration: The warp orchestration node initiates tests and distributes workload configurations to the worker nodes. The warp orchestration node also aggregates test data results from the worker nodes. Warp manages benchmarking scenarios, such as read-heavy, write-heavy, or mixed traffic patterns, targeting the MinIO storage cluster. Simulated Traffic from Worker Nodes: Two worker nodes, running warp in client mode, generate S3-compatible traffic such as object PUT, GET, DELETE, or STAT requests. These requests are transmitted through the BIG-IP virtual server. The load generation simulates the kind of requests an AI/ML pipeline or data-driven application might send under production conditions. BIG-IP Processing: Requests from the worker nodes are received by BIG-IP, where they are subjected to: Traffic Control: LTM distributes the traffic among the four MinIO nodes while handling SSL termination and monitoring node health. Security Controls: ASM WAF policies inspect requests for signs of application-layer threats. Only safe, valid traffic is routed to the MinIO environment. Environment Configuration Prerequisites BIG-IP (physical or virtual) Hosts for the MinIO cluster, including configured operating systems (and scheduling systems if optionally selected) Hosts for the warp worker nodes and warp orchestration node, including configured operating systems All required networking gear to connect the BIG-IP and the nodes A copy of the AS3 template at https://github.com/f5businessdevelopment/terraform-kvm-minio/blob/main/as3manualtemplate.json A copy of the warp configuration file at https://github.com/minio/warp/blob/master/yml-samples/mixed.yml Step 1: Set up MinIO Cluster Follow MinIO’s install instructions at https://min.io/docs/minio/linux/index.html The link is for a Linux deployment but choose the deployment target that’s appropriate for your environment. Record the addresses and ports of the MinIO consoles and APIs configured in this step for use as input to the next steps. Step 2: Configure F5 BIG-IP for Traffic Management and Security Following the steps documented in https://github.com/f5businessdevelopment/terraform-kvm-minio/blob/main/MANUALAS3.md and using the template file downloaded from GitHub, create and apply an AS3 declaration to configure your BIG-IP. Step 3: Deploy and Configure MinIO Warp for Benchmarking Retrieve API Access and Secret keys Log into your MinIO Cluster and click the 'Access' icon Once in 'Access', click the 'Access Keys' button In ‘Access Keys’, click the ‘Create Access Keys’ button and follow the steps to create and record your access and secret key values. Update warp key and secret value In your warp configuration file, find the access-key and secret-key fields and update the values with those you recorded in the previous step. Update warp client addresses In your warp configuration file, find the warp-client field and update the value with the addresses of the worker nodes. Update warp s3 host address In your warp configuration file, find the host field and update the value with the address and port of the VIP listener on the BIG-IP Step 4: Verify and Monitor the Environment Start the warp on each of the worker nodes with the command warp client Once the warp clients respond that they are listening on the warp orchestrator node, start the warp benchmark test warp run test.yaml Replace test.yaml with the name of your configuration file Summary of Test Results Functional tests done in F5’s lab, using the method described above, show how the F5 + MinIO solution works and behaves. These results highlight important considerations that apply to both AI/ML pipelines and data repatriation workflows. This enables organizations to make informed design choices when deploying similar architectures. The testing goals were: Validate that BIG-IP security and traffic management policies function properly with MinIO AIStor in a simulated real-world configuration. Compare the impact of various load-balancing, security, and storage strategies to determine best practices. Test Methodology Four test configurations were executed to identify the effects of: Threads Per Worker: Testing both 1-thread and 20-thread configurations for workload generation. Multi-Part GETs and PUTs: Comparing scenarios with and without multi-part requests for better parallelization. BIG-IP Profiles: Evaluating Layer 7 (ASM-enabled security) versus Layer 4 (performance-optimized) profiles. Test Results Test Configuration Throughput Benefits 20 threads, multi-part, Layer 7 28.1 Gbps Security, High-performance reliability 20 threads, multi-part, Layer 4 81.5 Gbps High-performance reliability 1 thread, no multi-part, Layer 7 3.7 Gbps Security, Reliability 1 thread, no multi-part, Layer 4 7.8 Gbps Reliability Note: The testing results provide insights into the solution and behavior of this setup, though they are not intended as production performance benchmarks. Key Insights Multi-Part GETs and PUTs Are Critical for Throughput Optimization: Multi-part operations split objects into smaller parts for parallel processing. This allows the architecture to better utilize MinIO’s distributed storage capabilities and worker thread concurrency. Without multi-part GETs/PUTs, single-threaded configurations experienced severely reduced throughput. Recommendation: Ensure multi-part operations are enabled in applications or tools interacting with MinIO when handling large objects or high IOPS workloads. Balance Security with Performance: Layer 7 security provided by ASM is essential for sensitive data and workloads that interact with external endpoints. However, it introduces processing overhead. Layer 4 performance profiles, while lacking application-layer security features, deliver significantly higher throughput. Recommendation: Choose BIG-IP profiles based on specific workload requirements. For AI/ML data ingest and model training pipelines, consider enabling Layer 4 optimization during bulk read/write phases. For workloads requiring external access or high-security standards, deploy Layer 7 profiles. In some cases, consider horizontal scaling of load balancing and object storage tiers to add throughput capacity. Threads Per Worker Impact Throughput: Scaling up threads at the worker level significantly increased throughput in the lab environment. This demonstrates the importance of concurrency for demanding workloads. Recommendation: Optimize S3 client configurations for higher connection counts where workloads permit, particularly when performing bulk data transfers or operationally intensive reads. Example Use Cases Use Case #1: AI/ML Pipeline AI and machine learning pipelines rely heavily on storage systems that can ingest, process, and retrieve vast amounts of data quickly and securely. MinIO provides the scalability and performance needed for storage, while F5 BIG-IP ensures secure, optimized data delivery. Pipeline Workflow An enterprise running a typical AI/ML pipeline might include the following stages: Data Ingestion: Large datasets (e.g., images, logs, training corpora) are collected from various sources and stored within the MinIO cluster using PUT operations. Model Training: Data scientists iterate on AI models using the stored training datasets. These training processes generate frequent GET requests to retrieve slices of the dataset from the MinIO cluster. Model Validation and Inference: During validation, the pipeline accesses specific test data objects stored in the cluster. For deployed models, inference may require low-latency reads to make predictions in real time. How F5 and MinIO Support the Workflow This combined architecture enables the pipeline by: Ensuring Consistent Availability: BIG-IP distributes PUT and GET requests across the four nodes in the MinIO cluster using intelligent load balancing. With health monitoring, BIG-IP proactively reroutes traffic away from any node experiencing issues, preventing delays in training or inference. Optimizing Performance: NVMe-backed storage in MinIO ensures fast read and write speeds. Together with BIG-IP's traffic management, the architecture delivers reliable throughput for iterative model training and inference. Securing End-to-End Communication: ASM protects the MinIO storage APIs from malicious requests, including malformed API calls. At the same time, SSL/TLS termination secures communications between AI/ML applications and the MinIO backend. Use Case #2: Enterprise Data Repatriation Organizations increasingly seek to repatriate data from public clouds to on-premises environments. Repatriation is often driven by the need to reduce cloud storage costs, regain control over sensitive information, or improve performance by leveraging local infrastructure. This solution supports these workflows by pairing MinIO’s high-performance object storage with BIG-IP’s secure and scalable traffic management. Repatriation Workflow A typical enterprise data repatriation workflow may look like this: Bulk Data Migration: Data stored in public cloud object storage systems (e.g., AWS S3, Google Cloud Storage) is transferred to the MinIO cluster running on on-premises infrastructure using tools like MinIO Gateway or custom migration scripts. Policy Enforcement: Once migrated, BIG-IP ensures that access to the MinIO cluster is secured, with ASM enforcing WAF policies to protect sensitive data during local storage operations. Ongoing Storage Optimization: The migrated data is integrated into workflows like backup and archival, analytics, or data access for internal applications. Local NVMe drives in the MinIO cluster reduce latency compared to cloud solutions. How F5 and MinIO Support the Workflow This architecture facilitates the repatriation process by: Secure Migration: MinIO Gateway, combined with SSL/TLS termination on BIG-IP, allows data to be transferred securely from public cloud object storage services to the MinIO cluster. ASM protects endpoints from exploitation during bulk uploads. Cost Efficiency and Performance: On-premises MinIO storage eliminates expensive cloud storage costs while providing faster access to locally stored data. NVMe-backed nodes ensure that repatriated data can be rapidly retrieved for internal applications. Scalable and Secure Access: BIG-IP provides secure access control to the MinIO cluster, ensuring only authorized users or applications can use the repatriated data. Health monitoring prevents disruptions in workflows by proactively managing node unavailability. The F5 and MinIO Advantage Both use cases reflect the flexibility and power of combining F5 and MinIO: AI/ML Pipeline: Supports data-heavy applications and iterative processes through secure, high-performance storage. Data Repatriation: Empowers organizations to reduce costs while enabling seamless local storage integration. These examples provide adaptable templates for leveraging F5 and MinIO to solve problems relevant to enterprises across various industries, including finance, healthcare, agriculture, and manufacturing. Conclusion The collaboration of F5 BIG-IP and MinIO provides a high-performance, secure, and scalable architecture for modern data-driven use cases such as AI/ML pipelines and enterprise data repatriation. Testing in the lab environment validates the functionality of this solution, while highlighting opportunities for throughput optimization via configuration tuning. To bring these insights to your environment: Test multi-part configurations using tools like MinIO warp benchmark or production applications. Match BIG-IP profiles (Layer 4 or Layer 7) with the specific priorities of your workloads. Use these findings as a baseline while performing further functional or performance testing in your enterprise. The flexibility of this architecture allows organizations to push the boundaries of innovation while securing critical workloads at scale. Whether driving new AI/ML pipelines or reducing costs in repatriation workflows, the F5 + MinIO solution is well-equipped to meet the demands of modern enterprises. Further Content For more information about F5's partnership with MinIO, consider looking at the informative overview by buulam on DevCentral's YouTube channel. We also have the steps outlined in this video.
Mark_Menger
Jul 17, 2025 Place Technical Articles
569Views
2likes
0Comments
Secure AI RAG using F5 Distributed Cloud in Red Hat OpenShift AI and NetApp ONTAP Environment
Introduction Retrieval Augmented Generation (RAG) is a powerful technique that allows Large Language Models (LLMs) to access information beyond their training data. The “R” in RAG refers to the data retrieval process, where the system retrieves relevant information from an external knowledge base based on the input query. Next, the “A” in RAG represents the augmentation of context enrichment, as the system combines the retrieved relevant information and the input query to create a more comprehensive prompt for the LLM. Lastly, the “G” in RAG stands for response generation, where the LLM generates a response with a more contextually accurate output based on the augmented prompt as a result. RAG is becoming increasingly popular in enterprise AI applications due to its ability to provide more accurate and contextually relevant responses to a wide range of queries. However, deploying RAG can introduce complexity due to its components being located in different environments. For instance, the datastore or corpus, which is a collection of data, is typically on-premise for enhanced control over data access and management due to data security, governance, and compliance with regulations within the enterprise. Meanwhile, inference services are often deployed in the cloud for their scalability and cost-effectiveness. In this article, we will discuss how F5 Distributed Cloud can simplify the complexity and securely connect all RAG components seamlessly for enterprise RAG-enabled AI applications deployments. Specifically, we will focus on Network Connect, App Connect, and Web App & API Protection. We will demonstrate how these F5 Distributed Cloud features can be leveraged to secure RAG in collaboration with Red Hat OpenShift AI and NetApp ONTAP. Example Topology F5 Distributed Cloud Network Connect F5 Distributed Cloud Network Connect enables seamless and secure network connectivity across hybrid and multicloud environments. By deploying F5 Distributed Cloud Customer Edge (CE) at site, it allows us to easily establish encrypted site-to-site connectivity across on-premises, multi-cloud, and edge environment. Jensen Huang, CEO of NVIDIA, has said that "Nearly half of the files in the world are stored on-prem on NetApp.”. In our example, enterprise data stores are deployed on NetApp ONTAP in a data center in Seattle managed by organization B (Segment-B: s-gorman-production-segment), while RAG services, including embedding Large Language Model (LLM) and vector database, is deployed on-premise on a Red Hat OpenShift cluster in a data center in California managed by Organization A (Segment-A: jy-ocp). By leveraging F5 Distributed Cloud Network Connect, we can quickly and easily establish a secure connection for seamless and efficient data transfer from the enterprise data stores to RAG services between these two segments only: F5 Distributed Cloud CE can be deployed as a virtual machine (VM) or as a pod on a Red Hat OpenShift cluster. In California, we deploy the CE as a VM using Red Hat OpenShift Virtualization — click here to find out more on Deploying F5 Distributed Cloud Customer Edge in Red Hat OpenShift Virtualization: Segment-A: jy-ocp on CE in California and Segment-B: s-gorman-production-segment on CE in Seattle: Simply and securely connect Segment-A: jy-ocp and Segment-B: s-gorman-production-segment only, using Segment Connector: NetApp ONTAP in Seattle has a LUN named “tbd-RAG”, which serves as the enterprise data store in our demo setup and contains a collection of data. After these two data centers are connected using F5 XC Network Connect, a secure encrypted end-to-end connection is established between them. In our example, “test-ai-tbd” is in the data center in California where it hosts the RAG services, including embedding Large Language Model (LLM) and vector database, and it can now successfully connect to the enterprise data stores on NetApp ONTAP in the data center in Seattle: F5 Distributed Cloud App Connect F5 Distributed Cloud App Connect securely connects and delivers distributed applications and services across hybrid and multicloud environments. By utilizing F5 Distributed Cloud App Connect, we can direct the inference traffic through F5 Distributed Cloud's security layers to safeguard our inference endpoints. Red Hat OpenShift on Amazon Web Services (ROSA) is a fully managed service that allows users to develop, run, and scale applications in a native AWS environment. We can host our inference service on ROSA so that we can leverage the scalability, cost-effectiveness, and numerous benefits of AWS’s managed infrastructure services. For instance, we can host our inference service on ROSA by deploying Ollama with multiple AI/ML models: Or, we can enable Model Serving on Red Hat OpenShift AI (RHOAI). Red Hat OpenShift AI (RHOAI) is a flexible and scalable AI/ML platform builds on the capabilities of Red Hat OpenShift that facilitates collaboration among data scientists, engineers, and app developers. This platform allows them to serve, build, train, deploy, test, and monitor AI/ML models and applications either on-premise or in the cloud, fostering efficient innovation within organizations. In our example, we use Red Hat OpenShift AI (RHOAI) Model Serving on ROSA for our inference service: Once inference service is deployed on ROSA, we can utilize F5 Distributed Cloud to secure our inference endpoint by steering the inference traffic through F5 Distributed Cloud's security layers, which offers an extensive suite of features designed specifically for the security of modern AI/ML inference endpoints. This setup would allow us to scrutinize requests, implement policies for detected threats, and protect sensitive datasets before they reach the inferencing service hosted within ROSA. In our example, we setup a F5 Distributed Cloud HTTP Load Balancer (rhoai-llm-serving.f5-demo.com), and we advertise it to the CE in the datacenter in California only: We now reach our Red Hat OpenShift AI (RHOAI) inference endpoint through F5 Distributed Cloud: F5 Distributed Cloud Web App & API Protection F5 Distributed Cloud Web App & API Protection provides comprehensive sets of security features, and uniform observability and policy enforcement to protect apps and APIs across hybrid and multicloud environments. We utilize F5 Distributed Cloud App Connect to steer the inference traffic through F5 Distributed Cloud to secure our inference endpoint. In our example, we protect our Red Hat OpenShift AI (RHOAI) inference endpoint by rate-limiting the access, so that we can ensure no single client would exhaust the inference service: A "Too Many Requests" is received in the response when a single client repeatedly requests access to the inference service at a rate higher than the configured threshold: This is just one of the many security features to protect our inference service. Click here to find out more on Securing Model Serving in Red Hat OpenShift AI (on ROSA) with F5 Distributed Cloud API Security. Demonstration In a real-world scenario, the front-end application could be hosted on the cloud, or hosted at the edge, or served through F5 Distributed Cloud, offering flexible alternatives for efficient application delivery based on user preferences and specific needs. To illustrate how all the discussed components work seamlessly together, we simplify our example by deploying Open WebUI as the front-end application on the Red Hat OpenShift cluster in the data center in California, which includes RAG services. While a DPU or GPU could be used for improved performance, our setup utilizes a CPU for inferencing tasks. We connect our app to our enterprise data stores deployed on NetApp ONTAP in the data center in Seattle using F5 Distributed Cloud Network Connect, where we have a copy of "Chapter 1. About the Migration Toolkit for Virtualization" from Red Hat. These documents are processed and saved to the Vector DB: Our embedding Large Language Model (LLM) is Sentence-Transformers/all-MiniLM-L6-v2, and here is our RAG template: Instead of connecting to the inference endpoint on Red Hat OpenShift AI (RHOAI) on ROSA directly, we connect to the F5 Distributed Cloud HTTP Load Balancer (rhoai-llm-serving.f5-demo.com) from F5 Distributed Cloud App Connect: Previously, we asked, "What is MTV?“ and we never received a response related to Red Hat Migration Toolkit for Virtualization: Now, let's try asking the same question again with RAG services enabled: We finally received the response we had anticipated. Next, we use F5 Distributed Cloud Web App & API Protection to safeguard our Red Hat OpenShift AI (RHOAI) inference endpoint on ROSA by rate-limiting the access, thus preventing a single client from exhausting the inference service: As expected, we received "Too Many Requests" in the response on our app upon requesting the inference service at a rate greater than the set threshold: With F5 Distributed Cloud's real-time observability and security analytics from the F5 Distributed Console, we can proactively monitor for potential threats. For example, if necessary, we can block a client from accessing the inference service by adding it to the Blocked Clients List: As expected, this specific client is now unable to access the inference service: Summary Deploying and securing RAG for enterprise RAG-enabled AI applications in a multi-vendor, hybrid, and multi-cloud environment can present complex challenges. In collaboration with Red Hat OpenShift AI (RHOAI) and NetApp ONTAP, F5 Distributed Cloud provides an effortless solution that secures RAG components seamlessly for enterprise RAG-enabled AI applications.
Jennifer_Yeung
Feb 24, 2025 Place Technical Articles
486Views
1like
0Comments
Agentic RAG - Securing GenAI with F5 Distributed Cloud Services
Agentic RAG (Retrieval-Augmented Generation) enhances the capabilities of a GenAI chatbot by integrating dynamic knowledge retrieval into its conversational abilities, making it more context-aware and accurate. In this demo, I will focus on security aspect of the solution. This demonstration will highlight the various security measures implemented and enforced in our AI reference architecture for this Agentic RAG. F5 is a trusted leader in security, with a track record of delivering robust solutions for securing applications and networks. Recognized by many independent evaluations as a Leader in Web Application and API Security from IDC, SC Award, TrustRadius, EMA, and many more, F5 exemplifies excellence and innovation. These endorsements affirm F5’s expertise, reassuring organizations that their digital assets are protected by a capable, reputable partner that keeps pace with evolving security needs.
Foo-Bang_Chan
Feb 13, 2025 Place Technical Articles
340Views
2likes
0Comments
Securing Model Serving in Red Hat OpenShift AI (on ROSA) with F5 Distributed Cloud API Security
Learn how Red Hat OpenShift AI on ROSA and F5 Distributed Cloud API Security work together to protect generative AI model inference endpoints. This integration ensures robust API discovery, schema enforcement, LLM-aware threat detection, bot mitigation, sensitive data redaction, and continuous observability—enabling secure, compliant, and high-performance AI-driven experiences at scale.
Eric_Ji
Jan 02, 2025 Place Technical Articles
1.3KViews
5likes
4Comments
AI Security - LLM-DOS, and predictions of 2025 and beyond
Introduction Hello again, this article is part of AI security series. I have been discussing AI security along with the OWASP LLM top10. LLM01 and LLM02 were discussed in the "AI Security : Prompt Injection and Insecure Output Handling", and LLM03 and its basic concepts were discussed in the "Using ChatGPT for security and introduction of AI security". In this article, I am going to discuss LLM04. And, since we are almost at the end of the year 2024, I would like to present some discussions and predictions for AI security in 2025 and beyond. LLM04: Model Denial of Service LLM04 is relatively easy to understand for security engineers who is familiar with conventional cyber attack methods. Denial of Service (DoS) is a common method of cyber attack, in which a large amount of data is given to the server to make it unable to provide services and/or crash. DoS attacks usually aim to exhaust computational resources and block services rather than stealing data, but the disruption they cause can be used as a smokescreen for more malicious activities, such as data breaches or malware installation. DOS attack against LLM (LLM-DOS) is same. It aims to exhaust computational resources of DOS (like CPU/GPU usage) and block services (like responding to chat). LLM-DOS can be done in two ways. One is a simple LLM-DOS attack which is to mass input against the LLM's input, similar to a DOS attack against a server. This method, as described in this article, can deplete the LLM's resources, like CPU/GPU usages. If you call this as a simple DoS attack, in such a scenario would be to instruct the model to keep repeating Hello, but we see that relying only on natural instructions limits the output length, which is limited by the maximum length of the LLM's Supervised Fine-Tuning (SFT) data The another method of LLM-DOS is to include code in the input that over-consumes resources. Denial-of-Service Poisoning Attacks on Large Language Models is discussing this. In the paper, this is called as a poisoning-based DoS (P-DoS) attack and it demonstrates that the output length limit can be broken by injecting a single poisoning sample designed for DoS purposes. Experiments reveal that an attacker can easily compromise models such as GPT-4o and GPT-4o mini by injecting a single poisoning sample through the OpenAI API at a minimal cost of less than $1. To understand this, it is easier to think about simple programming - for example, if you put an inescapable loop statement in your code, it can hang the computer (in fact, the IDE will warn you before it compiles). And if the network does not have a Spanning Tree Protocol, it will loop and hang the router. So same things happens on prompt injection. When using this idea LLLM-DOS, we must consider that such input should be blacklisted, so the simple way of using inescapable loop is impossible. Also, even if it is possible against WhiteBox, but we do not know what kind of attack is possible in BlackBox. However, according to "Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings", a Prompt input to the BlackBox can generate multiple sub-prompts (e.g., 25 sub-prompts). Its experiments show that the delay could be increased by a factor of 250. Given these serious safety concerns, researchers advocate further research aimed at defending against LLM-DoS threats in custom fine tuning of aligned LLMs. What will happen in 2025 and beyond? Some news site predicts an intensifying AI arms race in coming year. I would like to share an article on AI security predictions for the coming year and beyond. According to an article by EG Secure solutions, the generative AI makes it possible to create a malware without specialized skills, that makes easier to do cyber attacks. Thus, the article predicted that cyber attacks by malware created by generative AI would increase. The article also points out that LLM-generated applications such as RAGs are being used, but their code may contain vulnerabilities, and that will be another threat in 2025 and beyond. McAfee has released "McAfee Unveils 2025 Cybersecurity Predictions: AI-Powered Scams and Emerging Digital Threats Take Center Stage". According to the article, cyber attacks by malicious attackers will be highly optimized by generative AI, and the quality of DeepFake and AI-generated images/videos will increase, making it difficult to determine whether they are created by humans or generative AI. Thus it is expected that fake emails generated by generative AI, such as phishing emails, will also become harder to distinguish from real emails. Furthermore, the article points out that malware which is using (maybe created by) generative AI will become more sophisticated, thereby breaking through conventional security defense systems and may succeed to extracting personal information and sensitive data. Finally, the "Infosec experts divided on AI's potential to assist red teams" discusses the pros and cons of using generative AI for red teaming, one type of security audit. According to the article, the benefit of using generative AI is that it accelerates threat detection by allowing AI to scour multiple data feeds, applications, and other sources of performance data and run them as part of a larger automated workflow. On the other hand, the article also argues that using generative AI for red teaming is still limited, because the vulnerability discovery process by AI is a black box so the pen-tester cannot explain how they discovered to their clients.
Koichi
Jan 02, 2025 Place Technical Articles
974Views
1like
0Comments
Mitigate OWASP LLM Security Risk: Sensitive Information Disclosure Using F5 NGINX App Protect
This short WAF security article covered the critical security gaps present in current generative AI applications, emphasizing the urgent need for robust protection measures in LLM design deployments. Finally we also demonstrated how F5 Nginx App Protect v5 offers an effective solution to mitigate the OWASP LLM Top 10 risks.
Janibasha
Oct 09, 2024 Place Technical Articles
573Views
3likes
0Comments
AI Security : Prompt Injection and Insecure Output Handling.
In this article, I discuss about 2 of the major LLM attack methodologies and introduce some papers about AI security.
Koichi
Oct 04, 2024 Place Security Insights
447Views
1like
0Comments
Securing the LLM User Experience with an AI Firewall
As artificial intelligence (AI) seeps into the core day-to-day operations of enterprises, a need exists to exert control over the intersection point of AI-infused applications and the actual large language models (LLMs) that answer the generated prompts. This control point should serve to impose security rules to automatically prevent issues such as personally identifiable information (PII) inadvertently exposed to LLMs. The solution must also counteract motivated, intentional misuse such as jailbreak attempts, where the LLM can be manipulated to provide often ridiculous answers with the ensuing screenshotting attempting to discredit the service. Beyond the security aspect and the overwhelming concern of regulated industries, other drivers include basic fiscal prudence 101, ensuring the token consumption of each offered LLM model is not out of hand. This entire discussion around observability and policy enforcement for LLM consumption has given rise to a class of solutions most frequently referred to as AI Firewalls or AI Gateways (AI GW). An AI FW might be leveraged by a browser plugin, or perhaps applying a software development kit (SDK) during the coding process for AI applications. Arguably, the most scalable and most easily deployed approach to inserting AI FW functionality into live traffic to LLMs is to use a reverse proxy. A modern approach includes the F5 Distributed Cloud service, coupled with an AI FW/GW service, cloud-based or self-hosted, that can inspect traffic intended for LLMs like those of OpenAI, Azure OpenAI, or privately operated LLMs like those downloaded from Hugging Face. A key value offered by this topology, a reverse proxy handing off LLM traffic to an AI FW, which in turn can allow traffic to reach target LLMs, stems from the fact that traffic is seen, and thus controllable, in both directions. Should an issue be present in a user’s submitted prompt, also known as an “inference”, it can be flagged: PII (Personally Identifiable Information) leakage is a frequent concern at this point. In addition, any LLM responses to prompts are also seen in the reverse path: consider a corrupted LLM providing toxicity in its generated replies. Not good. To achieve a highly performant reverse proxy approach to secured LLM access, a solution that can span a global set of users, F5 worked with Prompt Security to deploy an end-to-end AI security layer. This article will explore the efficacy and performance of the live solution. Impose LLM Guardrails with the AI Firewall and Distributed Cloud An AI firewall such as the Prompt Security offering can get in-line with AI LLM flows through multiple means. API calls from Curl or Postman can be modified to transmit to Prompt Security when trying to reach targets such as OpenAI or Azure OpenAI Service. Simple firewall rules can prevent employee direct access to these well-known API endpoints, thus making the Prompt Security route the sanctioned method of engaging with LLMs. A number of other methods could be considered but have concerns. Browser plug-ins have the advantage of working outside the encryption of the TLS layer, in a manner similar to how users can use a browser’s developer tools to clearly see targets and HTTP headers of HTTPS transactions encrypted on the wire. Prompt Security supports plugins. A downside, however, of browser plug-ins is the manageability issue, how to enforce and maintain across-the-board usage, simply consider the headache non-corporate assets used in the work environment. Another approach, interesting for non-browser, thick applications on desktops, think of an IDE like VSCode, might be an agent approach, whereby outbound traffic is handled by an on-board local proxy. Again, Prompt can fit in this model however the complexity of enforcement of the agent, like the browser approach, may not always be easy and aligned with complete A-to-Z security of all endpoints. One of the simplest approaches is to ingest LLM traffic through a network-centric approach. An F5 Distributed Cloud HTTPS load balancer, for instance, can ingest LLM-bound traffic, and thoroughly secure the traffic at the API layer, things like WAF policy and DDoS mitigations, as examples. HTTP-based control plane security is the focus here, as opposed to the encapsulated requests a user is sending to an LLM. The HTTPS load balancer can in turn hand off traffic intended for the likes of OpenAI to the AI gateway for prompt-aware inspections. F5 Distributed Cloud (XC) is a good architectural fit for inserting a third-party AI firewall service in-line with an organization’s inferencing requests. Simply project a FQDN for the consumption of AI services; in this article we used the domain name “llmsec.busdevF5.net” into the global DNS, advertising one single IP address mapping to the name. This DNS advertisement can be done with XC. The IP address, through BGP-4 support for anycast, will direct any traffic to this address to the closest of 27 international points of presence of the XC global fabric. Traffic from a user in Asia may be attracted to Singapore or Mumbai F5 sites, whereas a user in Western Europe might enter the F5 network in Paris or Frankfurt. As depicted, a distributed HTTPS load balancer can be configured – “distributed” reflects the fact traffic ingressing in any of the global sites can be intercepted by the load balancer. Normally, the server name indicator (SNI) value in the TLS Client Hello can be easily used to pick the correct load balancer to process this traffic. The first step in AI security is traditional reverse proxy core security features, all imposed by the XC load balancer. These features, to name just a few, might include geo-IP service policies to preclude traffic from regions, automatic malicious user detection, and API rate limiting; there are many capabilities bundled together. Clean traffic can then be selected for forwarding to an origin pool member, which is the standard operation of any load balancer. In this case, the Prompt Security service is the exclusive member of our origin pool. For this article, it is a cloud instantiated service - options exist to forward to Prompt implemented on a Kubernetes cluster or running on a Distributed Cloud AppStack Customer Edge (CE) node. Block Sensitive Data with Prompt Security In-Line AI inferences, upon reaching Prompt’s security service, are subjected to a wide breadth of security inspections. Some of the more important categories would include: Sensitive data leakage, although potentially contained in LLM responses, intuitively the larger proportion of risk is within the requesting prompt, with user perhaps inadvertently disclosing data which should not reach an LLM Source code fragments within submissions to LLMs, various programming languages may be scanned for and blocked, and the code may be enterprise intellectual property OWASP LLM top 10 high risk violations, such as LLM jailbreaking where the intent is to make the LLM behave and generate content that is not aligned with the service intentions; the goal may be embarrassing “screenshots”, such as having a chatbot for automobile vendor A actually recommend a vehicle from vendor B OWASP Prompt Injection detection, considered one of the most dangerous threats as the intention is for rogue users to exfiltrate valuable data from sources the LLM may have privileged access to, such as backend databases Token layer attacks, such as unauthorized and excessive use of tokens for LLM tasks, the so-called “Denial of Wallet” threat Content moderation, ensuring a safe interaction with LLMs devoid of toxicity, racial and gender discriminatory language and overall curated AI experience aligned with those productivity gains that LLMs promise To demonstrate sensitive data leakage protection, a Prompt Security policy was active which blocked LLM requests with, among many PII fields, a mailing address exposed. To reach OpenAI GPT3.5-Turbo, one of the most popular and cost-effective models in the OpenAI model lineup, prompts were sent to an F5 XC HTTPS load balancer at address llmsec.busdevf5.net. Traffic not violating the comprehensive F5 WAF security rules were proxied to the Prompt Security SaaS offering. The prompt below clearly involves a mailing address in the data portion. The ensuing prompt is intercepted by both the F5 and Prompt Security solutions. The first interception, the distributed HTTPS load balancer offered by F5 offers rich details on the transaction, and since no WAF rules or other security policies are violated, the transaction is forwarded to Prompt Security. The following demonstrates some of the interesting details surrounding the transaction, when completed (double-click to enlarge). As highlighted, the transaction was successful at the HTTP layer, producing a 200 Okay outcome. The traffic originated in the municipality of Ashton, in Canada, and was received into Distributed Cloud in F5’s Toronto (tr2-tor) RE site. The full details around the targeted URL path, such as the OpenAI /v1/chat/completions target and the user-agent involved, vscode-restclient, are both provided. Although the HTTP transaction was successful, the actual AI prompt was rejected, as hoped for, by Prompt Security. Drilling into the Activity Monitor in the Prompt UI, one can get a detailed verdict on the transaction (double-click). Following the yellow highlights above, the prompt was blocked, and the violation is “Sensitive Data”. The specific offending content, the New York City street address, is flagged as a precluded entity type of “mailing address”. Other fields that might be potentially blocking candidates with Prompt’s solution include various international passports or driver’s license formats, credit card numbers, emails, and IP addresses, to name but a few. A nice, time saving feature offered by the Prompt Security user interface is to simply choose an individual security framework of interest, such as GDPR or PCI, and the solution will automatically invoke related sensitive data types to detect. An important idea to grasp: The solution from Prompt is much more nuanced and advanced than simple REGEX; it invokes the power of AI itself to secure customer journeys into safe AI usage. Machine learning models, often transformer-based, have been fine-tuned and orchestrated to interpret the overall tone and tenor of prompts, gaining a real semantic understanding of what is being conveyed in the prompt to counteract simple obfuscation attempts. For instance, using printed numbers, such as one, two, three to circumvent Regex rules predicated on numerals being present - this will not succeed. This AI infused ability to interpret context and intent allows for preset industry guidelines for safe LLM enforcement. For instance, simply indicating the business sector is financial will allow the Prompt Security solution to pass judgement, and block if desired, financial reports, investment strategy documents and revenue audits, to name just a few. Similar awareness for sectors such as healthcare or insurance is simply a pull-down menu item away with the policy builder. Source Code Detection A common use case for LLM security solutions is identification and, potentially, blocking submissions of enterprise source code to LLM services. In this scenario, this small snippet of Python is delivered to the Prompt service: def trial(): return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500 sum(trial() for i in range(10_000)) / 10_000 A policy is in place for Python and JavaScript detection and was invoked as hoped for. curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3BlbkFJ5RFOI***********' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "def trial():\n return 2_500 <= sorted(choices(range(10_000), k=5))[2] < 7_500\n\nsum(trial() for i in range(10_000)) / 10_000"}]}' Content Moderation for Interactions with LLMs One common manner of preventing LLM responses from veering into undesirable territory is for the service provider to implement a detailed system prompt, a set of guidelines that the LLM should be governed by when responding to user prompts. For instance, the system prompt might instruct the LLM to serve as polite, helpful and succinct assistant for customers purchasing shoes in an online e-commerce portal. A request for help involving the trafficking of narcotics should, intuitively, be denied. Defense in depth has traditionally meant no single point of failure. In the above scenario, screening both the user prompt and ensuring LLM response for a wide range of topics leads to a more ironclad security outcome. The following demonstrates some of the topics Prompt Security can intelligently seek out; in this simple example, the topic of “News & Politics” has been singled out to block as a demonstration. Testing can be performed with this easy Curl command, asking for a prediction on a possible election result in Canadian politics: curl --request POST \ --url https://llmsec.busdevf5.net/v1/chat/completions \ --header 'authorization: Bearer sk-oZU66yhyN7qhUjEHfmR5T3Blbk*************' \ --header 'content-type: application/json' \ --header 'user-agent: vscode-restclient' \ --data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "Who will win the upcoming Canadian federal election expected in 2025"}],"max_tokens": 250,"temperature": 0.7}' The response, available in the Prompt Security console, is also presented to the user. In this case, a Curl user leveraging the VSCode IDE. The response has been largely truncated for brevity, fields that are of interest is an HTTP “X-header” indicating the transaction utilized the F5 site in Toronto, and the number of tokens consumed in the request and response are also included. Advanced LLM Security Features Many of the AI security concerns are given prominence by the OWASP Top Ten for LLMs, an evolving and curated list of potential concerns around LLM usage from subject matter experts. Among these are prompt injection attacks and malicious instructions often perceived as benign by the LLM. Prompt Security uses a layered approach to thwart prompt injection. For instance, during the uptick in interest in ChatGPT, DAN (Do Anything Now) prompt injection was widespread and a very disruptive force, as discussed here. User prompts will be closely analyzed for the presence of the various DAN templates that have evolved over the past 18 months. More significantly, the use of AI itself allows the Prompt solution to recognize zero-day bespoke prompts attempting to conduct mischief. The interpretative powers of fine-tuned, purpose-built security inspection models are likely the only way to stay one step ahead of bad actors. Another chief concern is protection of the system prompt, the guidelines that reel in unwanted behavior of the offered LLM service, what instructed our LLM earlier in its role as a shoe sales assistant. The system prompt, if somehow manipulated, would be a significant breach in AI security, havoc could be created with an LLM directed astray. As such, Prompt Security offers a policy to compare the user provided prompt, the configured system prompt in the API call, and the response generated by the LLM. In the event that a similarity threshold with the system prompt is exceeded in the other fields, the transaction can be immediately blocked. An interesting advanced safeguard is the support for a “canary” word - a specific value that a well behaved LLM should never present in any response, ever. The detection of the canary word by the Prompt solution will raise an immediate alert. One particularly broad and powerful feature in the AI firewall is the ability to find secrets, meaning tokens or passwords, frequently for cloud-hosted services, that are revealed within user prompts. Prompt Security offers the ability to scour LLM traffic for in excess of 200 meaningful values. Just as a small representative sample of the industry’s breadth of secrets, these can all be detected and acted upon: Azure Storage Keys Detector Artifactory Detector Databricks API tokens GitLab credentials NYTimes Access Tokens Atlassian API Tokens Besides simple blocking, a useful redaction option can be chosen. Rather than risk compromise of credentials and obfuscated value will instead be seen at the LLM. F5 Positive Security Models for AI Endpoints The AI traffic delivered and received from Prompt Security’s AI firewall is both discovered and subjected to API layer policies by the F5 load balancer. Consider the token awareness features of the AI firewall, excessive token consumption can trigger an alert and even transaction blocking. This behavior, a boon when LLMs like the OpenAI premium GPT-4 models may have substantial costs, allows organizations to automatically shut down a malicious actor who illegitimately got hold of an OPENAI_API key value and bombarded the LLM with prompts. This is often referred to as a “Denial of Wallet” situation. F5 Distributed Cloud, with its focus upon the API layer, has congruent safeguards. Each unique user of an API service is tracked to monitor transactional consumption. By setting safeguards for API rate limiting, an excessive load placed upon the API endpoint will result in HTTP 429 “Too Many Request” in response to abusive behavior. A key feature of F5 API Security is the fact that it is actionable in both directions, and also an in-line offering, unlike some API solutions which reside out of band and consume proxy logs for reporting and threat detection. With the automatic discovery of API endpoints, as seen in the following screenshot, the F5 administrator can see the full URL path which in this case exercises the familiar OpenAI /v1/chat/completions endpoint. As highlighted by the arrow, the schema of traffic to API endpoints is fully downloadable as an OpenAPI Specification (OAS), formerly known as a Swagger file. This layer of security means fields in API headers and bodies can be validated for syntax, such that a field whose schema expects a floating-point number can see any different encoding, such as a string, blocked in real-time in either direction. A possible and valuable use case: allow an initial unfettered access to a service such as OpenAI, by means of Prompt Security’s AI firewall service, for a matter of perhaps 48 hours. After a baseline of API endpoints has been observed, the API definition can be loaded from any saved Swagger files at the end of this “observation” period. The loaded version can be fully pruned of undesirable or disallowed endpoints, all future traffic must conform or be dropped. This is an example of a “positive security model”, considered a gold standard by many risk-adverse organizations. Simply put, a positive security model allows what has been agreed upon through and rejects everything else. This ability to learn and review your own traffic, and then only present Prompt Security with LLM endpoints that an organization wants exposed is an interesting example of complementing an AI security solution with rich API layer features. Summary The world of AI and LLMs is rapidly seeing investment, in time and money, from virtually all economic sectors; the promise of rapid dividends in the knowledge economy is hard to resist. As with any rapid deployment of new technology, safe consumption is not guaranteed, and it is not built in. Although LLMs often suggest guardrails are baked into offerings, a 30-second search of the Internet will expose firsthand experiences where unexpected outcomes when invoking AI are real. Brand reputation is at stake and false information can be hallucinated or coerced out of LLMs by determined parties. By combining the ability to ingest globally at high-speed dispersed users and apply a first level of security protections, F5 Distributed Cloud can be leveraged as an onboarding for LLM workloads. As depicted in this article, Prompt Security can in turn handle traffic egressing F5’s distributed HTTPS load balancers and provide state-of-the-art AI safeguards, including sensitive data detection, content moderation and other OWASP-aligned mechanisms like jailbreak and prompt injection mitigation. Other deployment models exist, including deploying Prompt Security’s solution on-premises, self-hosted in cloud tenants, and running the solution on Distributed Cloud CE nodes themselves is supported.
Steve_Gorman
Jun 24, 2024 Place Technical Articles
2KViews
4likes
0Comments
Secure RAG for Safe AI Deployments Using F5 Distributed Cloud and NetApp ONTAP
Retrieval Augmented Generation (RAG) is one of the most discussed techniques to empower Large Language Models (LLM) to deliver niche, hyper-focused responses pertaining to specialized, sometimes proprietary, bodies of knowledge documents. Two simple examples might include highly detailed company-specific information distilled from years of financial internal reporting from financial controllers or helpdesk type queries with the LLM harvesting only relevant knowledge base (KB) articles, releases notes, and private engineering documents not normally exposed in their entirety. RAG is highly bantered about in numerous good articles; the two principal values are: LLM responses to prompts (queries) based upon specific, niche knowledge as opposed to the general, vast pre-training generic LLMs are taught with; in fact, it is common to instruct LLMs not to answer specifically with any pre-trained knowledge. Only the content “augmenting” the prompt. Attribution is a key deliverable with RAG. Generally LLM pre-trained knowledge inquiries are difficult to traceback to a root source of truth. Prompts augmented with specific assistive knowledge normally solicit responses that clearly call out the source of the answers provided. Why is the Security of RAG Source Content Particularly Important? To maximize the efficacy of LLM solutions in the realm of artificial intelligence (AI) an often-repeated adage is “garbage in, garbage out” which succinctly states an obvious fact with RAG: valuable and actionable items must be entered into the model to expect valuable, tactical outcomes. This means exposing key forms of data, examples being data which might include patented knowledge, intellectual property not to be exposed in raw form to competitors. Actual trade secrets, which will infuse the LLM but need to remain confidential in their native form. In one example around trade secrets, the Government of Canada spells out a series of items courts will look at in determining compensation for misuse (theft) of intellectual property. It is notable that the first item listed is not the cost associated with creation of the secret material (“the cost in money or time of creating or developing the information”) but rather the very first item is instead how much effort was made to keep the content secure (“the measures taken to maintain secrecy”). With RAG, incoming queries are augmented with rich, semantically similar enterprise content. The content has already been populated into a vector database by converting documents, they might be pdf or docx as examples, into raw text form and converting chunks of text into vectors. The vectors are long sequences of numbers with similar mathematical attributes for similar content. As a trivial example, one-word chunks such as glass, cup, bucket, jar might be semantically related, meaning similarities can be construed by both human minds and LLMs. On the other hand, empathy, joy, and thoughtfulness maintain similarities of their own. This semantic approach means a phrase/sentence/paragraph (chunk) using bow to mean “to bend in respect” will be highly distinct from chunks referring to the “front end of a ship" or “something to tie one’s hair back with”, even a tool every violinist would need. The list goes on; all semantic meanings of bow are very different in these chunks and would have distinctive embeddings within a vector database. The word embedding is likely derived from “fixing” or “planting” an object. In this case, words are “embedded” into a contextual understanding. The typical length of the number sequence describing the meaning of items has typically been more than 700, but this number of “dimensions” applied is always a matter of research, and the entire vector database is arrived at with an embedding LLM, distinct from the main LLM that will produce generative AI responses to our queries. Incoming queries destined for the main generative AI LLM can, in turn, be converted to vectors themselves by the very same text-embedding “helper” LLM and through retrieval (the “R” in RAG) similar textual content can buttress the prompt presented to the main LLM (double click to expand). Since a critical cog in the wheel of the RAG architecture is the ingestion of valuable and sensitive source documents into the vector database, using the embedding LLM, it is not just prudent but critical that this source content be brought securely over networks to the embedding engine. F5 Distributed Cloud Secure Multicloud Networking and NetApp ONTAP For many practical, time-to-market reasons, modern LLMs, both the main and embedding instances, may not be collocated with the data vaults of modern enterprises. LLMs benefit from cloud compute and GPU access, something often in short supply for on-premises production roll outs. A typical approach assisted by the economies of scale might be to harvest public cloud providers, such as Azure, AWS, and Google Cloud Platform for the compute side of AI projects. Azure, as one example, can turn up virtual machines with GPUs from NVIDIA like A100, A2, and Tesla T4 to name a few. The documents needed to feed an effective RAG solution may well be on-premises, and this is unlikely to change for reasons including governance, regulatory, and the weight of decades of sound security practice. One of the leading on-premises storage solutions of the last 25 years is the NetApp ONTAP storage appliance family, and reflected in this quote from NVIDIA: "Nearly half of the files in the world are stored on-prem on NetApp." — Jensen Huang, CEO of NVIDIA A key deliverable of F5 Distributed Cloud is providing encrypted interconnectivity of disparate physical sites and heterogeneous cloud instances such as Azure VNETs or AWS VPCs. As such, there are two immediate, concurrent F5 features that come to mind: Secure interconnectivity of on-premises NetApp volumes (NAS) or LUNs (Block) containing critical documents for ingestion into RAG. Utilize encrypted L3 connectivity between the enterprise location and the cloud instance where the LLM/RAG are instantiated. TCP load balancers are another alternative for volume sharing NAS protocols like NFS or SMB/CIFS. Secure access to the LLM web interface or RESTful API end points, with HTTPS load balancers including key features like WAF, anti-bot mechanisms, and API automatic rate limiting for abusive prompt sources. The following diagram presents the topology this article set out to create, REs are “regional edge” sites maintained internationally by F5 and harness private RE to RE, high-speed global communication links. DNS names, such as the target name of an LLM service, will leverage mappings to anycast IP addresses, thus users entering the RE network from southeast Asian might, for example, enter the Singapore RE while users in Switzerland might enter via a Paris or Frankfurt RE. Complementing the REs are Customer Edge (CE) nodes. These are virtual or physical appliances which act as security demarcation points. For instance, a CE placed in an Azure VNET can protect access to the server supporting the LLM, removing any need for Internet access to the server, which is now entirely accessible only through a private RFC-1918 type of private address. External access to the LLM for just employees or, maybe employees and contractors, or potentially access for the Internet community is enabled by a distributed HTTPS load balancer. In the example depicted above, oriented towards full Internet access, the FQDN of the LLM is projected by the load balancer into the global DNS and consumers of the service resolve the name to one IP address and are attracted to the closest RE by BGP-4’s support for anycast. As the name “distributed” load balancer suggests, the origin pool can be in an entirely different site than the incoming RE, in this case the origin pool is the LLM behind the CE in the Azure VNET. The LLM requests travel from RE to CE via a highspeed networking underlay. The portion of the solution that securely ties the LLM to the source content required for RAG to embed vectors is, in this case, utilizing layer 3 multicloud networking (MCN). The solution is turnkey, routing table are automatically connected to members of the L3 MCN, in this case the inside interfaces of the Azure CE and Redmond, Washington on-premises CE and traffic flows over an encrypted underlay network. As such, the NetApp ONTAP cluster can securely expose volumes with key file ware via a protocol like Network File System (NFS), no risk of data exposure to third-party prying eyes exists. The following diagram drills into the RE and CE and NetApp interplay (double click to expand). F5 Distributed Cloud App Connect and LLM Setup This article speaks to hands-on experience with web-driven LLM inferencing with augmented prompts derived from a RAG implementation. The AI compute was instantiated on an Azure-hosted Ubuntu 20.04 virtual machine with 4 virtual cores. Installed software included Python 3.10, and libraries such as Langchain, Pypdf (for converting pdf documents to text), FAISS (for similarity searching via a vector database), and other libraries. The actual open source LLM utilized for the generative AI is found here on huggingface.co. The binary, which exceeds 4 GB, is considered effective for CPU-based deployments. The embedding LLM model, critical to seed the vector database with entries derived from secured enterprise documentation, and then used again per incoming query for RAG similarity searches to build augmented prompts, was from Hugging Face: sentence-transformers/all-MiniLM-L6-v2 and can be found here. The AI RAG solution was implemented in Python3, and as such the Azure Ubuntu can be accessed both by SSH or via Jupyter Notebooks. The latter was utilized as this is the preferred final delivery mechanism for standard users, not a web chatbot design or the requirement to use API commands through solutions like Postman or Curl. This design choice, to steer the user experience towards Jupyter Notebook consumption, is in keeping with the fact that it has become a standard in AI LLM usage where the LLM is tactical and vital to an enterprise's lines of business (LOBs). Jupyter Notebooks are web-accessed with a browser like Chrome or Edge and as such, F5’s WAF, anti-bot, and L7 DDoS, all part of the F5 WAAP offering, can easily be laid upon an HTTP load balancer with a few mouse clicks in XC to provide premium security to the user experience. NetApp and F5 Distributed Cloud Secure Multicloud Networking The secure access to files for ingestion into the vector database, for similarity searches when user queries are received, makes use of an encrypted L3 Multicloud Network relationship between the Azure VNET and the LAN on prem in Redmond, Washington hosting the NetApp ONTAP cluster. The specific protocol chosen was NFS and the simplicity is demonstrated by the use of just one Linux command to present key, high-valued documents for the AI to populate the database: #mount -t nfs <IP Address of NetAPP LIF interface on-prem>:/Secure_docs_for_RAG /home/ubuntu_restriced_user/rag_project/docs/Secure_docs_for_RAG. This address is available nowhere else in the world except behind this F5 CE in the Azure VNET. After the pdf files are converted to text, chunked to reasonable sizes with some overlap suggested between the end of one chunk and the start of the next chunk, the embedding LLM will populate the vector database. The files are always only accessed remotely by NFS through the mounted volume, and this mount may be terminated until new documents are ready to be added to the solution. The Objective RAG Implementation - Described In order to have a reasonable facsimile of the real-word use cases this solution will empower today, but not having any sensitive documents to be injected, it was decided to use some seminal “Internet Boom”-era IETF Requests for Comment (RFCs) as source content. With the rise of multi-port routing and switching devices, it became apparent the industry badly needed specific and highly precise definitions around network device (router and switch) performance benchmarking to allow purchasers “apples-to-apples” comparisons. These documents recommend testing parameters, such as what frame or packet sizes to test with, test iteration time lengths, when to use FIFO vs LIFO vs LILO definitions of latency, etc. RFC-1242 (Request for Comment, terminology) and RFC-2544 (methodologies), chaired by Scott Bradner of Harvard University, and the later RFC 2285 (LAN switching terminologies), chaired by Bob Mandeville then of European Network Laboratories are three prominent examples, to which test and measurement solutions aspired to be compliant. Detailed LLM answers for quality assurance engineers in the network equipment manufacturing (NEM) space is the intended use case of the design, answers that must be distilled specifically by generative AI considering queries augmented by RAG and specifically only based upon these industry-approved documents. These documents are, of course, not containing trade secrets or patented engineering designs. They are in fact publicly available from the IETF, however they are nicely representative of the value offered in sensitive environments. Validating RAG – Watching the Context Provided to the LLM To ensure RAG was working, the content being augmented in the prompt was displayed to screen, we would expect to see relevant clauses and sentences from the RFCs being provided to the generative AI LLM. Also, if we were to start by asking questions that were outside the purview of this testing/benchmarking topic, we should see the LLM struggle to provide users a meaningful answer. To achieve this, rather than, say, asking what 802.3/Ethernetv2 frame sizes should be used in throughput measurements, and what precisely is the industry standard definition of the term “throughput” was, the question instead pertained to a recent Netflix release, featuring Lindsay Lohan. Due to the recency of the film, even if the LLM leaned upon its pretrained model, it will come up with nothing meaningful. “Question: Important, only use information provided as context in the prompt, do not use other trained knowledge. Please identify who played Heather in the March 2024 Lindsay Lohan Netflix movie titled Irish Wish?” As seen in the following Jupyter screenshot, the RAG solution can only provide augmented prompts from the database, in this case it has some test and measurement clauses and some rules pertaining to the winter ice sport of curling (double click to enlarge). The supportive context data augmenting the prompt was, as expected, not going to help in this off-the-wall query, as only the RFCs and some sports rules had been introduced to the vector database. What is especially noteworthy is the LLM did not simply admit it could not possibly answer this question but hallucinated a very articulate response that indicated “Heather played for her team against Lindsay Lohan’s Team Ireland”. To validate the RAG architecture for a working scenario, actually pertinent to the movie, a film review in pdf format was added to the ONTAP cluster-housed volume. The supportive documentation was consumed (using Pypdf from the Jupyter Notebook it became text). It was automatically chunked and introduced to the vector database through the Hugging Face embedding LLM. With the RAG vector database populated by the embedding LLM, the responses then became meaningful. The following exchange shows the value of RAG in producing meaningful, thought-out answers: Question: Important, only use information provided as context in the prompt, do not use other pre-trained knowledge. Who does Maddie edit novels for in the Lindsay Lohan March 2024 Netflix movie titled Irish Wish? Answer: Let's think step by step. We know that Maddie Kelly plays a book editor who harbors a big secret, and we have been given information about her job as an editor for Paul Kennedy (played by Alexander Vlahos), which is also part of the question context provided above. Therefore it seems logical to infer that she edits novels in general but specifically those written or co-written by author Paul Kennedy - who happens to be a bestselling romance novelist as stated earlier on, so we can safely conclude Maddie works for Paul himself (as opposed to any other book publishing company). This time we see relevant, meaningful context provided by RAG, with relevant portions of the pdf file review document augmenting the prompt. The augmentations are normally not displayed unless requested by the user or programmer, unlike humans, LLMs do not require superfluous text formatting (advantage: LLMs) and thus the content is packed tightly and efficiently. Fewer characters also mean fewer tokens get used by the targeted AI model and can allow more data before a token context limit is reached. In pay-for-use LLM approaches fewer tokens also help the enterprise’s bottom line financially. Also, note that the answer will likely not always be identical with subsequent asks of the same question as per LLM normal behavior. Features like “temperature setting” can also allow more “creative” ideas in responses, injecting humor and even outlandishness if desired. The RAG workflow is now validated, but the LLMs in question (embedding and main generative LLM) can still be made better with these suggestions: Increase “chunk” sizes so ideas are not lost when excessive breaks make for short chunks. Increase “overlap” so an idea/concept is not lost at the demarcation point of two chunks. Most importantly, provide more context from the vector database as context lengths (maximum tokens in a request/response) are generally increasing in size. Llama2, for instance, typically has a 4,096 context length but can now be used with larger values, such as 32,768. This article used only 3 augmentations to the user query, better results could be attained by increasing this value at a potential cost of more CPU cycles. Using Secure RAG – F5 L3 MCN, HTTPS Load Balancers and NetApp ONTAP Together With the RAG architecture validated to be working, the solution was used to assist the target user entering queries to the Azure server by means of Jupyter Notebooks, with RAG documents ingested over encrypted, private networking to the on-premises ONTAP cluster NFS volumes. The questions posed, which are answerable by reading and understanding key portions spread throughout the Scott Bradner RFCs, was: “Important, only use information provided as context in the prompt, do not use other pre-trained knowledge. Please explain the specific definition of throughput? What 802.3 frame sizes should be used for benchmarking? How long should each test iteration last? If you cannot answer the questions exclusively with the details included in the prompt, simply say you are unable to answer the question accurately. Thank you." The Jupyter Notebook representation of this query, which is made in the Python language and issued from the user’s local browser anywhere in the world and directly against the Azure-hosted LLM, looks like the following (click to expand image): The next screenshot demonstrates the result, based upon the provided secure documents (double click to expand). The response is decent, however, the fact that it is clearly using the provided augmentations to the prompt, that is the key objective of this article. The accuracy of the response can be questionable in some areas, the Bradner RFCs highlighted the importance of 64-byte 802.3/Ethernetv2 frame sizes in testing, as line rate forwarding with this minimum size produces the highest theoretically possible frame per second load. In the era of software driven forwarding in switches and routers this was very demanding. Sixty-four byte frames result in 14,881 fps (frames per second) for 10BaseT, 148,809 fps for 100BaseT, 1.48 million fps for Gigabit Ethernet. These values were frequently more aspirational in earlier times and also a frequent metric used in network equipment purchasing cycles. Suspiciously, the LLM response calls out 64kB in 802.3 testing, not 64B, something which seems to be an error. Again, with this architecture, the actual LLM providing the generative AI responses is increasingly viewed as a commodity, alternative LLMs can be plugged quickly and easily into the RAG approach of this Jupyter Notebook. The end user, and thus the enterprise itself, is empowered to utilize different LLMs, purchased or open-source from sites like Hugging Face, to determine optimal results. The other key change that can affect the overall accuracy of results is to experiment with different embedding models. In fact, there are on-line “leader” boards strictly for embedding LLMs so one can quickly swap in and out various popular embedding LLMs to see the impact on results. Summary and Conclusions on F5 and NetApp as Enablers for Secure RAG This article demonstrated an approach to AI usage that leveraged the compute and GPU availability that can be found today within cloud providers such as Azure. To safely access such an AI platform for a production-grade enterprise requirement, F5 Distributed Cloud (XC) provided HTTPS load balancers to connect worker browsers to a Jupyter Notebook service on the AI platform, this service applies advanced security upon the traffic within the XC, from WAF to anti-bot to L3/L7 DDOS protections. Utilizing secure Multicloud Networking (MCN), F5 provided a private L3 connectivity service between the inside interface on an Azure VNET-based CE (customer edge) node and the inside interface of an on-premises CE node in a building in Redmond, Washington. This secure network facilitated an NFS remote volume, content on spindles/flash in on-premises NetApp ONTAP to be remotely mounted on the Azure server. This secure file access provided peace of mind to exposing potentially critical and private materials from NetApp ONTAP volumes to the AI offering. RAG was configured and files were ingested, populating a vector database within the Azure server, that allowed details, ideas, and recommendations to be harnessed by a generative AI LLM by augmenting user prompts with text gleaned from the vector database. Simple examples were used to first demonstrate that RAG was working by posing queries that should not have been addressed by the loaded secure content; such a query was not suitably answered as expected. The feeding of meaningful content from ONTAP was then demonstrated to unleash the potential of AI to address queries based upon meaningful .pdf files. Opportunities to improve results by swapping in and out the main generative AI model, as well as the embedding model, were also considered.
Steve_Gorman
Jun 03, 2024 Place Technical Articles
1.8KViews
3likes
0Comments