Model Serving
3 Topics- Securing Model Serving in Red Hat OpenShift AI (on ROSA) with F5 Distributed Cloud API SecurityLearn how Red Hat OpenShift AI on ROSA and F5 Distributed Cloud API Security work together to protect generative AI model inference endpoints. This integration ensures robust API discovery, schema enforcement, LLM-aware threat detection, bot mitigation, sensitive data redaction, and continuous observability—enabling secure, compliant, and high-performance AI-driven experiences at scale.1.1KViews5likes4Comments
- Agentic RAG - Securing GenAI with F5 Distributed Cloud ServicesAgentic RAG (Retrieval-Augmented Generation) enhances the capabilities of a GenAI chatbot by integrating dynamic knowledge retrieval into its conversational abilities, making it more context-aware and accurate. In this demo, I will focus on security aspect of the solution. This demonstration will highlight the various security measures implemented and enforced in our AI reference architecture for this Agentic RAG. F5 is a trusted leader in security, with a track record of delivering robust solutions for securing applications and networks. Recognized by many independent evaluations as a Leader in Web Application and API Security from IDC, SC Award, TrustRadius, EMA, and many more, F5 exemplifies excellence and innovation. These endorsements affirm F5’s expertise, reassuring organizations that their digital assets are protected by a capable, reputable partner that keeps pace with evolving security needs.300Views2likes0Comments
- Secure AI RAG using F5 Distributed Cloud in Red Hat OpenShift AI and NetApp ONTAP EnvironmentIntroduction Retrieval Augmented Generation (RAG) is a powerful technique that allows Large Language Models (LLMs) to access information beyond their training data. The “R” in RAG refers to the data retrieval process, where the system retrieves relevant information from an external knowledge base based on the input query. Next, the “A” in RAG represents the augmentation of context enrichment, as the system combines the retrieved relevant information and the input query to create a more comprehensive prompt for the LLM. Lastly, the “G” in RAG stands for response generation, where the LLM generates a response with a more contextually accurate output based on the augmented prompt as a result. RAG is becoming increasingly popular in enterprise AI applications due to its ability to provide more accurate and contextually relevant responses to a wide range of queries. However, deploying RAG can introduce complexity due to its components being located in different environments. For instance, the datastore or corpus, which is a collection of data, is typically on-premise for enhanced control over data access and management due to data security, governance, and compliance with regulations within the enterprise. Meanwhile, inference services are often deployed in the cloud for their scalability and cost-effectiveness. In this article, we will discuss how F5 Distributed Cloud can simplify the complexity and securely connect all RAG components seamlessly for enterprise RAG-enabled AI applications deployments. Specifically, we will focus on Network Connect, App Connect, and Web App & API Protection. We will demonstrate how these F5 Distributed Cloud features can be leveraged to secure RAG in collaboration with Red Hat OpenShift AI and NetApp ONTAP. Example Topology F5 Distributed Cloud Network Connect F5 Distributed Cloud Network Connect enables seamless and secure network connectivity across hybrid and multicloud environments. By deploying F5 Distributed Cloud Customer Edge (CE) at site, it allows us to easily establish encrypted site-to-site connectivity across on-premises, multi-cloud, and edge environment. Jensen Huang, CEO of NVIDIA, has said that "Nearly half of the files in the world are stored on-prem on NetApp.”. In our example, enterprise data stores are deployed on NetApp ONTAP in a data center in Seattle managed by organization B (Segment-B: s-gorman-production-segment), while RAG services, including embedding Large Language Model (LLM) and vector database, is deployed on-premise on a Red Hat OpenShift cluster in a data center in California managed by Organization A (Segment-A: jy-ocp). By leveraging F5 Distributed Cloud Network Connect, we can quickly and easily establish a secure connection for seamless and efficient data transfer from the enterprise data stores to RAG services between these two segments only: F5 Distributed Cloud CE can be deployed as a virtual machine (VM) or as a pod on a Red Hat OpenShift cluster. In California, we deploy the CE as a VM using Red Hat OpenShift Virtualization — click here to find out more on Deploying F5 Distributed Cloud Customer Edge in Red Hat OpenShift Virtualization: Segment-A: jy-ocp on CE in California and Segment-B: s-gorman-production-segment on CE in Seattle: Simply and securely connect Segment-A: jy-ocp and Segment-B: s-gorman-production-segment only, using Segment Connector: NetApp ONTAP in Seattle has a LUN named “tbd-RAG”, which serves as the enterprise data store in our demo setup and contains a collection of data. After these two data centers are connected using F5 XC Network Connect, a secure encrypted end-to-end connection is established between them. In our example, “test-ai-tbd” is in the data center in California where it hosts the RAG services, including embedding Large Language Model (LLM) and vector database, and it can now successfully connect to the enterprise data stores on NetApp ONTAP in the data center in Seattle: F5 Distributed Cloud App Connect F5 Distributed Cloud App Connect securely connects and delivers distributed applications and services across hybrid and multicloud environments. By utilizing F5 Distributed Cloud App Connect, we can direct the inference traffic through F5 Distributed Cloud's security layers to safeguard our inference endpoints. Red Hat OpenShift on Amazon Web Services (ROSA) is a fully managed service that allows users to develop, run, and scale applications in a native AWS environment. We can host our inference service on ROSA so that we can leverage the scalability, cost-effectiveness, and numerous benefits of AWS’s managed infrastructure services. For instance, we can host our inference service on ROSA by deploying Ollama with multiple AI/ML models: Or, we can enable Model Serving on Red Hat OpenShift AI (RHOAI). Red Hat OpenShift AI (RHOAI) is a flexible and scalable AI/ML platform builds on the capabilities of Red Hat OpenShift that facilitates collaboration among data scientists, engineers, and app developers. This platform allows them to serve, build, train, deploy, test, and monitor AI/ML models and applications either on-premise or in the cloud, fostering efficient innovation within organizations. In our example, we use Red Hat OpenShift AI (RHOAI) Model Serving on ROSA for our inference service: Once inference service is deployed on ROSA, we can utilize F5 Distributed Cloud to secure our inference endpoint by steering the inference traffic through F5 Distributed Cloud's security layers, which offers an extensive suite of features designed specifically for the security of modern AI/ML inference endpoints. This setup would allow us to scrutinize requests, implement policies for detected threats, and protect sensitive datasets before they reach the inferencing service hosted within ROSA. In our example, we setup a F5 Distributed Cloud HTTP Load Balancer (rhoai-llm-serving.f5-demo.com), and we advertise it to the CE in the datacenter in California only: We now reach our Red Hat OpenShift AI (RHOAI) inference endpoint through F5 Distributed Cloud: F5 Distributed Cloud Web App & API Protection F5 Distributed Cloud Web App & API Protection provides comprehensive sets of security features, and uniform observability and policy enforcement to protect apps and APIs across hybrid and multicloud environments. We utilize F5 Distributed Cloud App Connect to steer the inference traffic through F5 Distributed Cloud to secure our inference endpoint. In our example, we protect our Red Hat OpenShift AI (RHOAI) inference endpoint by rate-limiting the access, so that we can ensure no single client would exhaust the inference service: A "Too Many Requests" is received in the response when a single client repeatedly requests access to the inference service at a rate higher than the configured threshold: This is just one of the many security features to protect our inference service. Click here to find out more on Securing Model Serving in Red Hat OpenShift AI (on ROSA) with F5 Distributed Cloud API Security. Demonstration In a real-world scenario, the front-end application could be hosted on the cloud, or hosted at the edge, or served through F5 Distributed Cloud, offering flexible alternatives for efficient application delivery based on user preferences and specific needs. To illustrate how all the discussed components work seamlessly together, we simplify our example by deploying Open WebUI as the front-end application on the Red Hat OpenShift cluster in the data center in California, which includes RAG services. While a DPU or GPU could be used for improved performance, our setup utilizes a CPU for inferencing tasks. We connect our app to our enterprise data stores deployed on NetApp ONTAP in the data center in Seattle using F5 Distributed Cloud Network Connect, where we have a copy of "Chapter 1. About the Migration Toolkit for Virtualization" from Red Hat. These documents are processed and saved to the Vector DB: Our embedding Large Language Model (LLM) is Sentence-Transformers/all-MiniLM-L6-v2, and here is our RAG template: Instead of connecting to the inference endpoint on Red Hat OpenShift AI (RHOAI) on ROSA directly, we connect to the F5 Distributed Cloud HTTP Load Balancer (rhoai-llm-serving.f5-demo.com) from F5 Distributed Cloud App Connect: Previously, we asked, "What is MTV?“ and we never received a response related to Red Hat Migration Toolkit for Virtualization: Now, let's try asking the same question again with RAG services enabled: We finally received the response we had anticipated. Next, we use F5 Distributed Cloud Web App & API Protection to safeguard our Red Hat OpenShift AI (RHOAI) inference endpoint on ROSA by rate-limiting the access, thus preventing a single client from exhausting the inference service: As expected, we received "Too Many Requests" in the response on our app upon requesting the inference service at a rate greater than the set threshold: With F5 Distributed Cloud's real-time observability and security analytics from the F5 Distributed Console, we can proactively monitor for potential threats. For example, if necessary, we can block a client from accessing the inference service by adding it to the Blocked Clients List: As expected, this specific client is now unable to access the inference service: Summary Deploying and securing RAG for enterprise RAG-enabled AI applications in a multi-vendor, hybrid, and multi-cloud environment can present complex challenges. In collaboration with Red Hat OpenShift AI (RHOAI) and NetApp ONTAP, F5 Distributed Cloud provides an effortless solution that secures RAG components seamlessly for enterprise RAG-enabled AI applications.432Views1like0Comments