F5 App Connect and NetApp S3 Storage – Secured Scalable AI RAG
F5 Distributed Cloud (XC) is a SaaS solution that securely exposes services to the correct service consumers, whether the endpoints of the communications are in public clouds, private clouds, or on-prem data centers. This is particularly top of mind now because AI RAG implementations are easy to set up but are only truly effective when the correct, often guarded, enterprise data stores are consumed by the solution. A common scenario has the AI compute loads executing in one location, on-prem or perhaps in a cloud tenant, while the data to be ingested, embedded, and stored in a vector database to empower inferencing is distributed across many different geographies.
The data sources to be ingested into RAG are often stored on NetApp form factors, for instance StorageGRID, a native object-first clustered solution for housing S3 buckets. In the ONTAP family, where files are frequently accessed through NAS protocols like NFS or SMB, the RAG source content can today be exposed as objects through S3-compliant API calls, given the corresponding protocol license.
Technology Overview
The XC module App Connect leverages L4-L7 distributed load balancers to securely provide a conduit to enterprise NetApp-housed data for centralized AI workloads leveraging RAG. The setup objective for this article is as follows: although many customer edge (CE) sites exist, we aim to bring together corporate documents (objects) in a San Jose, California datacenter and a self-hosted AI/RAG solution running in a Seattle-area datacenter.
The Advantage of Load Balancers to Securely Attach to Your Data
Previous articles have leveraged the XC Network Connect module to bring together elements of NetApp storage through NAS protocols like NFS in order to run AI RAG workloads, both self-hosted and through secure consumption of Azure OpenAI. The Network Connect module provides secure L3 (IP) connectivity between disparate sites. Its advantages are that it supports all IP-based protocol transactions between sites, with firewall rules to preclude unwanted traffic. Network Connect is great when ease of deployment is paramount; however, if you know the protocols to be supported are HTTP- or TCP-based, read on about App Connect, a solution that can also address any IP overlap that may exist between the various sites to be interconnected.
App Connect is a different take on providing connectivity. It sets up a distributed load balancer between the consumer (in our case an AI server running LLMs and a vector database) and the services required (NAS- or S3-accessible remote data stores on NetApp appliances or hosted services). The load balancer may be an HTTPS offering, which allows F5's years of experience in web security solutions to be leveraged, including an industry-leading web application firewall (WAF) component. For non-web protocols, think NFS or SMB, a TCP-layer load balancer is available. A key advantage is that only the subnets where the consumer exists will ever receive connectivity and advertisements for the configured service. The load balancer can also expose origin pools that are not just privately addressed appliances; origin pools can also be Kubernetes services.
A final App Connect feature is noteworthy: the offering provided is an L4 through L7 service (such as HTTPS), and as such the local layer 3 environment of the consumer and, in our case, the storage offering is irrelevant. A complete overlap of IP space, perhaps both ends using the same 10.0.0.0/16 allotments, is acceptable, something extremely valuable within divisions of large corporations that have separately embraced similar RFC 1918 address spaces. Mergers and acquisitions, too, frequently leave a major institution with widespread instances of duplicate IP space in use, and IP renumbering projects are legendary for being lengthy and fraught with the risks of touching core routing tables. Lastly, applications that require users to configure IP addresses into GUIs are problematic because those values are dynamic; App Connect typically provides services by name, which is less burdensome for the IT staff who manage applications.
A Working AI RAG Setup Using NetApp StorageGRID and XC App Connect
A Ubuntu 22.04 Linux server was configured as a hosted LLM solution in a Seattle-area datacenter. The Ollama open-source project was installed in order to quickly serve both generative LLMs (llama3.1, mistral, and phi3 were all used for comparative results) and the required embedding LLM. The latter is needed to create vector embeddings of both the source enterprise documents and the subsequent real-time inference query payloads. Through semantic similarity analysis, RAG provides augmented prompts with useful and relevant enterprise data to the Ollama-served models for better AI outcomes.
Using the s3fs offering on Linux, one can quickly mount S3 buckets as file systems using FUSE (file system in user space). The net result is that any S3 bucket, supported natively by NetApp StorageGRID and through a protocol license on ONTAP appliances, can be mounted as a Linux folder for your RAG embedding pipeline to build a vector database. The key, really, is how to easily tie together S3-compliant data sources throughout a modern enterprise, no matter where they exist and what form factor they are in. This is where XC App Connect enters the equation, dropping in a modern distributed load balancer to project services across your network locations.
The first step in configuring the HTTPS Load balancer to connect sites is to enter the Multi-Cloud App Connect module of the XC console.
Once there, three key items need to be configured:
- An origin pool that points at the StorageGRID nodes, or at a local load balancer sitting in front of the nodes; these are private addresses within our San Jose site.
- An HTTPS load balancer that ties a virtual service name (in our case the arbitrary name s3content.local) to our origin pool.
- An advertisement policy establishing where the service name will be projected by DNS and where connectivity will be allowed; the service s3content.local is not projected into global DNS but rather is advertised only on the Seattle CE inside interface, essentially making this load balancer a private offering.
Here is the origin pool setup; in our case a BIG-IP is being used as a local load balancer for StorageGRID, and thus its private San Jose datacenter address is used.
To achieve the second item, an HTTPS load balancer, we key in the following fields, including the FQDN of the service (s3content.local), the fact that we will provide a TLS certificate/key pair to be used by the load balancer, and the one-check option to create an HTTP-to-HTTPS redirection service too.
Lastly, advertisement of our service is supported only by the CE node at the Seattle site; requests for s3content.local from our AI server will resolve to that CE node's inside network interface IP address. The App Connect load balancer ensures the underlying connectivity, through the underlay network, to the origin pool (StorageGRID) in San Jose.
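This split-horizon behavior can be sanity-checked from the AI server before any S3 traffic flows; here is a minimal sketch using only the Python standard library, where the address in the comment is purely illustrative:

import socket

# Confirm s3content.local resolves locally to the Seattle CE node's
# inside-interface address rather than to any public IP.
addrs = {info[4][0] for info in socket.getaddrinfo("s3content.local", 443)}
print(addrs)  # expected: the CE inside interface, e.g. {'10.20.0.5'} (illustrative)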
RAG Using Ollama-Based Models and Remote StorageGRID Content
Various methods exist to leverage freely available, downloadable LLMs. One popular approach is Hugging Face, while another is to leverage the Ollama framework and download both embedding and generative models from ollama.com. The latter approach was followed in this article and, in keeping with past explorations, Python 3 was used to drive the AI programmatically, including the RAG indexing tasks and the subsequent inferencing jobs.
Ollama supports a Docker-like syntax when used interactively from the command line; the one embedding model and three generative models are seen below (from an Ubuntu 22.04 terminal).
$ ollama ls
NAME                      ID            SIZE    MODIFIED
llama3.1:latest           42182419e950  4.7 GB  6 days ago
mistral:latest            974a74358d6   4.1 GB  6 days ago
nomic-embed-text:latest   0a109f422b47  274 MB  6 days ago
phi3:latest               4f2222927938  2.2 GB  7 days ago
The RAG tests included ingestion of both .txt and .pdf documents provided by App Connect from NetApp StorageGRID. A private CA certificate was created using an OpenSSL-derived tool and loaded into the Seattle Linux and Windows hosts. That CA cert was then used to create a PKCS#12-packaged certificate and key set for s3content.local, which was uploaded to the HTTPS load balancer setup on Distributed Cloud. A quick Windows-based S3 Browser test confirmed reachability from Seattle:
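The same reachability check can be scripted from the Linux host; below is a minimal boto3 sketch, where the access keys are hypothetical placeholders and the CA bundle path points at the private CA cert created above:

import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3content.local",
    aws_access_key_id="EXAMPLEKEY",          # hypothetical credential
    aws_secret_access_key="EXAMPLESECRET",   # hypothetical credential
    verify="/home/steve/private-ca.pem",     # the private CA cert from above
    config=Config(s3={"addressing_style": "path"}),  # matches the path-style mount used later
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])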
To support RAG document ingestion on Linux, the s3fs project was used. A Linux folder can be mounted using NAS protocols like NFS; a simple example might be:
#sudo mount -t nfs 10.50.0.202:/f5busdev /home/steve/f5bizdev/
Using s3fs, folders can similarly be mounted that tie back to buckets on the remote StorageGRID:
#s3fs mybucket001 /home/steve/rag-files-001 -o url=https://s3content.local:443 -o use_path_request_style -o passwd_file=/home/steve/.passwd-s3fs
At this point, RAG ingestion can take place. The net effectiveness of RAG is often attributed to the quality of the embedding LLM used. In this case the Python scripts leveraged the Ollama API and the locally downloaded nomic-embed-text model, one that is generally considered just passable but is small and runs well on CPU-only hosts.
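To make the pipeline concrete, here is a minimal ingestion sketch using the Ollama Python API and ChromaDB (the vector database discussed later); the file name, collection name, and chunking parameters are illustrative rather than the exact script used:

import ollama
import chromadb

# Persistent vector database; the collection name is illustrative.
client = chromadb.PersistentClient(path="./chroma-db")
collection = client.get_or_create_collection("netapp-docs")

def chunk(text, size=1000, overlap=200):
    # Naive fixed-size character chunking with overlap between neighbors.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# The folder below is the s3fs mount backed by the remote StorageGRID bucket.
with open("/home/steve/rag-files-001/great_gatsby.txt") as f:
    for idx, piece in enumerate(chunk(f.read())):
        emb = ollama.embeddings(model="nomic-embed-text", prompt=piece)["embedding"]
        collection.add(ids=[f"gatsby-{idx}"], embeddings=[emb], documents=[piece])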
The Project Gutenberg novel The Great Gatsby by F. Scott Fitzgerald was consumed from the remote NetApp bucket, and a detailed question was asked of llama3.1, first without RAG's augmentation, meaning only the pre-trained data was leveraged.
$ python search.py What university did Gatsby attend?
Answered without RAG: “In the novel "The Great Gatsby" by F. Scott Fitzgerald, the character of Jay Gatsby does not reveal where he attended university, as it is never explicitly stated in the book. However, it has been speculated and suggested in various adaptations and interpretations that he may have studied at St. Olaf College in Minnesota or Yale University due to some similarities with their campus architecture in the novel. But again, this remains purely speculative and not confirmed by Fitzgerald himself.”
Note, neither the user prompt shown nor the system prompt told llama3.1 that the question pertained to a specific novel; it figured that part out on its own based on the impressive base training baked into foundational models. However, the answer is imperfect. With RAG supporting the query, the result was much better; it mimics the response of an individual who had actually just read the entire novel, cover to cover, and is able to provide more nuanced and accurate answers (trimmed for brevity).
$ python search.py What university did Gatsby attend?
Answered with RAG: "According to the text, Gatsby attended 'Oggsford College' in England, not Oxford University. It is also mentioned later in the text that Gatsby told Mr. Wolfshiem that he had been employed by Dan Cody on a boat, and that this occurred after World War I, but before he claimed to have attended Oggsford College (which suggests that his attendance at the college may be questionable)."
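For reference, here is a stripped-down sketch of what a script like search.py might do, reusing the collection built in the ingestion sketch above; the prompt wording and the k value of 5 are illustrative choices:

import sys
import ollama
import chromadb

question = " ".join(sys.argv[1:])
collection = chromadb.PersistentClient(path="./chroma-db").get_collection("netapp-docs")

# Embed the question and retrieve the k most semantically similar chunks.
q_emb = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
hits = collection.query(query_embeddings=[q_emb], n_results=5)
context = "\n\n".join(hits["documents"][0])

# Augment the prompt with the retrieved enterprise text and generate.
answer = ollama.generate(
    model="llama3.1",
    prompt=f"Answer using only this context:\n{context}\n\nQuestion: {question}",
)
print(answer["response"])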
Pointing Linux-mounted folders towards buckets containing PDF documents seemed to work best when the source documents were smaller. For instance, user manuals worked well, even though the embeddings focused on textual chunks and disregarded diagrams. This script was instructed to provide attributions within the augmented text handed to the LLM, specifically the manual page number and the document chunk number from that page. The following is a result using the smallest generative LLM tested, the Phi3 model, quizzing it about some lawn maintenance equipment.
$ python query_data.py "Are there any concerns with the muffler on this lawn mower?"
Response with RAG: Yes, according to the given context, there is a concern about the muffler on this lawn mower. It is stated that if the engine has been running, the muffler will be hot and can severely burn you. Therefore, it is recommended to keep away from the hot muffler.
Sources: ['../rag-files-003/toro_lawn_mower_manual.pdf:12:2', '../rag-files-003/toro_lawn_mower_manual.pdf:12:1', '../rag-files-003/toro_lawn_mower_manual.pdf:21:1',
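One way a script like query_data.py could carry such attributions, hypothetically, is to record the source path, page, and chunk index as ChromaDB metadata at ingestion time, so they can be echoed back alongside each retrieved chunk; field names here are illustrative, and emb and piece are reused from the ingestion sketch above:

# The ID mirrors the path:page:chunk convention seen in the Sources output.
collection.add(
    ids=["../rag-files-003/toro_lawn_mower_manual.pdf:12:2"],
    embeddings=[emb],
    documents=[piece],
    metadatas=[{"source": "toro_lawn_mower_manual.pdf", "page": 12, "chunk": 2}],
)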
The findings were less positive with very large PDF documents. For instance, Joseph Hickey's 1975 book A Guide to Bird Watching is available in the public domain, totals almost 300 pages, and is 32 megabytes in size. Regardless of the LLM, mistral and llama3.1 included, rarely were questions taken directly from specific pages answered with precision. Questions supported by statements buried within the text, such as "Where was the prairie chicken first observed?" or "Have greenfinches ever been spotted in North America and how did they arrive?", went unanswered.
Get Better Mileage from Your RAG
To optimize RAG, it's unlikely the generative LLMs are at fault; with Ollama allowing serial tests, it was quickly observed that none of Mistral, Llama3.1, or Phi3 differed when RAG struggled. The most likely route to improved responses is to experiment with the embedding LLM. The ability to derive semantic meaning from paragraphs of text varies by model; Hugging Face provides a leaderboard for embedding LLMs with its own appraisal via the performance scoring system of the Massive Text Embedding Benchmark (MTEB).
Another idea is to use significantly larger chunk sizes for large documents, reducing the overall number of vector embeddings being semantically compared, although a traditional 2,048-token inferencing context window limits how much augmented text can be provided per RAG-enabled prompt; at a rough rule of thumb of four characters per token, that window amounts to only about 8,000 characters shared between the retrieved chunks and the question itself.
Finally, multiple approaches exist for actually choosing similar vector embeddings from the database, such as cosine similarity or Euclidean distance. In these experiments, the native feature to find the "k" most similar vectors was provided by ChromaDB itself. Other methods that perform this critical search for related, helpful content include Facebook AI Similarity Search (FAISS), which decouples the search feature from the database, reducing the risk of vector DB vendor lock-in. Other libraries, such as the compute-cosine-similarity library, are available online, including support for languages like JavaScript and TypeScript.
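For illustration, cosine similarity, one of the measures mentioned above, reduces to a few lines of Python; the two short vectors below are toy stand-ins for real embedding vectors:

import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine_similarity([0.1, 0.7, 0.2], [0.2, 0.6, 0.3]))  # ~0.97, highly similar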
Future experiments with better embedding models, or with larger document chunk sizes, might well produce even better results when scaling a RAG deployment up to enterprise volumes and the larger documents expected there.
F5 XC App Connect for Security and Dashboard Monitoring
The F5 Distributed Cloud load balancers provide a wealth of performance and security visibility. For any specific time range, a glance at the Performance Dashboard for our HTTPS load balancer gives rich details, including how much data has been moved to and from NetApp, the browser types used, and specifically which buckets are active on the StorageGRID.
Enterprises frequently invest heavily in dedicated monitoring solutions, everything from raw packet loggers to application performance measurement (APM) tools, sometimes offering PCAP exports to be consumed in turn by other tools such as Wireshark. Distributed Cloud load balancers, while first and foremost a secure communications solution, also make a wealth of monitoring information available. Both directions of a transaction, the request coupled with its response as a single entity, are monitored and available both in a decoded information pane and in a rawer JSON format for consumption by other tools. Here is one S3 object write from Seattle, crossing the XC HTTPS load balancer and storing content on the San Jose StorageGRID.
The nature of the client, including the browser agent (S3 Browser in the image) and the TLS signature of this client, is available, as well as the object name and the bucket it targeted on the NetApp appliance.
Useful Approaches for Managing and Securing Storage API Traffic
A powerful Distributed Cloud module that homes in on NetApp traffic, in this case carried between cities by S3-compliant APIs, is the API Discovery module. As seen in the following image, an operator can add approved API endpoints to the "Inventory", similar to adding the endpoint to a Swagger file, something easily exported from the GUI and potentially integral to enterprise API documentation. As denoted in the next image, the remaining "Shadow" endpoints, all automatically discovered, can then quickly be brought to the operator's attention, enabling a positive security approach whereby shadow traffic can be blocked immediately by the load balancer. The result is a quick method of blocking unsanctioned access to specific StorageGRID buckets.
Also worth noting in the above screenshot, the most active APIs for a selected time period, anywhere from 5 minutes to a full day, are brought to the operator's attention. Finally, a last example of the potential value of the API Discovery module is the sensitive-data columns. Both custom data types observed in flight (such as phone numbers or national health ID values) and non-compliance with industry-wide guidelines (such as PCI DSS) are flagged per S3 API endpoint.
Automatic Malicious User Mitigation
Distributed Cloud also offers a valuable real-time malicious user mitigation feature. Using behavioral observations, many harnessing AI itself, rogue consumers or generators of S3 objects can be automatically blocked. This may be of particular use when the distributed HTTPS load balancer provides NetApp S3 access to a wider range of participants; think of partner enterprises with software CE sites installed at their locations. Or, ramping up further, consider general Internet access, where the App Connect load balancer projects access to FQDNs through global DNS and the load balancer is instantiated on an international network of regional edge (RE) sites in 30+ major metropolitan markets.
This user mitigation feature is enacted by first tying a user identity policy to the HTTPS load balancer. One valuable approach is to make use of state-of-the-art client-side TLS fingerprinting, JA4 signatures. Combining TLS fingerprinting with other elements, such as the client source IP, assists in categorizing the unique user driving any attempted transaction.
With this selection in place, an operator need only flip "on" the automatic mitigations, and security will be ratcheted up for the load balancer. As seen in the following screenshot, XC and its algorithms gauge the threat level presented by users and respond accordingly. For detected low threat levels, JavaScript challenges can be presented to the client browser, something that requires no human intervention but assists in isolating DDoS attackers from legitimate service clients.
For behavior consistent with medium threat levels, something like a Captcha challenge can be selected, for cases where a human should be driving interactions with the service presented by the load balancer. Finally, upon the perception of high threat levels, Distributed Cloud will see to it that the user is placed into a temporarily blocked state.
Summary
In this article we have demonstrated that the F5 Distributed Cloud App Connect module can be set up to provide a distributed L4-L7 load balancer that brings remote islands of storage, in this case NetApp appliances, to a centralized RAG AI compute platform. Although TCP-based NAS protocols like NFS could be utilized through TCP load balancing, this particular article focused on the growing S3-compatible API approach to object retrieval, which uses HTTPS as transport.
Through a pair of CE sites, an S3 service was projected from San Jose to Seattle; the origin pool was a cluster of NetApp StorageGRID nodes, and the consuming entity was an Ubuntu host running Ollama-served LLMs in support of RAG. The service was exclusively projected to Seattle, and the net result was AI outcomes that consumed representative novels, research guides, and product user manuals to deliver contextually meaningful responses. The App Connect module empowers features such as rich transactional metrics, API discovery coupled with enforcement of S3 bucket access, and finally the ability to auto-mitigate worrisome users with reactions in accordance with threat risk levels.