Securely Scale RAG - Azure OpenAI Service, F5 Distributed Cloud and NetApp
Arguably, the easiest and most massively scalable approach to harnessing Large Language Models (LLMs) is to consume leading services like OpenAI endpoints, the most well-known of cloud-based offering delivered to enterprises over the general Internet. Access to hardware, such as GPUs, and the significant skillset to run LLMs on your own become non-issues, consumption is simply an API call away.
One concern, and a serious one, is that sensitive inferencing (AI prompts, both the requests and responses) travels "in the wild" to these LLMs found through DNS at public endpoints. Retrieval Augmented Generation (RAG) adds potentially very sensitive corporate data to prompts, to leverage AI for internal use cases, thus ratcheting up even further the uneasiness with using the general Internet as a conduit to reach LLMs. RAG is a popular method to greatly increase the accuracy and relevancy of generative AI for a company’s unique set of problems.
Finally, to leverage sensitive data with RAG, the source documents must be vectorized with similarly remote “embedding” LLMS; once again sensitive, potentially proprietary sensitive data will leave the corporate premises to leverage the large AI solutions like OpenAI or Azure OpenAI.
Unlike purveyors of locally executed models, say a repository like Huggingface.com, which allow downloading of binaries to be harnessed on local compute, industry leading solutions like OpenAI and Azure OpenAI Service are founded on the paradigm of remote compute. Beyond the complexity and resources of quickly and correctly setting up performant on-prem models one time, the choice to consume remote endpoints allows hassle-free management like models perpetually updated to latest revisions and full white-glove support available to enterprise customers consuming SaaS AI models.
In this article, an approach will be presented where, using F5 Distributed Cloud (XC) and NetApp, Azure OpenAI Service can be leveraged with privacy, where prompts are carried over secured, encrypted tunnels over XC between on-premises enterprise locations and that enterprise’s Azure VNET.
The Azure OpenAI models are then exclusively exposed as private endpoints within that VNET, nowhere else in the world. This means both the embedding LLM activity to vectorize sensitive corporate data, and the actual generative AI prompts to harness value from that data are encrypted in flight. All source data and resultant vector databases remain on-premises in well-known solutions like a NetApp ONTAP storage appliance.
Why is the Azure OpenAI Service a Practical Enabler of AI Projects?
Some of the items that distinguish Azure OpenAI Service include the following:
- Prompts sent to Azure OpenAI are not forwarded to OpenAI, the service exists within Microsoft Azure, benefiting from the performance of Microsoft’s enormous cloud computing platform
- Customer prompts are never used for training data to build new or refine existing models
- Simplified billing, think of the Azure OpenAI Service as analogous to an “all you can eat buffet”, simply harness the AI service and settle the charge incurred on a regular monthly billing cycle
With OpenAI, models are exposed at universal endpoints shared by a global audience, added HTTP headers such as the OPENAI_API_KEY value distinguish users and allow billing to occur in accordance with consumption. Azure OpenAI Service is slightly different. No models actually exist to be used until they are setup under an Azure subscription. At this point, beyond receiving an API key to identify the source user, the other major difference is unique API "base" URL (endpoint) is setup for accessing LLMs an organization wishes to use. Examples would be a truly unique enterprise endpoint for GPT-3.5-Turbo, GPT4 or perhaps an embedding LLM used in vectorization, such as the popular text-embedding-ada-002 LLM.
This second feature of Azure OpenAI Service presents a powerful opportunity to F5 Distributed Cloud (XC) customers. This stems from the fact that unlike traditional OpenAI, this per-organization, unique base URL for API communications does not have to be projected into the global DNS, reachable from anywhere on the Internet. Instead, Microsoft Azure allows the OpenAI service to be constrained to a private endpoint, accessible only from where the customer chooses. Leveraging F5 XC Multicloud Networking offers a way to secure and encrypt communications between on-premises locations and Azure subnets only available from within the organization. What does this add up to for the enterprise with generative AI projects? It means huge scalability for AI services and consuming the very much leading-edge modern OpenAI models, all in a simple manner an enterprise can now consume today with limited technical onus on corporate technology services. The sense of certainty that sensitive data is not cavalierly exposed on the Internet is a critical cog in the wheel of good data governance.
Tap Into Secure Data from NetApp ONTAP Clusters for Fortified Access to OpenAI Models
The F5 Distributed Cloud global fabric consists of points of presence in 26+ metropolitan markets worldwide, such as Paris, New York, Singapore, that are interconnected with high-speed links aggregating to more than 14 Tbps of bandwidth in total, it is growing quarterly. With the F5 multicloud networking (MCN) solution, customers can easily set up dual-active encrypted tunnels (IPSec or SSL) to two points on the global fabric. The instances connected to are referred to as RE’s (Regional Edge nodes) and the customer-side sites are made up of CE’s (Customer Edge nodes, scalable from one to a full cluster). The service is a SaaS solution and setup is turn-key based upon menu click-ops or Terraform.
The customer sites, beyond being in bricks-and-mortar customer data centers and office locations, can also exist within cloud locations such as Microsoft Azure Resource Groups or AWS VPCs, among others. Enterprise customers with existing bandwidth solutions may choose to directly interconnect sites as opposed to leveraging the high-speed F5 global fabric.
The net result of an F5 XC Layer 3 multicloud network is high-speed, encrypted communications between customer sites. By disabling the default network access provided by Azure OpenAI Service, and only allowing private endpoint access, one can instantiate a private approach to running workloads with well-known OpenAI models.
With this deployment in place, customers may tap into years of data acquired and stored on trusted on-premises NetApp storage appliances to inject value into AI use cases, customized and enhanced inference results using well-regarded, industry-leading OpenAI models. A perennial industry leader in storage is ONTAP from NetApp, a solution that can safely expose volumes to file systems, through protocols such as NFS and SMB/CIFS. The ability to also expose LUNs, meaning block-level data that constitutes remote disks, is also available using protocols like iSCSI.
In the preceding diagram, one can leverage AI through a standard Python approach, in the case shown harnessing an Ubuntu Linux server, and volumes provided by ONTAP. AI jobs, rather than calling out to an Internet-routed Azure OpenAI public endpoint can instead interact with a private endpoint, one which resolves through private DNS to an address on a subnet behind a customer Azure CE node. This endpoint cannot be reached from the Internet, it is restricted to only communicating with customer subnets (routes) located in the L3 multicloud deployment.
In use cases that leverage one’s own data, a leading approach is Retrieval Augmented Generation (RAG) in order to empower Large Language Models (LLMs) to deliver niche, hyper-focused responses pertaining to specialized, sometimes proprietary, documents representing the corporate body of knowledge. Simple examples might include highly detailed, potentially confidential, company-specific information distilled from years of financial internal reporting. Another prominent early use case of RAG is to backstop frontline, customer helpdesk employees. With customers sensitive to delays in handling support requests, and pressure to reduce support staff research delays, the OpenAI LLM can harvest only relevant knowledge base (KB) articles, releases notes, and private engineering documents not normally exposed in their entirety. The net result is a much more effective helpdesk experience, with precise, relevant help provided to the support desk employee in seconds.
RAG Using Microsoft Azure OpenAI, F5 and NetApp in Nutshell
In the sample deployment, one of the more important items to recognize is that two OpenAI models will be harnessed, an embedding LLM and a generative transformer based GPT family LLM. A simple depiction of RAG would be as follows:
Using OpenAI Embedding LLMs
The OpenAI embedding model text-embedding-ada-002 is used first to vectorize data sourced from the on-premises ONTAP system, via NFS volumes mounted to the server hosting Python. The embedding model consumes “chunks” of text from each sourced document and converts the text to numbers, specifically long sequences of numbers, typically in the range of 700 to 1,500 values. These are known as vectors. The vectors returned in the private OpenAI calls are then stored in a vector database, in this case ChromaDB was used. It is important to note, the ChromaDB itself was directed to install itself within a volume supported by the on-premises ONTAP cluster, as such the content at rest is governed by the same security governance as the source content in its native format. Other common industry solutions for vector storage and searches include Milvus and for those looking to cloud-hosted vectors Pinecone.
Vector databases are purpose-built to manage vector embeddings. Conventional databases can, in fact, store vectors but the art of doing a semantic search, finding similarities between vectors, would then require vector indices solutions. One of the best known in FAISS (Facebook AI Similarity Search) which is a library that allows developers to quickly search for embeddings of multimedia documents. These semantic searches would otherwise be inefficient or impossible with standard database engines (SQL).
When a prompt is first generated by a client, the text in the prompt is vectorized by the very same OpenAI embedding model, producing a vector on the fly. The key to RAG, the “retriever” function, then compares the newly arrived query with semantically similar text chunks in the database. The actual semantic similarity of the query and previously stored chunks is arrived at through a nearest neighbor search of the vectors, in other words, phrases and sentences that might augment the original prompt can be provided to the OpenAI GPT model.
The art of finding semantic similarities relies upon comparing the lengthy vectors. The objective, for instance, to find supportive text around the user query “how to nurture shrub growth” might reasonably align more closely with a previously vectorized paragraph that included “gardening tips for the North American spring of 2024” and less so with vectorized content stemming from a user guide for the departmental photocopy machine.
The suspected closeness of vectors, are text samples actually similar topic wise, is a feature of semantic similarity search algorithms, many exist in the marketplace and two approaches commonly leveraged are cosine similarity and Euclidean distance; a brief description for those interested can be found here. The source text chunks corresponding to vectors are retained in the database and it is this source text that augments the prompt after the closest neighbor vectors are calculated.
Using OpenAI GPT LLMs
Generative Pre-trained Transformer (GPT) refers to a family of LLMs created by OpenAI that are built on a transformer architecture. The specific OpenAI model used in this model is not necessarily the latest, premium model, GPT-4o and GPT-4 Turbo are more recent, however the utilized gpt-35-turbo model is a good intersection of price versus performance and has been used extensively in deployed projects.
With the retriever function helping to build an augmented prompt, the default use case documented included three text chunks to buttress the original query. The OpenAI prompt response will not only be infused with the provided content extracted from the customer but unlike normal GPT responses, RAG will have specific attributions to which documents and specific paragraphs led to the response.
Brief Overview of Microsoft OpenAI Service Setup
Microsoft Azure has a long history of adding innovative new functions as subscribed “opt in” service resources, the Azure OpenAI Service is no different. A thorough, step-by-step guide to setting up the OpenAI service can be found here.
This screenshot demonstrates the rich variety of OpenAI models available within Azure, specifically showing the Azure OpenAI Studio interface, highlighting models such as gpt-4, gpt-4o and dall-e-3.
In this article, two models are added, one embedding and the other GPT. The following OpenAI Service Resource screen shows the necessary information to actually use our two models. This information consists of the keys (use either KEY1 and KEY2, both can be seen and copied with the Show Keys button) and the unique, per customer endpoint path, frequently referred to as the base URL by OpenAI users.
Perhaps the key Azure feature that empowers this article is the ability to disable network access to the configured OpenAI model, as seen below.
With traditional network access disabled, we can then enable private endpoint access and set the access point to a network interface on the private subnet connected to the inside interface of our F5 Distributed Cloud CE node.
The following re-visits the earlier topology diagram, with focus upon where the Azure OpenAI service interacts with our F5 Distributed Cloud multicloud network.
The steps involved in setting up an Azure site in F5 Distributed Cloud are found here. The corresponding steps for configuring an on-premises Distributed Cloud site are found in this location. Many options exist, such as using KVM or a bare metal server, the link provided highlights the VMware ESXi approach to on-premises site creation.
Demonstrating RAG in Action using OpenAI Models with a Secure Private Endpoint
The RAG setup, in lieu of vectorizing actual private and sensitive documents, utilized the OpenAI embedding LLM to process chunks taken from the classic H.G. Wells 1895 science fiction novel “The Time Machine” in text or markdown format. The novel is one of many in the public domain through the Gutenberg Project. Two NFS folders supported by the NetApp ONTAP appliance in a Redmond, Washington office were used: one for source content and one for supporting the ChromaDB vector database. The NFS mounts are seen below, with the Megabytes consumed and remaining available seen per volume, the ONTAP address can be seen as 10.50.0.220.
(Linux Host) #df -h
10.50.0.220:/RAG_Source_Documents_2024 1.9M 511M 1% /mnt/rag_source_files
10.50.0.220:/Vectors 17M 803M 3% /home/sgorman/langchain-rag-tutorial-main/chroma2
The creation of the vector database was handled by one Python script and the actual AI prompts generated against the OpenAI gpt-35-turbo model were housed in another script. This may often make sense, as the vector database creation may be an infrequently run script, only executed when new source content is introduced (/mnt/rag_source_files) whereas the generative AI tasks targeting gpt-3.5-turbo are likely run continuously for imperative business needs like helpdesk or code creations, as example purposes.
Creating the vector database first entails preparing the source text, typically remove extraneous formatting or less than valuable text fields, think of boilerplate statements such as repetitive footnotes or perhaps copyright/privacy statements that might be found on every single page of some corporate documents. The next step is to create text chunks for embedding, the tradeoff of using too short chunks will be lack of semantic meaning in any one chunk and a growth in the vector count.
Using overly long chunks, on the other hand, could lead to lengthy augmented prompts sent to gpt-35-turbo that significantly grow the token count for requests, although many models now support very large token counts a common value remains a total, for requests and responses, of 4,096 tokens. Token counts are the foundation for most billing formulae of endpoint-based AI models.
Finally, it is important to have some degree of overlap of generated chunks such that meanings and themes within documents are not lost; if an idea is fragmented at the demarcation point of adjacent chunks the model may not pickup on its importance.
The vectorization script for “The Time Machine” resulted in 978 chunks being created from the source text, with character counts per chunk not to exceed 300 characters. The text splitting function is loaded from LangChain and the pertinent code lines include:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=300,
chunk_overlap=100,
}
The values of 100 characters of overlaps suggests each chunk will incorporate 200 characters of new text within the total of 300 grabbed. It is important to remember all characters, even white space, count towards totals. As per the following screenshot, the source novel, when split into increments of 200 new characters per chunk does indicate 978 chunks were indeed a correct total (double click to expand).
With the source data vectorized and secure on the NetApp appliance, the actual use of the gpt-35-turbo OpenAI model could commence. The following shows an example, where the model is instructed in the system prompt to only respond via information it can glean from the RAG augmented prompt text, the response portions shown in red font.
python3 create99.py “What is the palace of green porcelain?”
<response highlights below, the response also included the full text chunks RAG suggested would potentially support the LLM in answering the posed question>
Answer the question based on the above context: What is the palace of green porcelain?
Response: content='The Palace of Green Porcelain is a deserted and ruined structure with remaining glass fragments in its windows and corroded metallic framework.' response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 175, 'total_tokens': 200}, 'model_name': 'gpt-35-turbo',
The response of gpt-35-turbo is correct, and we see that the token consumption is heavily slanted towards the request (the “prompt”), with 175 tokens used whereas the response required only 25 tokens. The key takeaways are that the prompt and its response did not travel hop-by-hop over the Internet to a public endpoint, all traffic traveled with VPN-like security from the on-premises server and ONTAP to a private Azure subnet using F5 Distributed Cloud. The OpenAI model was utilized as a private endpoint, corresponding a network interface available only on that private subnet and not found within the global DNS, only the private corporate DNS or /etc/hosts files.
Adding Laser Precision to RAG
Using the default chunking strategy did lead to sub-optimal results, when ideas, themes and events were lost across chunk boundaries, even when including some degree of overlap. The following is one example:
- A key moment in the H.G. Wells book involves the protagonist meeting a character Weena, who provides strange white flowers which are pocketed. Upon returning to the present time, the time traveler relies upon the exotic and foreign look of the white flowers to attempt to prove to friends the veracity of his tale.
# python3 query99.py “What did Weena give the Time Traveler?”
As captured in the response below, the chunks provided by RAG do not provide all the details, only that something of note was pocketed, but gpt-35-turbo can therefore not return a sufficient answer as the full details are not provided in the augmented prompt. The screenshot shows first the three chunks and at the end the best answer the LLM could provide (double click to expand).
The takeaway is that some effort will be required to adjust the vectorization process to pick optimally large chunk sizes, and sufficient numbers to properly empower the OpenAI model. In this demonstration, based upon vectors and their corresponding text, only three text chunks were harnessed to augment the user prompt. By increasing this number to 5 or 10, and increasing each of the chunk sizes, all of course at the expense of token consumption, one would expect more accurate results from the LLM.
Summary
This article demonstrated a more secure approach to using OpenAI models as a programmatic endpoint service in which proprietary company information can be kept secure by not using the general purpose, insecure Internet to provide prompts for vectorization and general AI inquiries. Instead, an approach was followed where the Azure OpenAI service was deployed as a private endpoint, exclusively available at an address on a private subnet within an enterprise’s Azure subscription, a subnet with no external access.
By utilizing F5 Distributed Cloud Multicloud Networking, existing corporate locations and data centers can be connected to that enterprise’s Azure resource groups and private, encrypted communications can take place between these networks, the necessary routing and tunneling technologies are deployed in a turn-key manner without requiring advanced network skillsets.
When leveraging NetApp ONTAP as the continued enterprise storage solution, RAG deployments based upon Azure OpenAI service can continue to be managed and secured with well-developed storage administration skills. In this example, ONTAP housed both the source, sensitive enterprise content and the actual vector database resulting from interactions with the Azure OpenAI embedding LLM. Subsequent to a discussion on vectors and optimal chunking strategies, RAG was utilized to answer questions on private documents using the well-known OpenAI chat-35-turbo model.