Run AI LLMs Centrally and Protect AI Inferencing with F5 Distributed Cloud API Security
The art of implementing large language models (LLMs) is quickly transitioning from early-adoption investigations to business-critical, production-ready offerings. Take just one example: human help desk operators facing immediate, pressing customer issues. Modern helpdesk software packages can today be augmented so that rich LLM inferencing occurs programmatically in back-end networks, drawing on an understanding of items like corporate return policies and part-number equivalencies to deliver creative and tactical advice to the human operator within seconds.
LLMs will increasingly be run on corporate compute, under the purview of enterprise DevOps teams, rather than consumed solely as cloud SaaS, where reachability and sound security practice fall on a third party's shoulders.
This article speaks to hands-on experience with RESTful API-driven LLM inferencing, using common technologies including Python, FastAPI, and PyTorch, along with an LLM whose binaries are quickly downloaded from Hugging Face, the world's largest purveyor of open source, fine-tuned LLMs. Other models from Hugging Face were examined, such as TinyLlama and Llama-2 variants; the options are almost limitless. There are also more turn-key approaches to running your own LLMs, such as Ollama or LLM Studio, which also offer the possibility of API access.
In the end, the goal was to focus on one open-source LLM and a lowest-common-denominator approach to LLM hosting, based upon the simplest Python libraries and frameworks. As this style of hosted AI consumption moves into production, enterprise-grade security becomes a requirement, including rich analysis and enforcement around the API transactions.
The solution harnessed to achieve a safe and performant end state is F5 Distributed Cloud (XC): App Connect for secure web service publishing through a distributed load balancer, and the API Security module applied to that load balancer. The latter, part of the overall WAAP feature set, offers modern capabilities like API response validation, API rate limiting to guard against rogue users, and PII rules to alert on AI traffic conveying sensitive data.
A key aspect of this investigation is how repeatable the setup is; it is not a bespoke, customized deployment. For instance, Hugging Face offers thousands of LLMs that could be swapped into the Ubuntu server in use. The F5 XC deployment can make the API service reachable to, say, specific enterprise locations, specific cloud tenants, or the entirety of the Internet. The result is powerful security implemented with simple design choices, dictated only by how one wishes the LLM knowledge to be consumed.
F5 Distributed Cloud App Connect and LLM Setup
The first step in demonstrating F5 XC as an instrument to securely deliver LLM services is to understand the topology. As depicted, the LLM was located in a data center in Redmond, Washington, and attached to the inside network of an XC customer edge (CE) node. The CE node automatically connects through redundant encrypted tunnels to geographically close regional edge (RE) nodes (Seattle and San Jose, CA). The DNS name for the LLM service is projected into the global DNS infrastructure; through XC’s use of anycast, clients on the Internet will see their API LLM traffic gravitate conveniently to the closest RE site.
Exposure of the LLM service to an audience of the enterprise's choosing is based upon the "distributed" load balancer. This service lets one publish application reachability in a highly controlled manner, from DNS or Kubernetes services in one specific building/VPC/VNet at one extreme, all the way to the totality of the Internet, as in this use case. F5 XC solutions that publish services through load balancers are empowered by the XC "App Connect" module, one of a suite of modules available in the platform.
The distributed HTTP load balancer for this deployment safely funneled traffic to an origin pool in Redmond, Washington, consisting of one server running the LLM. The services are reflected in the following revised service diagram.
The LLM Environment Described
To operate an LLM on an enterprise's own compute platforms, the solution will typically be underpinned by a Linux distribution like Ubuntu with support for the Python 3 programming language. The key Python libraries and frameworks used to operate the LLM in this case included PyTorch, LangChain, and FastAPI. The preponderance of current LLM application notes pertains to inferencing through web interfaces, such as a chatbot-style interface; the most prevalent Python library supporting this interactive web experience is Streamlit.
The design choice for this investigation was instead to take a RESTful (REST) API approach to inferencing, as this is likely a significant growth area as AI enters production environments. Various frameworks exist to supplement web-based services with an API interface, such as Flask or Django; however, FastAPI was selected in this case as it is extremely popular and easy to set up.
Finally, a representative LLM had to be chosen, a decision that aimed for modest resource requirements in terms of binary size, memory consumption, and the ability to generate content with only a virtualized multi-core CPU at its disposal. Using Hugging Face, a leading repository of open source LLMs, the following LLM was downloaded and installed: LaMini-Flan-T5-77M, which is trained with 77 million parameters and was originally derived by fine-tuning the LLM Google/Flan-T5-Small.
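To make the hosting approach concrete, the following is a minimal sketch of how such a service can be stood up with FastAPI and the Hugging Face transformers library. The model identifier shown (MBZUAI/LaMini-Flan-T5-77M), the /lamini path, the prompt query parameter, and the generation settings are illustrative assumptions rather than the exact service code used in this exercise.

# Minimal sketch: serving LaMini-Flan-T5-77M behind a FastAPI GET endpoint.
# The model identifier, query-parameter name, and generation settings are
# assumptions for illustration only.
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI(title="Hosted LLM API")

# Load the fine-tuned T5 model once at startup; a virtualized multi-core CPU
# is sufficient for a 77-million-parameter model.
generator = pipeline("text2text-generation", model="MBZUAI/LaMini-Flan-T5-77M")

@app.get("/lamini")
def infer(prompt: str) -> dict:
    # Run a single inference and return the generated text as JSON.
    result = generator(prompt, max_length=256)
    return {"response": result[0]["generated_text"]}

Served by a standard ASGI runner (for example, uvicorn main:app --host 0.0.0.0 --port 8000), this exposes a single GET endpoint at <FQDN>/lamini that curl or Postman clients can query.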
The LLM inferencing, with XC in place, was conducted with Curl and Postman as the API clients. The following demonstrates a typical inference engaging the Redmond LLM, in this case with Postman, from a client in eastern Canada (double click image to expand).
The LLM used was useful for producing a test bed; however, the results varied in terms of accuracy. The truly generative aspects of AI, the much-discussed transformer use case, performed satisfactorily. When asked via API to "Please create a simple joke suitable for an eight-year-old child", it rose to the challenge with the acceptable "Why did the tomato turn red? Because it saw a salad dressing!"
Fact-oriented inquiries, however, were often less than stellar, likely attributable in some part to the relatively small number of parameters in this LLM, 77 million, as opposed to billions. When asked "Who is Barack Obama?" the response correctly indicated a former president of the United States, but all ancillary details were wrong. Asked who Nobel Prize-winning John Steinbeck was and why he was famous, the response was incorrect, describing a musical prodigy rather than the internationally known author.
Leveraging F5 Distributed Cloud API Security: Protected and Performant Outcomes
The value of surrounding LLM inferencing with the F5 XC solution includes security "at the front door": API security features are implemented at the RE edge/load balancer and thus filter traffic, when required, before delivery to the customer edge/data center.
One of the foundational pieces of API Security with XC is the ability to move toward a positive security model while allowing a "fall through" mode that both delivers and directs attention toward traffic targeting API endpoints that do not fall within the expected OpenAPI Specification (OAS) traffic definition. A positive security model allows known good traffic through a solution and strives to block everything else. However, to avoid unexpected application breakage, such as when one team updates application software but the new API documentation is delayed by, say, a few days, it is often better for operations teams to be alerted to new traffic flows and to throttle them via rate limiting, as opposed to outright blocking the traffic and entirely breaking the customer experience.
Towards an API Positive Security Approach using F5 Distributed Cloud
A process followed in the exercise was to allow API traffic to flow unfettered for a period of time, a day in this case, to perform an initial discovery of things like API endpoints and HTTP methods in use. After this period, an OAS specification, historically often referred to as a Swagger file, can be saved by the operator and then immediately reloaded as the "gold standard" for permitted traffic. This becomes the "Inventory" of expected API traffic.
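As a complement, if the origin service itself is built on FastAPI, a starting-point OAS document can also be exported directly from the application and compared against what XC learns from live traffic. The sketch below is illustrative only; the /lamini path and handler mirror the earlier example and are assumptions.

# Illustrative only: FastAPI applications can emit their own OpenAPI (OAS)
# document, useful for comparison with the specification XC learns from
# observed traffic. The /lamini path and handler are assumptions.
import json
from fastapi import FastAPI

app = FastAPI(title="Hosted LLM API")

@app.get("/lamini")
def infer(prompt: str) -> dict:
    ...  # inference handler as sketched earlier

# app.openapi() returns the generated OAS document as a Python dictionary.
with open("lamini_oas.json", "w") as f:
    json.dump(app.openapi(), f, indent=2)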
When further, unexpected traffic is experienced, the XC API discovery pane lists it as "Shadow" API traffic. The operator is directed to the offending live API endpoint, and the traffic can be blocked (HTTP 403 Forbidden) or, the often more palatable option, rate-limited (HTTP 429 Too Many Requests). Working through the numbered annotations, one can see an example of an hour of API traffic and how quickly the operator can spot the divergence of actual traffic from the OAS/Swagger definition. Shadow API traffic can be blocked or rate-limited by clicking on the offending endpoint hyperlink (double click image to expand).
Hosted LLM Performance Monitoring with Distributed Cloud
An observation from operating a Hugging Face LLM on the server is that, as expected, inferencing in an out-of-the-box deployment will generally focus on one API endpoint; in this setup, the endpoint was <FQDN>/lamini. Probability distribution functions (PDFs) are available for the key performance metrics an operator would gravitate to, such as response latency in milliseconds, both mean and 95th percentile. The distribution of LLM request sizes is also interesting, as it reveals how users task the LLM, for example whether request sizes are excessively large.
The charts populate once a critical mass of network traffic hits the LLM/API endpoint. The following is a representative view of key performance metrics, taken from another API endpoint that XC supports through a persistent traffic generator.
Protect the Financial Viability of Your Hosted LLM Service through XC API Rate Limiting
To monetize an LLM service for a customer base, the provider likely has many reasons to control per-user inference loads. Examples include:
- Tiered pricing, for instance a no-charge tier of 10 queries (inferences) per hour, with rate limiting to stop any excess consumption, alongside a paid offering of, say, 200 queries per hour
- Elastic backend compute resources that expand to handle excessive inferencing load or time-of-day fluctuations, where rate limiting constrains the costs incurred by a handful of rogue users
By simply clicking on the hyperlink of a discovered API endpoint, in this example /llm012, we can specify the threshold of transactions to allow. In this case, five inferences within five minutes are accepted, after which HTTP 429 messages will be generated by the XC RE node serving the user.
The result after a burst of requests from Postman will look like the following. Notice that the specific F5 node where the HTTP load balancer in question has been instantiated is shown; in this case, the user is entering the XC fabric in Toronto, Ontario.
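From the consumer's perspective, a well-behaved API client should anticipate these 429 responses. The following is a minimal sketch of such handling; the host name, query parameter, and fixed back-off interval are illustrative assumptions, and no specific XC response headers are assumed.

# Minimal sketch of client-side handling for the HTTP 429 responses returned
# once the XC rate limit is exceeded. The host name, query parameter, and
# back-off interval are illustrative assumptions.
import time
import requests

def infer_with_backoff(prompt: str, retries: int = 3) -> dict:
    url = "https://llm.example.com/lamini"  # hypothetical FQDN
    for _ in range(retries):
        resp = requests.get(url, params={"prompt": prompt}, timeout=30)
        if resp.status_code == 429:
            # Rate-limited by the XC load balancer; pause before retrying.
            time.sleep(60)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Rate limit still in effect after retries")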
Validation of LLM AI Inferencing Responses
The world of RESTful APIs almost universally sees responses encoded in JSON notation. A key security feature, one not commonly available in the industry, is to monitor responses, not just requests, for conformance to rules set out by the API provider. For instance, a well-known and valid concern around LLMs is "jailbreaking", crafting a strategy to make an LLM produce response content it is normally prevented from providing. At a deeper, micro level, an API response itself may have ground rules; for example, perhaps JSON strings or numbers are permissible in responses, but JSON arrays are forbidden.
With XC API security, just as we can learn the API endpoints (URL and path) and HTTP methods (GET, POST, PUT, etc.), we can also detect the schema of transactions, including normal HTTP headers and bodies in the response path. In the following example, the screenshot shows that the solution has learned that, for API endpoint /llm014, the sample body should be an array value with integer members. To see this screen, one need only click on the hyperlink entry for API endpoint /llm014.
With this purely illustrative example in mind, an operator can simply enable API inventory validation, which is applicable to request traffic, response traffic, or both. To flag any violation of the value types expected in JSON responses, such as strings appearing where integers are expected, one may choose "Report", which creates security events for violations; alternatively, one can choose "Block" to outright prevent such responses from leaving the load balancer. As depicted below, the HTTP load balancer is named "aiservice1" and the operator has selected to receive security events (Report) should response bodies deviate from the learned schema.
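To make the response-validation concept concrete, the check XC applies at the load balancer is analogous to validating each response body against the learned schema. The sketch below illustrates the idea with the Python jsonschema library; it is not XC's implementation, and the schema simply mirrors the array-of-integers example learned for /llm014.

# Illustrative analogue of XC's response schema validation, expressed with the
# Python jsonschema library; this is not how XC implements the feature.
# The schema mirrors the learned example for /llm014: an array of integers.
from jsonschema import ValidationError, validate

learned_schema = {"type": "array", "items": {"type": "integer"}}

def response_conforms(body) -> bool:
    """Return True if the response body matches the learned schema."""
    try:
        validate(instance=body, schema=learned_schema)
        return True
    except ValidationError:
        # In XC terms, this is where a security event would be raised (Report)
        # or the response prevented from leaving the load balancer (Block).
        return False

print(response_conforms([1, 2, 3]))        # True: an array of integers
print(response_conforms(["1", "2", "3"]))  # False: strings violate the schema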
Detection of Sensitive Information in AI API Transactions
One of the most pressing concerns in network security is the undetected inclusion of sensitive information within network traffic, and LLM traffic is no exception. This may be personally identifiable information (PII), such as names and addresses, or it may simply be poor application design, where items like internal IP addresses are unknowingly exposed, for example through "X-" header values appended by middleware devices.
In the following example from the demonstration LLM hosting environment, an API endpoint is observed offering up credit card numbers in responses. The overall display at the top also shows the richness of detail available, including the most attacked and most active API endpoints.
Sensitive information detection in the XC API offering is quite flexible. There is a set of built-in pattern-recognition rules, but custom detectors can also be easily added to the HTTP load balancer using regex settings. Although a regex may on the surface seem challenging to create from scratch, simply using an Internet search engine can provide suggested rules for a wealth of potentially problematic values found in flight. As just one example, a quick search reveals the following as the expected format and corresponding regex settings for Canadian health care cards (OHIP numbers) in the province of Ontario. The alphanumeric structure is covered by three rules: with spaces, with dashes, and with no delimiters at all:
(?i:\b[0-9]{4} [0-9]{3} [0-9]{3}[A-Z]?\b)
(?i:\b[0-9]{4}-[0-9]{3}-[0-9]{3}[A-Z]?\b)
(?i:\b[0-9]{10}[A-Z]?\b)
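Before loading custom detectors such as these into the HTTP load balancer, the patterns can be sanity-checked locally. The following is a minimal sketch using Python's re module and clearly fabricated sample values (these are not real health card numbers):

# Quick local sanity check of the three OHIP regex patterns above, using
# fabricated sample values; these are not real health card numbers.
import re

ohip_patterns = [
    r"(?i:\b[0-9]{4} [0-9]{3} [0-9]{3}[A-Z]?\b)",  # space-delimited
    r"(?i:\b[0-9]{4}-[0-9]{3}-[0-9]{3}[A-Z]?\b)",  # dash-delimited
    r"(?i:\b[0-9]{10}[A-Z]?\b)",                   # no delimiters at all
]

samples = ["1234 567 890a", "1234-567-890A", "1234567890A"]

for pattern, sample in zip(ohip_patterns, samples):
    matched = bool(re.search(pattern, sample))
    print(f"{sample!r} matches {pattern!r}: {matched}")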
A number of free DLP test sites exist that can provide sample, dummy values to validate regex rules against, such as driver's license formats from around the world. The following screenshot shows an XC API endpoint that has been discovered with both built-in sensitive data types (credit card and IP address) and custom regex-based types (French social security numbers and mobile phone values).
Summary of Hosted LLM and Distributed Cloud API Security Findings
An observation from the LLM hosting exercise was the concentration of RESTful API calls on a single API endpoint using one HTTP method, in our case <FQDN>/lamini and HTTP GETs. Scenarios are expected where the number of API endpoints would grow, for instance if more models were downloaded from a source like Hugging Face and run concurrently, perhaps leveraging multiple Python 3 virtual environments on the server to accommodate the conflicting library version requirements of different LLMs.
The Distributed Cloud API Security module easily discovered nuances of the traffic, in both request and response directions, and allowed an overall API definition file (OAS/Swagger) to be generated with a single mouse click.
There are compelling security features available to protect your hosted LLM traffic, such as rapid detection of shadow, undocumented API endpoints and the ability to validate the accepted schema of payloads in both directions. User-specific rate limiting is a core feature that both thwarts rogue users and protects the monetary investment in LLM resources.
With API-enabled LLMs, built upon Python libraries such as FastAPI or Flask, security is readily imposed, including the customizations needed for issues like PII detection. With web client-oriented LLMs, such as chatbot interfaces enabled through libraries like Streamlit, XC offers an advanced bot detection and mitigation module. Since API and interactive web access to hosted LLMs can be enabled concurrently, the full breadth of the Distributed Cloud tools can be put to good use.