
AI Quickstart: A Ready-to-Run Reference for Securing LLM Inference on OpenShift AI with F5 XC WAAP


The F5 + Red Hat AI Security Quickstart is now live in the Red Hat AI catalog — a ready-to-run reference for securing LLM inference endpoints on OpenShift AI with F5 Distributed Cloud WAAP.

Picture a financial services team that just wrapped up a successful pilot of an AI assistant for their underwriting desk. The assistant is grounded in the firm’s own documents — policy manuals, risk frameworks, compliance filings — using Retrieval-Augmented Generation. Business stakeholders are asking when it goes into production.

Then the security review lands. The LLM inference endpoint has no WAF policy, no schema enforcement, no rate limiting. A pen tester sends a request with an embedded <script> tag in the chat payload. It goes straight through to the model. She sends it 500 times in a minute. Every request succeeds. She queries /v1/version — an internal operational endpoint never meant to be external — and it responds with model metadata.

The rollout is paused. The AI team is frustrated. The security team is not wrong.

This is the scenario we built the F5 API Security AI Quickstart to solve — and why we partnered with Red Hat to publish it in the Red Hat AI Quickstarts catalog.


What the Quickstart Deploys

The quickstart stands up a complete, secured AI stack on Red Hat OpenShift AI in two steps — a Helm install for the RAG stack, and an Ansible playbook for the F5 XC security layer.

RAG Stack (Helm) — make install deploys all four components together:

  • LlamaStack (port 8321) — AI orchestration framework providing an OpenAI-compatible API surface for chat completion, embeddings, and model management
  • LLM Service (vLLM) — GPU-accelerated model inference; supports Llama-3.2-3B-Instruct by default, with pre-configured options up to Llama-3.3-70B
  • Streamlit UI (port 8501) — chat interface for interacting with the RAG assistant and uploading documents
  • PostgreSQL + pgvector — vector database for semantic document retrieval; documents are embedded at ingestion and retrieved by similarity at query time
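Because LlamaStack exposes an OpenAI-compatible API, the deployed stack can be exercised with a plain HTTP POST once it is up. A minimal sketch of building and sending a chat completion request — the service hostname and default model name here are illustrative assumptions, not values pinned by the quickstart:

```python
import json
import urllib.request

# Assumed in-cluster service URL; port 8321 is LlamaStack's default.
LLAMASTACK_URL = "http://llamastack:8321/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "llama-3.2-3b-instruct") -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send_chat(prompt: str) -> dict:
    """POST the payload to LlamaStack and return the parsed JSON response."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        LLAMASTACK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In the secured deployment, this same request transits the F5 XC Customer Edge before it ever reaches the inference runtime.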

Security Layer (Ansible) — deploys the F5 XC Customer Edge as pods inside the cluster:

  • The CE connects to the F5 Distributed Cloud backbone and discovers cluster services via the Kubernetes API
  • All traffic to LlamaStack APIs passes through the CE before reaching the inference runtime or vector database
  • Kernel configuration (HugePages) and storage validation are automated as part of the same playbook — not manual prerequisites

Figure 1: F5 XC Customer Edge co-located inside OpenShift, protecting LlamaStack APIs and model servers. All user traffic passes through the CE before reaching inference.

The key design decision: the F5 XC Customer Edge runs inside the cluster, not in front of it as an external appliance. It enforces WAF, API specification, and rate-limiting policies within the cluster's network fabric — no traffic tromboning, no added round-trip for large LLM payloads.


Three Security Use Cases, Hands-On

The quickstart ships with an end-to-end security testing guide covering three use cases. They are a starting point that demonstrates the fundamentals while only scratching the surface of what F5 Distributed Cloud WAAP can do across its full feature set: bot defense, client-side protection, DDoS mitigation, service policies, and more.

Use Case 1 — WAF: Block injection attacks
Simulate an XSS payload inside a chat completion request body. Watch it pass through unprotected. Enable an F5 XC WAF policy, re-run the same request, and see it blocked — with the event logged in Security Analytics.

Use Case 2 — API Specification Enforcement: Kill shadow APIs
Hit /v1/version — an internal endpoint never meant to be external. Upload the approved OpenAPI spec to F5 XC and enable API Inventory enforcement. Re-run the same request: 403 Forbidden. Anything not in the spec is blocked by default.
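API Inventory enforcement is a positive-security model: only operations present in the uploaded spec are allowed, and everything else is denied. The decision logic reduces to an allow-list lookup, sketched below — the paths are illustrative, and the real check happens inside F5 XC against the full OpenAPI document, not a hand-built set:

```python
# Illustrative allow-list derived from an approved OpenAPI spec.
APPROVED_OPERATIONS = {
    ("POST", "/v1/chat/completions"),
    ("POST", "/v1/embeddings"),
}

def enforce_spec(method: str, path: str) -> int:
    """Return the status a positive-security gateway would emit:
    200 pass-through for documented operations, 403 for everything else."""
    if (method.upper(), path) in APPROVED_OPERATIONS:
        return 200
    return 403  # not in the spec -> blocked by default
```

The shadow endpoint never has to be enumerated or explicitly denied; its absence from the spec is the deny rule.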

Use Case 3 — Rate Limiting: Protect GPU compute
Fire 20+ requests per minute from a single client — all succeed. Configure a per-client rate limit in F5 XC. Requests beyond the threshold return 429 Too Many Requests. GPU compute is now a protected, metered resource rather than something any single client can saturate.
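Per-client limiting of the kind configured here is easy to reason about as a sliding window over recent request timestamps. A minimal sketch — the 20-per-minute threshold mirrors the exercise, while F5 XC's actual enforcement is its own distributed implementation:

```python
import time
from collections import defaultdict, deque
from typing import Optional

class PerClientRateLimiter:
    """Sliding-window limiter: at most `limit` requests per `window` seconds,
    tracked independently for each client identifier."""

    def __init__(self, limit: int = 20, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client_id -> timestamps of allowed hits

    def check(self, client_id: str, now: Optional[float] = None) -> int:
        """Return 200 if the request is allowed, 429 if over budget."""
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        while q and now - q[0] >= self.window:  # expire hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return 429  # Too Many Requests
        q.append(now)
        return 200
```

The key property matches the use case: the limit is scoped per client, so one noisy consumer gets 429s while everyone else's inference traffic is untouched.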


Why This Matters Beyond the Demo

LLM inference endpoints are HTTP APIs. They accept free-form text inside JSON and process it without validation — including injection payloads, shadow API calls, and unconstrained request floods. The "it's only accessible inside the cluster" assumption breaks down faster than most teams expect: a misconfigured route, a developer exposing Swagger to test something, a service mesh misconfiguration. I've seen each of these in production-bound deployments over the past year.

The financial services framing is illustrative, not exclusive. Healthcare organizations protecting PHI, government agencies running citizen-facing AI, any enterprise with a private LLM and sensitive internal data — the same gaps exist, and the same architecture closes them.

What makes this a reference architecture rather than just a demo: the security controls are native to the deployment. The Ansible playbook that deploys the F5 XC Customer Edge is part of the infrastructure bring-up — not a post-go-live addition. Security teams get WAF coverage, API schema enforcement, and rate limiting in place before the first production request is ever sent.


Ready to try it?

Browse the quickstart in the Red Hat AI Quickstarts catalog — deployment instructions, security use case guides, and architecture docs all in one place.

Clone the repo and run through the three use cases: github.com/rh-ai-quickstart/f5-api-security

This quickstart was developed in collaboration with the Red Hat AI team and is part of the Red Hat AI Quickstarts catalog.

Published Mar 26, 2026
Version 1.0