F5 NGINX Gateway Fabric - the One Gateway for AI-Powered Applications
AI-powered applications are increasingly making their way into production Kubernetes environments. Whether it’s an internal chatbot, a customer-facing assistant, or an autonomous agent workflow, these applications share a common architectural pattern where traffic flows through multiple layers:
- A frontend serving the UI,
- A backend orchestrating API calls, and
- One or more LLM inference endpoints underneath.
While each layer has different requirements, F5 NGINX has a long history of addressing each of them effectively. In this article, I will walk through how a single NGINX Gateway Fabric instance can serve all the layers of an AI-powered application.
A Single Gateway for the Entire Stack
The reference architecture below illustrates a typical LLM-powered chatbot application running on Kubernetes, with NGINX Gateway Fabric as the single data-plane entry point across all layers.
The full source, including Kubernetes manifests and setup scripts, is available at this GitHub repository (leonseng/ngf-agentic-reference-stack).
NGINX Gateway Fabric is configured via the Kubernetes Gateway API - the next generation of the Kubernetes Ingress API, which brings several improvements. In this example, a single Gateway resource declares two listeners: one for client-facing traffic (frontend and backend), and one for internal LLM inference traffic. HTTPRoute resources for the frontend, backend and LLM simulator bind each application component to the appropriate listener.
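A minimal sketch of that configuration is shown below - resource names, hostnames and ports are illustrative, not necessarily those used in the repository:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: chatbot-gateway
spec:
  gatewayClassName: nginx
  listeners:
    - name: client            # frontend and backend API traffic
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: chatbot-tls
    - name: inference         # internal LLM inference traffic
      protocol: HTTP
      port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: frontend
spec:
  parentRefs:
    - name: chatbot-gateway
      sectionName: client     # bind this route to the client listener
  hostnames:
    - chatbot.example.com
  rules:
    - backendRefs:
        - name: frontend-svc
          port: 80
```

The `sectionName` field is what ties each HTTPRoute to a specific listener, keeping client-facing and inference traffic cleanly separated on one Gateway.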
An InferencePool resource from the Gateway API Inference Extension project is also present to help NGINX Gateway Fabric select the optimal LLM pod based on the load each pod is experiencing.
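As a sketch, an InferencePool and the HTTPRoute that targets it might look like the following. This uses the v1alpha2 API of the Gateway API Inference Extension; field names may differ in later releases, and the pod labels, ports and EPP service name are assumptions:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
spec:
  selector:
    app: llm-sim              # label on the LLM serving pods
  targetPortNumber: 8000      # port the model server listens on
  extensionRef:
    name: llm-pool-epp        # Endpoint Picker (EPP) service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-inference
spec:
  parentRefs:
    - name: chatbot-gateway
      sectionName: inference
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool   # route to the pool instead of a Service
          name: llm-pool
```

Note that the HTTPRoute's backendRef points at the InferencePool rather than a regular Service, which is what lets the EPP pick the optimal pod per request.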
One NGINX Gateway Fabric instance. No additional proxies at any layer.
What NGINX Gateway Fabric Brings to Each Layer
Layer 1: Reverse Proxy
NGINX Gateway Fabric serves the chatbot frontend in its most familiar role - routing HTTP traffic to the appropriate upstream service. In addition, it provides several capabilities without requiring changes to the frontend application itself:
- TLS termination: offload TLS at the gateway, so the frontend pod serves plain HTTP internally while clients always connect over HTTPS.
- SSO/OIDC authentication: protect the UI behind an identity provider, ensuring only authenticated users can reach the chatbot.
- Header manipulation: inject user identity headers, derived from the authenticated session, into downstream requests so backend services can trust and act on them without re-authenticating.
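Header manipulation is expressed with a standard Gateway API `RequestHeaderModifier` filter. The sketch below is illustrative: the header name is hypothetical, and the static value stands in for one that would, in a real deployment, be populated from the authenticated OIDC session by the gateway's auth policy:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: frontend
spec:
  parentRefs:
    - name: chatbot-gateway
  rules:
    - filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            set:
              - name: X-User-Email      # hypothetical identity header
                value: placeholder      # in practice, derived from the session
      backendRefs:
        - name: frontend-svc
          port: 80
```

Because `set` overwrites any value sent by the client, this also prevents callers from spoofing the identity header.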
Layer 2: API Gateway
The backend API layer is where NGINX Gateway Fabric starts to shine as an API gateway. With the ability to route traffic at the HTTP layer based on hostnames, URI paths and HTTP headers, NGINX Gateway Fabric can direct traffic to multiple upstream services from a single entry point - for example, the chatbot frontend on one hostname and the OpenAI-compatible backend API on another. As the application grows, additional routes and backends can be attached to the same Gateway, keeping the data plane lean.
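An additional HTTPRoute attached to the same Gateway is all it takes to expose the backend API on its own hostname. Again, the hostnames, service names and ports below are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: backend-api
spec:
  parentRefs:
    - name: chatbot-gateway
      sectionName: client
  hostnames:
    - api.example.com           # backend API on its own hostname
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1          # OpenAI-compatible API paths
      backendRefs:
        - name: backend-api-svc
          port: 8000
```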
This is also the layer where security capabilities such as JWT authentication and rate-limiting can be applied to protect the backend API.
Layer 3: LLM Inference Gateway
Today, NGINX Gateway Fabric's native support for the Gateway API Inference Extension, together with an Endpoint Picker (EPP), enables advanced load-aware load balancing based on LLM-specific metrics, optimizing how traffic is distributed across LLM serving pods.
As the project matures, we expect to see more capabilities such as model-aware routing and canary deployments of new models, all delivered through standard Kubernetes Gateway API primitives, without custom proxy code or a separate inference routing layer.
What's Next
Traffic routing is the foundation of all applications, including AI-powered applications. For platform teams managing LLM-powered applications on Kubernetes, NGINX Gateway Fabric offers a single, operationally lean control point across every layer of the stack, from serving the frontend to intelligently routing inference traffic.
If you would like to get some hands-on experience with the above, a step-by-step deployment guide is available at the GitHub repository (leonseng/ngf-agentic-reference-stack).