Advertise OpenShift AI inference servers from F5 Distributed Cloud

Introduction

This article describes how inference servers in OpenShift AI (KServe), hosted in a public cloud, a private cloud, or at the edge, can be securely anycast-advertised to the Internet using F5 Distributed Cloud (XC) deployed inside OpenShift clusters.

Red Hat, and by extension OpenShift AI, provides enterprise-ready, mission-critical Open Source software. In this article, the AI model is hosted in OpenShift AI's KServe single-model serving framework. OpenShift in AWS (aka ROSA) was used for the creation of this article, but OpenShift in any public or private cloud, at the edge, or a mix of these would work equally well. Once the model is available for serving in OpenShift, XC can advertise it globally. This only requires installing an in-cluster XC Customer Edge (CE) SMSv1 in OpenShift. The CE transparently connects to the closest Regional Edges (RE) of F5 XC's Global Anycast Network, exposing the VIP of the AI inference server in all F5 XC PoPs (IP anycast). This reduces latency to the customer, provides redundancy, and adds application security, including Layer 7 DDoS protection.

The overall setup can be seen in the next figure. The only F5 component that has to be installed is the CE. The REs are pre-existing in the F5 Global Anycast Network, and are used automatically as access points for the CEs. Connectivity between the CEs and REs happens through TLS or IPsec tunnels, which are automatically set up at CE deployment time without any user intervention.

Setup overview

The next sections will cover the following topics:

  • Traffic path overview
  • Setup of an inference service for a generative AI model using KServe and vLLM
  • Setup of F5 XC CE in OpenShift using SMSv1
  • Creation of a global anycast VIP in XC exposing the created inference server
  • Securing the inference service

Traffic path overview

Traffic flow overview (simplified)

The traffic flow is shown in the figure above, starting with the request towards the inference service:

  1. A request is sent to the inference service (e.g., inference.demos.bd.f5.com). DNS resolves this to an F5 XC anycast VIP address. Through Internet routing, the request reaches the VIP at the closest F5 XC Point of Presence (PoP).

  2. F5 XC validates that the request is for an expected hostname and applies any security policies to the traffic. F5 XC load balances towards the CEs where there are origin pools for the application. In this article, a single OpenShift cluster is used, but several clusters on different sites could have been used, all sharing the same VIP. The traffic is ultimately sent to the CE's designated RE.

  3. Traffic reaches the CE inside the OpenShift cluster through a pre-established tunnel (TLS or IPsec). 

  4. The CE has previously discovered the local origin servers through DNS service discovery within the OpenShift cluster. For this AI model, KServe deploys a Service of type ExternalName named vllm-cpu.vllm.svc.cluster.local. This is the recommended way to access a KServe AI model from workloads that are not part of the mesh, such as the CE component. The exact service name for the deployed model is reported in the OpenShift UI, as shown in the next figure:

    Retrieving the internal (local) inference endpoint for the deployed model

    The ExternalName target of vllm-cpu.vllm.svc.cluster.local (effectively a DNS CNAME) is kserve-local-gateway.istio-system.svc.cluster.local, the Service that exposes Istio's kserve-local-gateway Gateway, as the name indicates. This is shown next:

    % oc -n vllm get svc vllm-cpu
    NAME       TYPE           CLUSTER-IP   EXTERNAL-IP                                           PORT(S)   AGE
    vllm-cpu   ExternalName   <none>       kserve-local-gateway.istio-system.svc.cluster.local   <none>    4d1h

  5. The Customer Edge sends the traffic towards kserve-local-gateway's Service ClusterIP. This Service load-balances between the available Istio instances (in the next output, there is only one). A curl sketch at the end of this section reproduces this path manually.

    % oc -n istio-system get svc,ep kserve-local-gateway
    NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
    service/kserve-local-gateway   ClusterIP   172.30.6.250   <none>        443/TCP   5d6h

    NAME                             ENDPOINTS          AGE
    endpoints/kserve-local-gateway   10.131.0.12:8445   5d6h


  6. The CNI sends the traffic to the selected Istio instance.

  7. Istio ultimately sends the request to the AI model PODs.

Steps 6 and 7 have been simplified: within these steps there are activator, autoscaler, and queuing components, which are transparent to F5 XC. These components are set up automatically by KServe's Knative infrastructure and would add unnecessary complexity to the picture above. If you are interested in these steps, details can be found at this link.
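
To manually reproduce the path of steps 4 to 7 from inside the cluster, a request can be sent to the same local endpoint the CE uses. This is a minimal sketch, assuming a debug pod image with curl available and that the vLLM runtime exposes its OpenAI-compatible API; the -k flag is needed because the Istio gateway presents a self-signed certificate by default:

    # Run a throw-away pod and query the local inference endpoint (the same name the CE discovers)
    % oc run -n vllm curl-test --rm -it --restart=Never \
        --image=registry.access.redhat.com/ubi9/ubi -- \
        curl -sk https://vllm-cpu.vllm.svc.cluster.local/v1/models

A successful response lists the served model, confirming that the endpoint the CE discovers is reachable and answering.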


Setup of an inference service for a generative AI model using KServe and vLLM


To instantiate an AI model in KServe, three resources need to be created:

  • A Secret resource containing the credentials for the storage (Data Connection) to be used.

  • A ServingRuntime resource, which defines the model server, its image, and its parameters. The next resource references it. In the OpenShift AI UI, these are created using Templates.

  • An InferenceService resource that binds the previous resources together and actually instantiates the AI model. It specifies the minimum and maximum number of replicas, the amount of memory per replica, and the storage (Data Connection) to be used.

The AI model used as an example in this article uses a custom configuration for a vLLM CPU model (useful for PoC purposes). You can find another example of a vLLM CPU AI model at https://github.com/rh-ai-kickstart/vllm-cpu. The configuration is as follows:

The Data Connection used in this example, using S3:

Creation of the Secret for the storage (Data Connection)

The example ServingRuntime using a vLLM model for CPUs:

Creation of the ServingRuntime

The InferenceService (shown as Models and model servers in the UI) used in this example:

Creation of the InferenceService
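
The figures above show the OpenShift AI UI forms. The following is a minimal YAML sketch of the same three resources, applied with oc apply -f. It is not the exact configuration used for this article: the namespace, resource names, image, bucket, and sizing values are placeholders, and the exact ServingRuntime arguments depend on the vLLM image used.

    # kserve-vllm-cpu.yaml (sketch); apply with: oc apply -f kserve-vllm-cpu.yaml
    # Data Connection: S3 credentials for the model storage (placeholder values)
    apiVersion: v1
    kind: Secret
    metadata:
      name: storage-config
      namespace: vllm
    stringData:
      AWS_ACCESS_KEY_ID: <access-key>
      AWS_SECRET_ACCESS_KEY: <secret-key>
      AWS_S3_ENDPOINT: https://s3.amazonaws.com
      AWS_DEFAULT_REGION: us-east-1
      AWS_S3_BUCKET: <model-bucket>
    ---
    # ServingRuntime: the vLLM CPU model server, its image and parameters
    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: vllm-cpu-runtime
      namespace: vllm
    spec:
      supportedModelFormats:
        - name: vLLM
          autoSelect: true
      containers:
        - name: kserve-container
          image: <vllm-cpu-image>     # a CPU build of the vLLM server
          args:
            - --model=/mnt/models     # KServe mounts the model here
            - --port=8080
          ports:
            - containerPort: 8080
              protocol: TCP
    ---
    # InferenceService: binds the runtime and the Data Connection and instantiates the model
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: vllm-cpu
      namespace: vllm
    spec:
      predictor:
        minReplicas: 1
        maxReplicas: 2
        model:
          modelFormat:
            name: vLLM
          runtime: vllm-cpu-runtime
          resources:
            requests:
              memory: 8Gi
          storage:
            key: storage-config       # the Secret (Data Connection) above
            path: <path-to-model-in-bucket>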

Setup of F5 XC CE SMSv1 in OpenShift 


Please note that CE SMSv2 is not yet available for in-cluster Kubernetes deployments.

To deploy CE SMSv1 as PODs in the OpenShift cluster, follow these instructions on the F5 XC docs site.
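
After applying the manifests, it can be verified that the CE PODs are up before continuing. This is a sketch; ves-system is the namespace used by the standard CE manifests, the exact POD names depend on the manifest version, and, depending on the site token used, the site registration may also need to be accepted in the XC console before the CE can be used for origin pools:

    % oc -n ves-system get pods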


Creating a global anycast VIP in XC exposing the created inference server

This section creates the following objects, in the given order:

  1. Create an HTTP/2 health check for the AI model.
  2. Create an Origin Pool for the AI model and attach the created health check to it.
  3. Create a VIP specifying Internet advertisement and attach the created Origin Pool to it.

These steps are described next in detail.

Log in to cloud.f5.com and go to "Multi-Cloud App Connect". All the configuration takes place in this section of the UI.

XC console


In Manage >> Load Balancers >> Health Checks, create a new HTTP/2 health check indicating a path that can be used to test the AI model; in this example, this is "/health". The whole configuration is shown next:

Health check configuration
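
The /health path can be checked beforehand against the in-cluster endpoint. This is a sketch under the same assumptions as the earlier curl example; the -w option prints the status code and the negotiated HTTP version (2 if both the curl build and the gateway negotiate HTTP/2 via ALPN):

    % oc run -n vllm health-test --rm -it --restart=Never \
        --image=registry.access.redhat.com/ubi9/ubi -- \
        curl -sk -o /dev/null -w '%{http_code} %{http_version}\n' \
        https://vllm-cpu.vllm.svc.cluster.local/health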

In Manage >> Load Balancers >> Origin Pools, create a new pool where the servers are discovered using "DNS Name of Origin Server on given Sites" for the DNS name vllm-cpu.vllm.svc.cluster.local (from the Traffic path overview section) in the Outside network (the only one the CE PODs actually have). Attach the previously created health check and set the pool to not require TLS validation of the server. The latter is necessary because the internal Istio components use self-signed certificates by default. The configuration is shown next:

Origin Pool creation

In Manage >> Load Balancers >> HTTP Load Balancers, create a new Load Balancer, specify the FQDN of the VIP, and attach the previously created Origin Pool. In this example, an HTTP load balancer is used. XC automatically creates the DNS hosting and certificates.

HTTP Load Balancer Creation

At the very bottom of the HTTP Load Balancer creation screen, you can advertise the VIP on the Internet.

HTTP Load Balancer creation, VIP advertisement on the Internet
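
Once the VIP is advertised and DNS resolution is in place, the inference service can be tested end to end from anywhere on the Internet. This is a minimal sketch, assuming the runtime exposes vLLM's OpenAI-compatible API; the model name in the request body is a placeholder and must match the served model:

    % curl -s https://inference.demos.bd.f5.com/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "<served-model-name>", "prompt": "Hello", "max_tokens": 16}'

The request enters at the closest PoP, traverses the RE-CE tunnel, and is answered by the model PODs, following the traffic path described earlier.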

 

Securing the inference service

Once the inference service is exposed to the Internet, it makes available many API endpoints that we might not want to expose. Additionally, the service has to be secured against abuse and breaches. To address these concerns, F5 XC offers the following features for AI:

  1. Automated API Discovery & Posture Management:

    • Identify all inference endpoints automatically, eliminating hidden APIs.
    • Enforce schemas based on observed traffic to ensure requests and responses follow expected patterns.
    • Integrate “shift-left” security checks into CI/CD pipelines, catching misconfigurations before production.

  2. LLM-Aware Threat Detection & Request Validation:

    • Detect attempts to manipulate prompts or break compliance rules, ensuring suspicious requests are blocked before reaching the model.

  3. Bot Mitigation & Adaptive Rate Controls:

    • Differentiate between legitimate users and bots or scrapers, blocking automated attacks.
    • Dynamically adjust rate limits and policies based on usage history and real-time conditions, maintaining performance and reliability.

  4. Sensitive Data Redaction & Compliance:

    • Identify and mask PII or sensitive tokens in requests and responses.
    • Adhere to data protection regulations and maintain detailed logs for auditing, monitoring, and compliance reporting. 

All these security features are managed in the same XC console, where both the application delivery and security dashboards are centrally available. These provide analytics to monitor real-time metrics (latency, error rates, compliance indicators) and to continuously refine policies and adapt to emerging threats.

It is recommended to check this article's "F5 Distributed Cloud Capabilities in Action" section to see how to implement these.


Conclusion and final remarks

F5 XC can be used with Red Hat OpenShift AI in AWS or in any other public or private cloud. It makes it easy to expose an AI model to the Internet while providing security, preventing breaches and abuse of these models.

I hope this article has been an eye-opener to the possibilities of how F5 XC can easily and securely advertise AI models. This article shows how to advertise the AI model on the Internet; XC lets you advertise it in any private location just as easily. I would love to hear if you have any specific requirements not covered in this article.

Published Jun 18, 2025
Version 1.0