AI Inference for vLLM models with F5 BIG-IP & Red Hat OpenShift
Introduction to AI inferencing
AI inferencing is the stage where a pre-trained AI model uses its learned knowledge to make predictions, decisions, or generate outputs. AI inference requests are sent using the HTTP protocol, but traditional load balancing strategies are not optimal for these workloads. This is because:
- Regular HTTP workloads are fast, uniform, and cheap.
- LLM HTTP requests are slow, non-uniform, and very expensive, both computationally and economically (especially because of the use of AI accelerators such as GPUs and TPUs).
Because of this, it is worth spending cheap CPU cycles to improve the efficiency of the AI accelerators through precise load balancing. LLM requests vary significantly in computational demands due to prompt length, model differences, and previous outcomes, leading to unpredictable request running times.
This article shows how F5 BIG-IP can perform intelligent load balancing based on the request body and the state metrics of the inference servers, taking advantage of the JSON capabilities of BIG-IP v21 to parse the AI requests. This is showcased using Container Ingress Services (CIS) for Red Hat OpenShift and OpenShift's integrated Prometheus telemetry.
AI inferencing corresponds to the components and insertion point highlighted in green in the F5 AI Reference Architecture, shown next:
AI inference load balancing
It is common to perform traffic steering to different pools based on the request’s content (aka Body Based Routing), for example selecting a pool of inference servers with lower-end GPUs when the request is a batch request or when the number of tokens is large. We refer to this as Business Logic from now on.
In a second stage, an inference server from the selected pool is chosen. To make an optimal decision, state metrics are retrieved from the AI inference servers (vLLM) using Prometheus.
It is worth emphasising that, for this to work effectively, the BIG-IP needs to send the traffic directly to the Inference Servers. If the Inference Servers are behind an ingress controller or a NodePort, the load balancing decision cannot make effective use of the gathered metrics.
The next picture gives an overview of the solution:
You can see this solution in practice in the following video:
How it all works together
The Business Logic and the final Inference Server selection are implemented independently:
- The Business Logic is implemented by means of a generic iRule named InferenceBodyBasedRouting, which is configured by defining entries in a data group. That is, no iRules knowledge is needed to set up the desired behaviour.
- The selection of the Inference Server is done by BIG-IP’s ratio load balancing algorithm, which is continually biased by changing the ratio values every few seconds. The logic to gather the metrics from Prometheus, calculate the desired ratios, and set these in the BIG-IP is done by a controller Pod in OpenShift named LLM Load Controller, developed for this article. This controller works in parallel to CIS: while CIS creates the LLM Virtual Server and Pools, the LLM Load Controller modifies the CIS-created Pools and Members (a minimal sketch of this ratio update is shown at the end of this section).
The overall solution is depicted next:
All the files used in this solution can be found in this GitHub repository.
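For illustration, the ratio update performed by the LLM Load Controller boils down to a PATCH against the BIG-IP iControl REST API. The snippet below is a minimal sketch, not the actual controller code: the member address is hypothetical and the pool path is the CIS-generated one used later in this article.

```python
# Minimal sketch (not the actual llm-load-controller code): update the ratio of a
# BIG-IP pool member through the iControl REST API. Credentials come from the same
# kind of environment variables the controller Deployment uses.
import os
import requests

BIGIP = os.environ["BIGIP"]
AUTH = (os.environ["BIGIP_USER"], os.environ["BIGIP_PASS"])

def set_member_ratio(pool_path: str, member: str, ratio: int) -> None:
    """PATCH the ratio of one member of a CIS-created pool.

    pool_path is the full path (e.g. /vllm/Shared/<pool>); iControl REST
    expects '/' replaced by '~' in resource names.
    """
    pool_id = pool_path.replace("/", "~")
    member_id = member.replace("/", "~")
    url = f"https://{BIGIP}/mgmt/tm/ltm/pool/{pool_id}/members/{member_id}"
    resp = requests.patch(url, json={"ratio": ratio}, auth=AUTH, verify=False)
    resp.raise_for_status()

# Example: bias traffic towards a less loaded vLLM instance (hypothetical member name)
set_member_ratio("/vllm/Shared/highend_gpt_8000_vllm_inference_f5demo_com_highend_gpt",
                 "/vllm/10.128.2.15:8000", ratio=8)
```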
Details of the solution: Implementing Business Logic
The AI requests are encoded as JSON objects. The current version of InferenceBodyBasedRouting allows defining conditions for any element in the JSON object by specifying the JSON path of the element, an operator, and a value or substring (at present it uses TCL's string match) to compare it with. This allows creating Business Logic such as the one shown in the next logic diagram:
This sample diagram contains three conditions:
- Check if the model name matches "*GPT"; this is checked in the JSON string element ".model".
- Check if the AI request is interactive (e.g. a chat); this is checked in the JSON boolean element ".stream".
- Check if the system prompt instructs the AI to only answer questions related to a flight agent; this is checked in the JSON string element ".input[0].content[0].text".
As you can see from these conditions, it is paramount that BIG-IP is able to process JSON requests. Note from the diagram that the numbers in red are the Step IDs for each box. These IDs can be set arbitrarily and are used to transpose the diagram into the data group. The only action that can currently be indicated in the data group is selectPool which makes the AI MODEL selection. The resulting data group from the diagram above is shown next:
The iRule and this sample data group can be found in the business-logic folder of the repository.
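To make the conditions above more concrete, the following is a hypothetical AI request that would satisfy all three checks. The endpoint hostname and payload layout are assumptions for illustration, derived from the JSON paths used in the diagram (an OpenAI-style request with ``.model``, ``.stream`` and ``.input[0].content[0].text``); only the field values matter for the Business Logic.

```python
# Hypothetical inference request matching the three sample conditions.
# Hostname and payload layout are assumptions for illustration only.
import requests

request_body = {
    "model": "HighEndGPT",   # matches the "*GPT" pattern checked at ".model"
    "stream": True,          # interactive (chat-like) request, checked at ".stream"
    "input": [
        {
            "role": "system",
            "content": [
                {
                    "type": "input_text",
                    # system prompt checked at ".input[0].content[0].text"
                    "text": "You are a flight agent. Only answer questions about flights.",
                }
            ],
        },
        {
            "role": "user",
            "content": [{"type": "input_text", "text": "Find me a flight to Seville."}],
        },
    ],
}

# The BIG-IP virtual server parses this JSON body and, according to the data group,
# steers the request to the corresponding MODEL pool.
resp = requests.post("https://vllm-inference.f5demo.com/v1/responses", json=request_body)
print(resp.status_code)
```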
Details of the solution: Dynamic Load Balancing based on vLLM metrics
To develop this functionality, a vLLM simulator has been used. This exposes the same metrics as a real vLLM server and greatly facilitates development because it doesn’t use the resources (and an AI accelerator) of a regular vLLM instance. As with a real vLLM server, the vLLM instances are discovered using the Service resource. The Services are then monitored by Prometheus using the ServiceMonitor resource. The snippet to set up the monitoring for one of the vLLM pools is shown next:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: highend-gpt
  namespace: vllm
  labels:
    app: highend-gpt
spec:
  type: ClusterIP
  selector:
    name: highend-gpt
  ports:
    - port: 8000
      name: vllm
      targetPort: 8000
      protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: highend-gpt
  namespace: vllm
  labels:
    app: highend-gpt
spec:
  selector:
    matchLabels:
      app: highend-gpt
  endpoints:
    - port: vllm
      interval: 5s
      path: /metrics
```
After applying the manifests above, the instances will automatically appear in the OpenShift console once AI traffic is received.
In this solution, the metrics are consumed by an LLM load controller that has been created for this demo and can be found in the llm-load-controller directory.
The LLM load controller currently retrieves the following vLLM Prometheus metrics for each Inference Server in the given MODEL Pool:
- Number of requests in the waiting queue:
  ``vllm:num_requests_waiting{model_name="{MODEL}"}``
- KV cache utilization:
  ``vllm:gpu_cache_usage_perc{model_name="{MODEL}"}``
- Waiting time in the queue (90th percentile) during the last 15 seconds:
  ``histogram_quantile(0.9, sum by(le, instance) (rate(vllm:request_queue_time_seconds_bucket{model_name="{MODEL}"}[15s])))``
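As an illustration, such instant queries can be issued against a Prometheus-compatible HTTP API as sketched below. This is not the controller code itself; the token handling and TLS verification shown here are simplified (the controller goes through OpenShift's Thanos querier with a service account token, as described later in this article).

```python
# Minimal sketch: run an instant PromQL query against a Prometheus-compatible API.
# Token handling and TLS verification are simplified for illustration.
import requests

PROM_URL = "https://thanos-querier.openshift-monitoring.svc.cluster.local:9091"

def query_metric(promql: str, token: str) -> list:
    """Return the 'result' list of an instant PromQL query (one entry per instance)."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": promql},
        headers={"Authorization": f"Bearer {token}"},
        verify=False,  # demo only; in production use the in-cluster CA bundle
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Example: waiting-queue depth per vLLM instance for the HighEndGPT model
waiting = query_metric('vllm:num_requests_waiting{model_name="HighEndGPT"}',
                       token="<service account token>")
```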
This controller is a Python script that handles a single MODEL Pool. In order to support several MODEL Pools, as shown in the demo, an additional container is created for each additional MODEL Pool. This greatly simplifies the script and allows the settings of each MODEL Pool to be adapted independently. The next Deployment snippet shows how the llm-load-controller is deployed:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-load-controller
  namespace: vllm
spec:
  # The number of replicas has to be always 1
  replicas: 1
  selector:
    matchLabels:
      app: llm-load-controller
  template:
    metadata:
      labels:
        app: llm-load-controller
    spec:
      imagePullSecrets:
        - name: regcred
      serviceAccountName: llm-load-controller-sa
      containers:
        - name: highend-gpt
          image: $REGISTRY_PATH/llm-load-controller:latest
          imagePullPolicy: Always
          env:
            - name: BIGIP
              value: "$BIGIP_IP"
            - name: BIGIP_USER
              valueFrom:
                secretKeyRef:
                  name: bigip-login
                  key: username
            - name: BIGIP_PASS
              valueFrom:
                secretKeyRef:
                  name: bigip-login
                  key: password
            - name: POOL
              value: "/vllm/Shared/highend_gpt_8000_vllm_inference_f5demo_com_highend_gpt"
            - name: MODEL
              value: "HighEndGPT"
            - name: INTERVAL
              value: "2"
            - name: COEF_RQT
              value: "1"
            - name: COEF_NRW
              value: "1"
            - name: VERBOSITY
              value: "2"
```
The above Deployment file shows a single container handling the MODEL HighEndGPT; another container entry is needed for each additional MODEL (in the same Deployment file). See the sample ``llm-load-controller/llm-load-controller-deployment.yaml`` file for details.
Note that all parameters of the llm-load-controller are taken from environment variables. The most relevant parameters are the Pool name (automatically created by CIS) and the MODEL name for which to retrieve the metrics.
Some additional details:
- The script allows adding a weight to each metric when calculating the ratio (a possible weighting scheme is sketched below). Currently, vllm:gpu_cache_usage_perc is not being used.
- For this demo, Prometheus retrieves the metrics every five seconds and the LLM load controller samples these retrieved metrics every two seconds. These values can be customized.
- These Prometheus metrics are retrieved through Thanos, by default using the endpoint ``thanos-querier.openshift-monitoring.svc.cluster.local:9091``. Thanos requires authentication, which the llm-load-controller performs using the configured serviceAccountName.
For further details, check the top of the ``llm-load-controller/llm-load-controller.py`` script.
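To give an idea of how such a weighting could work, the snippet below sketches one plausible way to turn the per-instance metrics into BIG-IP ratio values, assuming COEF_RQT weighs the queue-time metric and COEF_NRW the requests-waiting metric. This is an illustration under assumptions, not necessarily the exact formula of ``llm-load-controller.py``.

```python
# Illustrative ratio calculation (not necessarily the controller's exact formula):
# instances with less queueing get a higher ratio, so BIG-IP's ratio load balancing
# sends them proportionally more new requests.
import os

COEF_RQT = float(os.environ.get("COEF_RQT", "1"))  # weight of the queue-time metric
COEF_NRW = float(os.environ.get("COEF_NRW", "1"))  # weight of the requests-waiting metric

def compute_ratios(metrics: dict, max_ratio: int = 10) -> dict:
    """metrics maps instance -> {"queue_time": seconds, "num_waiting": requests}."""
    # Higher score means more queueing, hence a lower ratio.
    scores = {
        inst: COEF_RQT * m["queue_time"] + COEF_NRW * m["num_waiting"]
        for inst, m in metrics.items()
    }
    worst = max(scores.values()) or 1.0
    return {
        inst: max(1, round(max_ratio * (1 - score / worst)))
        for inst, score in scores.items()
    }

# Example with two hypothetical vLLM pods
print(compute_ratios({
    "10.128.2.15:8000": {"queue_time": 0.2, "num_waiting": 1},
    "10.128.2.16:8000": {"queue_time": 1.5, "num_waiting": 6},
}))
```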
Conclusion and next steps
F5 BIG-IP v21 introduces support for the JSON protocol, which greatly facilitates dealing with AI inference requests and implementing the desired Business Logic. Moreover, using the F5 BIG-IP iControl REST API, it is possible to bias the load balancing decision with metrics gathered from Prometheus to optimize the response time and improve the overall performance of the inference server pools. All this is automated using F5 CIS for Red Hat OpenShift, which incorporates Prometheus.
This article is a PoC showing the potential of BIG-IP v21 for AI inferencing workloads. In the future, this could be extended in several ways:
- The InferenceBodyBasedRouting iRule could be extended to allow logic based on the number of tokens or to perform request rewriting, which BIG-IP’s JSON protocol support makes easy.
- The llm-load-controller could be enhanced to bias the decision based on the KV-cache.