Securing Model Serving in Red Hat OpenShift AI (on ROSA) with F5 Distributed Cloud API Security

As enterprises embrace Generative AI—particularly deploying large language models (LLMs) and other foundational AI models—production environments become increasingly complex. Organizations need an end-to-end MLOps platform that streamlines the entire lifecycle: developing, training, fine-tuning, and especially serving models at scale.

Red Hat OpenShift AI (OpenShift AI) meets this need by providing a comprehensive, hybrid MLOps environment. When deployed on Red Hat OpenShift Service on AWS (ROSA), OpenShift AI takes advantage of AWS’s managed infrastructure services and cloud-native elasticity, enabling organizations to scale AI/ML workflows efficiently and cost-effectively.

Yet, as models are served to end-users or integrated into downstream applications, security considerations become paramount. Inference endpoints may be targeted for unauthorized access, data exfiltration, or prompt manipulation. F5 Distributed Cloud addresses these challenges by offering robust capabilities—API discovery, schema enforcement, LLM-aware threat detection, bot mitigation, sensitive data redaction, and continuous observability—to ensure that inference endpoints remain secure, compliant, and high-performing.

In this post, we will:

  • Introduce OpenShift AI and its capabilities for AI/ML workloads with an emphasis on model serving.
  • Discuss how running OpenShift AI on ROSA leverages AWS services for scalable, cost-effective AI/ML operations.
  • Show how F5 Distributed Cloud API Security enhances the security posture of generative AI model inference endpoints—covering automated API discovery, schema enforcement, threat detection, rate limiting, and compliance measures.
  • Demonstrate how to integrate these capabilities end-to-end, using Ollama Mistral-7B as an example model.

What is Red Hat OpenShift AI for Generative AI Applications?

Red Hat OpenShift AI is a hybrid MLOps platform designed to simplify the entire AI lifecycle. It empowers teams to train, fine-tune, deploy, serve, monitor, and continuously improve generative AI models. By merging IT operations, data science workflows, and application development practices into one platform, OpenShift AI accelerates innovation, fosters governance, and encourages collaboration—essential ingredients for delivering enterprise-grade AI solutions.

Key Features:

  • Hybrid MLOps Platform: Unified environment for data scientists, developers, and operators, whether you run on-premises, in the cloud, or in a hybrid setup.
  • Distributed Workloads & Fine-Tuning: Scale model training and fine-tuning across distributed compute frameworks, adapting large language models to your domain’s requirements.
  • Model Serving & Monitoring: Deploy and serve models at scale using technologies like KServe, ModelMesh, and a variety of specialized runtimes (OpenVINO™ Model Server, Caikit/TGIS, NVIDIA Triton Inference Server, etc.), as sketched in the example after this list. Continuously monitor model performance, detect drift, and ensure ongoing model quality.
  • Lifecycle Management & DevOps Integration: Seamlessly integrate data science pipelines with CI/CD workflows. Automate model deployment, rollout new versions safely, and achieve consistent delivery of AI-driven features.
  • Enhanced Collaboration: Enable data scientists, developers, and IT Ops to work together using notebooks (JupyterLab), popular frameworks (TensorFlow, PyTorch), and unified governance, speeding up innovation and time-to-value.
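
For readers who prefer to see the serving path in code, below is a minimal, illustrative sketch that creates a KServe InferenceService from Python using the Kubernetes dynamic client. The model name, namespace, model format, and storage URI are placeholders; in practice, OpenShift AI provisions serving runtimes and InferenceServices for you through its dashboard and data science pipelines.

from kubernetes import config, dynamic
from kubernetes.client import api_client

# Connect to the cluster using the local kubeconfig (e.g., from `oc login` against ROSA).
config.load_kube_config()
dyn = dynamic.DynamicClient(api_client.ApiClient())

# Illustrative InferenceService; the names, model format, and storage URI are placeholders.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "demo-model", "namespace": "llm-serving"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "onnx"},
                "storageUri": "s3://my-models/demo-model/",
            }
        }
    },
}

isvc_api = dyn.resources.get(api_version="serving.kserve.io/v1beta1", kind="InferenceService")
isvc_api.create(body=inference_service, namespace="llm-serving")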

High-Level Architecture:
OpenShift AI integrates workbenches, distributed workloads, data science pipelines, serving engines, and monitoring tools atop Kubernetes and OpenShift operators, leveraging GitOps, pipelines, service mesh, and serverless technologies.

Figure 1: OpenShift AI High-Level Architecture

Running OpenShift AI on ROSA for Scalable AI/ML Solutions

Red Hat OpenShift Service on AWS (ROSA) brings a fully managed OpenShift environment to AWS. This allows teams to focus on building and serving AI models rather than managing infrastructure.

Key advantages:

  • Scalability: Seamlessly scale GPU-accelerated compute, storage, and networking resources as model serving workloads grow.
  • Cost Efficiency & On-Demand Resources: Leverage Amazon EC2 instances, Amazon S3, and other AWS services only as needed, paying as you go.
  • Unified Management: Offload cluster operations and lifecycle management to ROSA, ensuring reliable operations and freeing your team to concentrate on AI innovation.

Security Challenges in Model Serving for Generative AI

While OpenShift AI and ROSA simplify operations, serving AI models still raises critical security concerns:

  • Unauthorized Access & Data Leakage: External requests to inference endpoints may attempt to extract proprietary knowledge or sensitive data from the model.
  • Prompt Injection & Malicious Content: LLMs can be tricked into producing harmful or confidential outputs if the prompts are manipulated.
  • Bot Attacks & Performance Risks: Automated scripts can overwhelm inference endpoints, degrade performance, or escalate costs.
  • Compliance & Sensitive Data Handling: AI outputs can contain PII or regulated data, necessitating encryption, redaction, and audit trails to meet compliance demands.
  • Evolving Threat Landscape: The complexity and dynamism of AI models and APIs call for continuous posture management and adaptive threat detection.

Enhancing Security with F5 Distributed Cloud API Security

F5 Distributed Cloud provides a comprehensive set of capabilities tailored for securing modern AI inference endpoints. By integrating these capabilities with OpenShift AI deployments on ROSA, organizations gain:

  1. Automated API Discovery & Posture Management:
    • Identify all inference endpoints automatically, surfacing shadow and undocumented APIs.
    • Enforce schemas based on observed traffic to ensure requests and responses follow expected patterns.
    • Integrate “shift-left” security checks into CI/CD pipelines, catching misconfigurations before production.
  2. LLM-Aware Threat Detection & Request Validation:
    • Detect attempts to manipulate prompts or break compliance rules, ensuring suspicious requests are blocked before reaching the model.
  3. Bot Mitigation & Adaptive Rate Controls:
    • Differentiate between legitimate users and bots or scrapers, blocking automated attacks.
    • Dynamically adjust rate limits and policies based on usage history and real-time conditions, maintaining performance and reliability.
  4. Sensitive Data Redaction & Compliance:
    • Identify and mask PII or sensitive tokens in requests and responses.
    • Adhere to data protection regulations and maintain detailed logs for auditing, monitoring, and compliance reporting.
  5. Seamless Integration & Observability:
    • Deploy F5 Distributed Cloud seamlessly alongside OpenShift AI on ROSA without reshuffling existing architecture.
    • Use centralized dashboards and analytics to monitor real-time metrics—latency, error rates, compliance indicators—to continuously refine policies and adapt to emerging threats.

Example: Working with Multiple Ollama Models and Programmatic Inference

In this scenario, multiple Ollama models have been deployed on the OpenShift cluster. For instance:

sh-5.1$ ollama ls 
NAME                    ID              SIZE    MODIFIED           
llama2:7b               78e26419b446    3.8 GB  3 days ago        
llama3.2:1b             baf6a787fdff    1.3 GB  2 weeks ago       
mario:latest            7434c42677ab    3.8 GB  12 seconds ago    
mistral:latest          f974a74358d6    4.1 GB  11 days ago       
orca-mini:latest        2dbd9f439647    2.0 GB  About a minute ago
phi3:latest             4f2222927938    2.2 GB  2 weeks ago       
phi3:mini               4f2222927938    2.2 GB  2 weeks ago       
tinyllama:latest        2644915ede35    637 MB  2 weeks ago     

We have a variety of models—ranging from Mistral-7B to tinyllama—that can be served simultaneously. While the environment is currently using CPUs for hosting these models, you could leverage GPUs for better performance in a production scenario.

Unlike the model training and fine-tuning phase—where we worked directly within OpenShift AI Workbenches (to be covered in future blogs)—in this scenario we’re querying the LLM endpoint from a local application with Python and LangChain installed. Instead of hitting the cluster directly, we route traffic through an F5 Distributed Cloud-managed URL (e.g., http://llm01.volt.thebizdevops.net). In a real-world deployment, the frontend could also be hosted on OpenShift or served through F5 Distributed Cloud Regional Edges (RE), providing flexible options for scaling and delivering the application.
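
Below is a minimal client-side sketch of that flow. It assumes the standard Ollama /api/generate request format (the same endpoint shown later in Figure 13) and simply points the client at the F5 Distributed Cloud-managed URL rather than the cluster route; the commented lines show the equivalent call through LangChain’s Ollama integration.

import requests

# All traffic targets the F5 Distributed Cloud-managed URL, not the cluster route, so WAF,
# schema validation, bot defense, and rate limiting apply before the request reaches the model.
XC_URL = "http://llm01.volt.thebizdevops.net"

payload = {
    "model": "mistral",    # any model shown by `ollama ls`
    "prompt": "Summarize the key exclusions in a standard auto insurance policy.",
    "stream": False,       # return a single JSON object instead of a token stream
}

resp = requests.post(f"{XC_URL}/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])

# Equivalent call via LangChain (assumes the langchain-community package is installed):
# from langchain_community.llms import Ollama
# llm = Ollama(base_url=XC_URL, model="mistral")
# print(llm.invoke("Summarize the key exclusions in a standard auto insurance policy."))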

This ensures that all requests pass through F5 Distributed Cloud’s security layers, applying policies, detecting threats, and protecting sensitive data before they reach the LLM endpoint hosted in OpenShift AI on ROSA.

Figure 2: Integrated Architecture with F5 Distributed Cloud and OpenShift AI on AWS

F5 Distributed Cloud Capabilities in Action

To illustrate the key F5 Distributed Cloud features, we’ve divided them into distinct capabilities and included screen captures from the F5 Distributed Cloud console. These captures provide a visual reference for understanding how each capability enhances the security and compliance posture of your LLM inference endpoints.

1. API Discovery & Schema Enforcement

What It Does: F5 Distributed Cloud automatically identifies all exposed inference endpoints for your AI/LLM models. It then derives schemas from real traffic, enforcing expected request and response formats. By blocking malformed inputs early, your model stays protected, ensuring consistent, reliable, and trustworthy inferences.

Figure 3: API Discovery & Schema Enforcement
(Refer to the annotated image above showing multiple discovered endpoints, shadow APIs, and downloadable OpenAPI specifications derived from actual traffic patterns.)
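
To make the idea concrete, the snippet below sketches the kind of request schema that could be derived for /api/generate and validates two request bodies against it locally using the jsonschema library. The real learned schema lives in, and is enforced by, F5 Distributed Cloud; this sketch only illustrates what schema enforcement rejects (for example, unexpected fields).

from jsonschema import ValidationError, validate

# Illustrative schema for the Ollama generate endpoint; the schema derived from live traffic is richer.
generate_request_schema = {
    "type": "object",
    "required": ["model", "prompt"],
    "additionalProperties": False,
    "properties": {
        "model": {"type": "string"},
        "prompt": {"type": "string", "maxLength": 4096},
        "stream": {"type": "boolean"},
    },
}

good = {"model": "mistral", "prompt": "What does comprehensive auto coverage include?"}
bad = {"model": "mistral", "prompt": "hi", "debug_dump_weights": True}  # unexpected field

for body in (good, bad):
    try:
        validate(instance=body, schema=generate_request_schema)
        print("conforms to schema")
    except ValidationError as err:
        print(f"rejected: {err.message}")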

2. LLM-Aware Threat Detection & Request Validation

What It Does: This feature identifies potential threats to your LLM endpoints by enforcing strict OpenAPI-based validation on incoming requests. By catching invalid or suspicious inputs early, you can adjust policies to block them in the future, ensuring that malicious attempts—whether aiming to exploit the LLM’s behavior or break compliance rules—never reach the inference logic.

Figure 4: Security Analysis for a Non-Compliant Request
(Here, the request triggered an OpenAPI validation failure. Although currently “allowed,” policies can be easily configured to “block” these events going forward, preventing non-compliant or potentially harmful requests from impacting your LLM models.)

 

Figure 5: API Inventory Validation Configuration
(This image illustrates the corresponding configuration settings for OpenAPI Validation. By validating both requests and responses at multiple layers—headers, body, and content-type—F5 Distributed Cloud ensures that LLM prompts remain safe, compliant, and free from injection attacks.)
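
You can observe the same behavior from the client side by sending one well-formed and one malformed request through the protected URL and comparing the responses, as in the hedged sketch below. Whether the malformed call is only logged as a validation event (as in Figure 4) or rejected outright depends on how the OpenAPI Validation processing mode is configured.

import requests

XC_URL = "http://llm01.volt.thebizdevops.net"

valid = {"model": "mistral", "prompt": "Explain prompt injection in one sentence.", "stream": False}
invalid = {"model": "mistral", "prompt": 42, "unexpected_field": "ignore previous instructions"}

for name, body in [("valid", valid), ("invalid", invalid)]:
    r = requests.post(f"{XC_URL}/api/generate", json=body, timeout=120)
    # In report-only mode both calls may return 2xx while the violation is logged;
    # once the policy is set to block, the invalid call is rejected at the edge.
    print(f"{name:8s} -> HTTP {r.status_code}")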

3. Bot Mitigation & Rate Limiting

What It Does: Differentiates legitimate user traffic from automated bots or scrapers (bot mitigation) and ensures fair usage of resources (rate limiting). By blocking bot-driven abuse and enforcing request thresholds, F5 Distributed Cloud protects inference endpoints from performance degradation while maintaining a positive user experience.

Figure 6: Bot Defense in Action
(This image shows how F5 Distributed Cloud identifies and classifies automated traffic as “Bad Bots,” blocking them to preserve endpoint availability and prevent resource exhaustion.)

 

Figure 7: Configuring Rate Limits
(This image illustrates the setup of request thresholds for a specific endpoint, ensuring no single client overwhelms the inference service.)

 

Figure 8: Enforced Rate Limit (429 Too Many Requests)
(This image demonstrates a client exceeding the above-configured request limit and receiving a 429 response, confirming that F5 Distributed Cloud’s rate limiting is actively maintaining fair resource allocation.)
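
From the application side, these 429 responses are straightforward to handle gracefully. The sketch below retries with exponential backoff and honors the Retry-After header when the edge supplies one; the retry count, delays, and payload are illustrative.

import time
import requests

XC_URL = "http://llm01.volt.thebizdevops.net"
payload = {"model": "mistral", "prompt": "Draft a short claim acknowledgement email.", "stream": False}

def generate_with_backoff(max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(f"{XC_URL}/api/generate", json=payload, timeout=120)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()["response"]
        # Rate limited at the edge: wait for Retry-After if present, otherwise back off exponentially.
        time.sleep(float(resp.headers.get("Retry-After", delay)))
        delay *= 2
    raise RuntimeError("rate limit still in effect after retries")

print(generate_with_backoff())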

4. Sensitive Data Redaction & Compliance Logging

What It Does: Identifies and masks personally identifiable information (PII) or other sensitive data—such as credit card numbers, emails, and phone numbers—within model responses. New Sensitive Data Exposure Rules allow you to customize and enforce policies to block or redact sensitive fields dynamically. This ensures compliance with frameworks like HIPAA, GDPR, and other regulatory mandates while capturing detailed logs for auditing.
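
F5 Distributed Cloud performs this redaction at the edge, but applications sometimes add a second, local pass for defense in depth. The sketch below shows a simple, illustrative regex-based masking step applied to model output; the patterns are intentionally minimal and are not a substitute for the platform-side rules shown in the figures that follow.

import re

# Minimal application-side masking of a few common PII patterns in model output,
# complementing the redaction already enforced by F5 Distributed Cloud.
PII_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text

print(redact("Call 555-123-4567 about card 4111 1111 1111 1111, SSN 123-45-6789."))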

Figure 9: Adding Sensitive Data Exposure Rules
(This image shows how you can add custom rules to detect and control exposure of sensitive fields, such as card-expiration dates, phone numbers, and credit-card details, ensuring model responses comply with organizational security policies.)

 

Figure 10: Sensitive Data Detection Across APIs
(Here, sensitive data types—like social-security numbers and phone numbers—are automatically detected across API responses. Built-in and custom rules flag potential exposures, empowering teams to enforce redaction and maintain compliance.)

 

Figure 11: Service Policy for Model Validation
(This image shows a service policy in action, blocking an inference request that doesn’t meet defined model validation criteria. Such policies can also be tied to compliance mandates, ensuring non-compliant responses are never returned to clients.)

 

Figure 12: API Compliance & Sensitive Data Detection
(Here, sensitive fields such as credit card and phone number are automatically identified, and the associated compliance frameworks (HIPAA, GDPR) are recognized. This empowers you to enforce data redaction, maintain regulatory compliance, and produce audit-ready logs without revealing sensitive details.)

5. Centralized Observability & Continuous Policy Updates

What It Does: Offers dashboards and analytics tools to monitor request volumes, latency, errors, and compliance metrics across your AI inference endpoints. Security teams can leverage these observations to continuously refine their policies, enhance schema validations, and recalibrate rate limits as threats evolve or model usage grows.

Figure 13: Endpoint-Level Metrics Dashboard
(This example shows an LLM endpoint /api/generate with available metrics including error rate, latency, request rate, request size, response size, and throughput. By monitoring these trends, teams can quickly identify performance bottlenecks, detect anomalies, and apply targeted policy changes to maintain optimal efficiency and security.)
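
These edge metrics can also be cross-checked from the application. The sketch below measures per-request latency and error rate against /api/generate from the client side, which helps correlate what users actually experience with the endpoint-level metrics shown above; the request count and payload are illustrative.

import statistics
import time
import requests

XC_URL = "http://llm01.volt.thebizdevops.net"
payload = {"model": "mistral", "prompt": "Give a one-line summary of HIPAA.", "stream": False}

latencies, errors = [], 0
for _ in range(10):
    start = time.perf_counter()
    try:
        r = requests.post(f"{XC_URL}/api/generate", json=payload, timeout=120)
        if r.status_code >= 400:
            errors += 1
    except requests.RequestException:
        errors += 1
    latencies.append(time.perf_counter() - start)

print(f"p50 latency: {statistics.median(latencies):.2f}s, error rate: {errors / len(latencies):.0%}")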

The Outcome: Secure, Compliant, and Performant LLM Serving

By combining Red Hat OpenShift AI on ROSA with F5 Distributed Cloud, organizations can:

  • Confidently serve multiple LLMs at scale, handling diverse use cases and workloads.
  • Securely expose inference endpoints, ensuring that requests from external clients are validated, sanitized, and protected against prompt injection, unauthorized access, or excessive traffic.
  • Maintain compliance and privacy, redacting sensitive data and logging requests for auditing and reporting purposes.
  • Continuously adapt to evolving threats, leveraging real-time observability and agile policy management for persistent security improvements.

This powerful combination enables generative AI models to be woven into complex enterprise workflows—such as insurance claims processing—without sacrificing trust, governance, or user satisfaction.

Conclusion

Red Hat OpenShift AI on ROSA, bolstered by F5 Distributed Cloud API Security, provides a robust, scalable, and secure foundation for running generative AI workloads in production. Together, they address the nuanced security challenges of exposing LLM inference endpoints to external clients.

Whether you are working with a single LLM or managing a portfolio of models served through Ollama, such as Mistral, Phi-3, and TinyLlama, this combination ensures that your users—connecting from anywhere—can trust the quality, security, and compliance of the AI services they rely on.

Published Dec 18, 2024
Version 1.0