Adopting SRE practices with F5: Observability and beyond with ELK Stack

This article is a joint collaboration between Eric Ji and JC Kwon.

Getting started

In the previous article, we explained SRE (Site Reliability Engineering) and how F5 helps SRE deploy and secure modern applications.

We have already discussed why observability is essential for SREs to implement SLOs. Meanwhile, there is a wide range of monitoring tools and analytics applications, each tied to specific devices or running only for certain applications.

In this article, we will explore one of the most commonly used logging toolsets: the ELK stack.

The ELK stack is a collection of three open-source projects: Elasticsearch, Logstash, and Kibana. It gives IT project stakeholders multi-system, multi-application log aggregation and analysis capabilities. In addition, the ELK stack puts data visualization at stakeholders' fingertips, which is essential for security analytics, system monitoring, and troubleshooting.

A brief description of the three projects:

  • Elasticsearch is an open-source, full-text search and analysis engine.
  • Logstash is a log aggregator that executes transformations on data derived from various input sources, before transferring it to output destinations.
  • Kibana provides data analysis and visualization capabilities for end-users, complementary to Elasticsearch.

In this article, the ELK stack is used to analyze and visualize application performance through a centralized dashboard. A dashboard enables end-users to easily correlate North-South traffic with East-West traffic, for end-to-end performance visibility.

Overview

This use case is built on top of targeted canary deployment. As shown in the diagram below, we take advantage of an iRule on BIG-IP: a UUID is generated and inserted into the HTTP header of every HTTP request packet arriving at BIG-IP.

All traffic access logs will contain the UUIDs when they are sent to the ELK server, so that information such as user location, response time by user location, and the response times of BIG-IP and NGINX Plus can be correlated and validated.
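The UUID construction the iRule performs later in this article (hashing time, addresses, and random values with MD5, then forcing UUID-style bits into the digest) can be sketched in Python. This is an illustrative sketch only: the function name and seed composition are assumptions, and it uses the standard RFC 4122 version/variant bit positions.

```python
import hashlib
import random
import time

def make_request_id(seed: str) -> str:
    """Derive a 32-hex-char, UUIDv4-style request ID from seed material."""
    digest = bytearray(hashlib.md5(seed.encode()).digest())
    # Force RFC 4122-style bits: byte 6 carries the version nibble (0100),
    # byte 8 carries the variant bits (10xx).
    digest[6] = (digest[6] & 0x0F) | 0x40
    digest[8] = (digest[8] & 0x3F) | 0x80
    return digest.hex()

# The iRule seeds the hash with a timestamp, local/client IPs, and randomness;
# these concrete values are placeholders.
seed = f"{time.time()}10.0.0.1192.0.2.10{random.randint(0, 10**8)}"
request_id = make_request_id(seed)
```

Because the ID is carried in a cookie and re-used on subsequent requests, every log line for the same user session shares the same value, which is what makes the cross-tier correlation possible.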

Setup and Configuration

1.   Create HSL pool, iRule on BIG-IP

First, we created a High-Speed Logging (HSL) pool on BIG-IP, to be used by the ELK Stack, and assigned it to the sample application. This pool member is used by the iRule to send access logs from BIG-IP to the ELK server, which listens for incoming log messages.
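As a sketch, the pool creation and iRule attachment can be done from tmsh; the member address and port, virtual server name, and iRule name below are placeholders for your environment, not values from the original setup.

```
# Create the HSL pool pointing at the ELK server's log input
# (address/port are placeholders).
create ltm pool pool_elk members add { 10.1.20.50:514 }

# Attach the logging iRule (name assumed) to the application virtual server.
modify ltm virtual vs_app rules { irule_elk_logging }

save sys config
```

The pool name `pool_elk` matches the one referenced by `HSL::open` in the iRule below.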

Below is the iRule that we created.

when CLIENT_ACCEPTED {
   set timestamp [clock format [clock seconds] -format "%d/%h/%y:%T %Z" ]
}
 
when HTTP_REQUEST {
   # UUID injection
   if { [HTTP::cookie x-request-id] == "" } {
       append s [clock seconds] [IP::local_addr] [IP::client_addr] [expr { int(100000000 * rand()) }] [clock clicks]
 
       set s [md5 $s]
 
       binary scan $s c* s
       lset s 8 [expr {([lindex $s 8] & 0x7F) | 0x40}]
       lset s 6 [expr {([lindex $s 6] & 0x0F) | 0x40}]
       set s [binary format c* $s]
       binary scan $s H* s
 
       set myuuid $s
       unset s
 
 
       set inject_uuid_cookie 1
   } else {
 
       set myuuid [HTTP::cookie x-request-id]
       set inject_uuid_cookie 0
   }
 
   set xff_ip "[expr int(rand()*100)].[expr int(rand()*100)].[expr int(rand()*100)].[expr int(rand()*100)]"
 
   set hsl [HSL::open -proto UDP -pool pool_elk]
   set http_request "\"[HTTP::method] [HTTP::uri] HTTP/[HTTP::version]\""
   set http_request_time [clock clicks -milliseconds]
   set http_user_agent "\"[HTTP::header User-Agent]\""
   set http_host [HTTP::host]
   set http_username [HTTP::username]
   set client_ip [IP::remote_addr]
   set client_port [TCP::remote_port]
   set http_request_uri [HTTP::uri]
   set http_method [HTTP::method]
   set referer "\"[HTTP::header value referer]\""
 
   if { [HTTP::uri] contains "test" } {
       HTTP::header insert "x-request-id" "test-$myuuid"
 
   } else {
       HTTP::header insert "x-request-id" $myuuid
   }
   HTTP::header insert "X-Forwarded-For" $xff_ip
}
 
 
when HTTP_RESPONSE {
 
   set syslogtime [clock format [clock seconds] -format "%h %e %H:%M:%S"]
 
   set response_time [expr {double([clock clicks -milliseconds] - $http_request_time)/1000}]
 
   set virtual [virtual]
   set content_length 0
   if { [HTTP::header exists "Content-Length"] } {
       set content_length \"[HTTP::header "Content-Length"]\"
   } else {
       set content_length \"-\"
   }
   set lb_server "[LB::server addr]:[LB::server port]"
    if { [string compare "$lb_server" ""] == 0 } {
        # no pool member was selected; log a placeholder instead
        set lb_server "-"
    }
   set status_code [HTTP::status]
   set content_type \"[HTTP::header "Content-type"]\"
 
   # construct log for elk, local6.info <182>
   set log_msg "<182>$syslogtime f5adc tmos: "
   #set log_msg ""
   append log_msg "time=\[$timestamp\] "
   append log_msg "client_ip=$client_ip "
   append log_msg "virtual=$virtual "
   append log_msg "client_port=$client_port "
   append log_msg "xff_ip=$xff_ip "
   append log_msg "lb_server=$lb_server "
 
   append log_msg "http_host=$http_host "
   append log_msg "http_method=$http_method "
   append log_msg "http_request_uri=$http_request_uri "
 
   append log_msg "status_code=$status_code "
   append log_msg "content_type=$content_type "
   append log_msg "content_length=$content_length "
 
   append log_msg "response_time=$response_time "
   append log_msg "referer=$referer "
   append log_msg "http_user_agent=$http_user_agent "
   append log_msg "x-request-id=$myuuid "
 
   if { $inject_uuid_cookie == 1} {
       HTTP::cookie insert name x-request-id value $myuuid path "/"
       set inject_uuid_cookie 0
   }
 
   # log local2. sending log to elk via log publisher
   #log local2. $log_msg
   HSL::send $hsl $log_msg
}

Next, we added a new VIP referencing the HSL pool created earlier, and applied the iRule to this VIP. All access logs, each containing the respective UUID for its HTTP request, are then sent to the ELK server.

Now, the ELK server is ready for the analysis of the BIG-IP access logs.
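On the ELK side, a Logstash pipeline along these lines could receive the HSL messages and split the key=value pairs before indexing. The port, index name, and field handling are assumptions for illustration; the bracketed `time=[...]` field, for example, contains spaces and may need a dedicated grok pattern rather than the plain `kv` filter.

```
input {
  udp { port => 514 }
}
filter {
  # Split the space-separated key=value pairs produced by the iRule.
  kv { source => "message" }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "bigip-access-%{+YYYY.MM.dd}"
  }
}
```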

2.   Configure NGINX Plus Logging

We configured logging for each NGINX Plus instance deployed inside the OpenShift cluster through its respective ConfigMap object. Here is one example:
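The original example is shown as a screenshot in the source article. A representative sketch of the relevant logging directives might look like the following, where the ConfigMap name, syslog server, port, and format details are assumptions. The key point is that the `x-request-id` header inserted by BIG-IP is visible to NGINX as `$http_x_request_id`, so both tiers log the same ID.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    http {
      # Mirror the BIG-IP log fields, including the injected request ID.
      log_format elk 'time=[$time_local] client_ip=$remote_addr '
                     'request="$request" status_code=$status '
                     'response_time=$request_time '
                     'x-request-id=$http_x_request_id';
      access_log syslog:server=elk.example.com:514 elk;
      # ... server/upstream blocks omitted ...
    }
```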


3.   Customize Kibana Dashboard

With all configurations in place, log information will be processed by the ELK server. We will be able to customize a dashboard containing useful, visualized data, like user location, response time by location, etc.

When an end-user accesses the service, the VIP responds and the iRule is applied. Next, the user's HTTP header information is checked by the iRule, and logs are forwarded to the ELK server for analysis. As the user accesses the app services, the app servers' logs are also forwarded to the ELK server based on the NGINX Plus ConfigMap settings.

The list of key indicators available on the Kibana dashboard page is rather long, so we won't describe all of them here. You can check the details here.

4.   ELK Dashboard Samples

We can easily customize the data for visualization in the centralized display; the following are just a few dashboard samples.

We can look at user location counts and response time by user location:

We can check the average response time and max response time for each endpoint:
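A chart like this per-endpoint average/max response time can be backed by a terms aggregation. The following Elasticsearch query is an illustrative sketch; the index pattern and field names are assumptions based on the log fields produced by the iRule above.

```
GET bigip-access-*/_search
{
  "size": 0,
  "aggs": {
    "per_endpoint": {
      "terms": { "field": "http_request_uri.keyword" },
      "aggs": {
        "avg_response_time": { "avg": { "field": "response_time" } },
        "max_response_time": { "max": { "field": "response_time" } }
      }
    }
  }
}
```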

We can see the correlation between BIG-IP (N-S traffic) and NGINX Plus endpoints (E-W traffic):

We can also check the response time for N-S traffic:

Summary

In this article, we showed how the ELK stack joins forces with F5 BIG-IP and NGINX Plus to provide an observability solution for visualizing application performance with a centralized dashboard. With combined performance metrics, it opens up many possibilities for SREs to implement SLOs practically.

F5 aims to provide the best solution to support your business success and will continue with more use cases. If you want to learn more about this and other SRE use cases, please visit the F5 DevCentral GitHub link here.

Published May 11, 2021
Version 1.0
