The Operational Model for Cloud/Automated Systems Architectures

Recap

Previous Article: The Deployment Model for Cloud/Automated Systems Architectures

Ok. We're almost there. The finish line is in sight! Last week we covered our Deployment Model. This week we will wrap up this series of articles by covering our Operational Model and our Conclusion.

Operational Model

So, we've covered how we use our Service Model to define Appropriately Abstracted L4-7 services. We then use our Deployment Model to deploy those Services into various Environments using DevOps Continuous Deployment and Continuous Improvement methodologies. One way to think about the timeline is the following:

Service Model: Day < 0
Deployment Model: Day 0
Operational Model: Day 0+n

Our Operational Model is focused on Operating Automated Systems in a safe, production manner. From a consumer perspective the Operational Model is, in many ways, the most critical because it's being interacted with on a regular basis, which ties in with our previous definition:

"Provides stable and predictable workflows for changes to production environments, data and telemetry gathering and troubleshooting."

Following the pattern from our previous article, let's cover the Truths, Attributes and an F5 Expression of the Operational Model.

Operational Model Truths

Let's take a look at the Truths for the Operational Model:

Support Mutability from Triggered or Sensed Metrics
Mutability Must Consume a Source of Truth
Bound Elasticity within the capabilities of Deployment Model Scalability attributes
Enable Break/Fix & Troubleshoot Operations
Provide Analytics & Visibility

Support Mutability from Triggered or Sensed Metrics

In the previous articles we mentioned the term Mutate (or in this context Mutability) a few times. Mutate, mutations and mutability are all ways of saying changes to a Service Deployment in an Environment. These changes can be large, such as deploying a Service in a new Environment, or small, such as updating the Server IPs contained within a Pool of resources.

Triggered Mutability is predicated on a system outside of the vendor specific automation framework directing a Mutation of a Service. These mutation actions can be triggered by either other Automated Systems or by Humans.

Sensed Mutability is predicated on the vendor specific automation framework sensing that a change to the Service Deployment is required and effecting the change in an automated fashion.

When implementing our Operational Model it is critical that we define which mutations of the Service are Triggered or Sensed. Furthermore, the model should consume as many Sensed Mutations as possible.
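One way to make the Triggered/Sensed decision explicit is to declare it per mutation. The following is a minimal sketch, not part of any F5 tooling; the names Origin, MutationPolicy, POLICIES and is_allowed, and the two example mutations, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    TRIGGERED = "triggered"  # directed by a human or an external automated system
    SENSED = "sensed"        # detected and applied by the automation framework itself


@dataclass(frozen=True)
class MutationPolicy:
    name: str
    origin: Origin


# Hypothetical catalogue: every permitted mutation is declared up front
# together with the single origin it may come from.
POLICIES = {
    "deploy_service": MutationPolicy("deploy_service", Origin.TRIGGERED),
    "update_pool_members": MutationPolicy("update_pool_members", Origin.SENSED),
}


def is_allowed(mutation: str, origin: Origin) -> bool:
    """Return True only if the mutation is declared with this origin."""
    policy = POLICIES.get(mutation)
    return policy is not None and policy.origin is origin
```

With a table like this, a pool member update requested by hand would be flagged, because that mutation was defined as Sensed.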

Mutability Must Consume a Source of Truth

Mutations of a Service outside of the Source of Truth (Out-of-band, or, OOB) result in a fundamental problem in Computer Science called the Consensus Problem. This problem is defined as:

"The consensus problem requires agreement among a number of processes (or agents) for a single data value. Some of the processes (agents) may fail or be unreliable in other ways, so consensus protocols must be fault tolerant or resilient. The processes must somehow put forth their candidate values, communicate with one another, and agree on a single consensus value." [1]

This truth can be summed up simply:

"No Out-of-Band Changes. Ever!"

When OOB changes occur in most Environments it is not possible to reach consensus in an automated fashion. This results in a human having to act as the arbiter of all disputes and can have massive impacts on the reliability of the system. To avoid this issue we must drive all Operational Mutations through a Source of Truth so the system remains in a consistent state.
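To illustrate why OOB changes are a problem, here is a small sketch that compares what the Source of Truth declares against what is actually running. It assumes both can be represented as flat dictionaries; the function name find_drift and the example keys are hypothetical.

```python
def find_drift(source_of_truth: dict, running: dict) -> dict:
    """Return {setting: (expected, actual)} for every setting that differs.

    A setting missing on either side is reported with None in its place.
    """
    drift = {}
    for key in sorted(source_of_truth.keys() | running.keys()):
        expected = source_of_truth.get(key)
        actual = running.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical example: someone changed a pool member directly on the device.
desired = {"pool_members": ["10.0.0.1", "10.0.0.2"], "port": 443}
actual = {"pool_members": ["10.0.0.1", "10.0.0.3"], "port": 443}
drift = find_drift(desired, actual)
```

Any non-empty result means a change was made outside the Source of Truth. The automation cannot know which side is "right", which is exactly the dispute a human would otherwise have to settle.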

References: [1] https://en.wikipedia.org/wiki/Consensus_(computer_science)

Bound Elasticity within the capabilities of Deployment Model Scalability attributes

In our previous article we discussed the Mutable Scalability attribute of the Deployment Model. One of the key desirable attributes is the ability to scale infrastructure resources Elastically with user load. It's important to understand that Elasticity is an Operational Mutation of the underlying Scalability attribute of an Environment; therefore, we must bound our expression of Elasticity within the capabilities of the Mutable Scalability attribute in the Deployment Model.
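In its simplest form, bounding Elasticity means clamping any requested scale to the limits the Deployment Model supports. The sketch below assumes those limits can be expressed as a minimum and maximum instance count; bounded_scale and its parameters are illustrative names, not an existing API.

```python
def bounded_scale(requested: int, minimum: int, maximum: int) -> int:
    """Clamp a requested instance count to the Deployment Model's limits."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return max(minimum, min(requested, maximum))
```

For example, if the Deployment Model supports between 2 and 8 instances, a request for 12 is held at 8 regardless of whether the request was Triggered or Sensed.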

Enable Break/Fix & Troubleshoot Operations

One of the critical decisions that must be made when designing Automated Systems is how anomalous operations can be identified and resolved. A good analogy to use here is the modern airliner. Both Boeing and Airbus produce safe, efficient and reliable airplanes; however, there is a critical difference in how Boeing and Airbus design their control systems.

Boeing designs its control systems on the premise that the pilot is always in charge; they have as-direct-as-possible control over the plane's flight envelope. This includes allowing the pilots to control the plane in a way that may be deemed as exceeding the limits of its design.

Airbus, on the other hand, designs its control systems on the idea that the pilot inputs are an input to an automated system. This system then derives decisions on how to drive the control surfaces of the plane based on pilot and other inputs. The system is designed to prevent or filter out pilot input that exceeds the designed safety limits of the plane.

Personal opinions aside, in an emergency scenario, there is not necessarily any right answer on which system can overcome an anomaly. The focus is instead on training the pilots to understand how to interact with the underlying system and resolve the issue.

For this Architecture we've picked the 'Boeing' model. The reason behind this is that a reliable model for determining the 'flight envelope' does not always exist. Without this it is not possible to predictably provide the correct resolution for an anomaly (which is what the Airbus model requires).

We have purposely designed our systems to give the operator FULL control of the system at all times. The caveat here is that you should either drive change through a Source of Truth OR disable automation until an issue is resolved.
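The caveat above can be expressed as a simple check. This is only a sketch of the rule as stated, under the assumption that automation can be paused with a single flag; ControlState and manual_change_is_safe are hypothetical names. In keeping with the 'Boeing' model the check does not block the operator, it only reports whether a change risks conflicting with automation.

```python
from dataclasses import dataclass


@dataclass
class ControlState:
    automation_enabled: bool = True  # assumption: one switch pauses all automation


def manual_change_is_safe(via_source_of_truth: bool, state: ControlState) -> bool:
    """Report whether an operator change avoids a conflict with automation.

    Safe means the change goes through the Source of Truth, or automation
    has been disabled until the issue is resolved.
    """
    return via_source_of_truth or not state.automation_enabled
```

A direct change made while automation is still running returns False, signalling that the operator should either route the change through the Source of Truth or pause automation first.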

Provide Analytics & Visibility

All good Operational Models are predicated on the ability to monitor the underlying system in a concise and efficient manner. This visibility needs to be more than just a stream of log data. The Automated System should properly identify relevant data and surface that information as needed to inform automated or manual Operational Mutations.

This data should be analyzed over time to provide insights into how the system operates as a whole. This data is then used to help form our Continuous Improvement feedback loops, resulting in the ability to iterate our models over time.

Operational Model Attributes

Now that we've covered our Truths, let's take a look at our Attributes:

Mutability
Source of Truth
Elasticity
Continuous Ops
Analytics & Visibility

Mutability

"Use the inherent toolchain available to the deployment"

To implement Operational Mutability we should always use the underlying toolchain in the Deployment Model. This means that the Operators in our environment should understand the toolchain used in the Deployment Model and how they can interact with it in a safe, reliable manner.

Source of Truth

We've discussed Source of Truth quite a bit. We include this item as an Attribute to reinforce that:

"Operational changes should be driven from Source of Truth"

Elasticity

"Elasticity can be Triggered or Sensed; however, it must be bound by the Deployment Model"

Building off the explanation in our Truths section:

We could implement two different types of Scalability in the Deployment Model:

Service Level: Consume an elastic scale mutation of compute resources by adding Pool Members to a Pool
Environment Level: Scale BIG-IP instances elastically based on requests per second to a large web app.

If we've only implemented Service Level Elasticity then our Operational Model should reflect that we only allow operational mutations at the Service Level.
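A minimal sketch of that rule follows, assuming the implemented Scalability levels are recorded somewhere the Operational Model can read them. The names Level, IMPLEMENTED_LEVELS and may_scale are hypothetical.

```python
from enum import Enum


class Level(Enum):
    SERVICE = "service"          # e.g. adding Pool Members to a Pool
    ENVIRONMENT = "environment"  # e.g. scaling BIG-IP instances


# Assumption for this example: only Service Level scalability
# was implemented in the Deployment Model.
IMPLEMENTED_LEVELS = frozenset({Level.SERVICE})


def may_scale(level: Level, implemented: frozenset = IMPLEMENTED_LEVELS) -> bool:
    """Allow an elastic mutation only at a level the Deployment Model implements."""
    return level in implemented
```

Here a request to add Pool Members is permitted, while a request to add BIG-IP instances is refused until Environment Level scalability exists in the Deployment Model.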

Continuous Ops

"Always Fail Forward"

What does this mean? Let's look at its complementary definition:

"Don't roll back!"

Uncomfortable yet? Most people are! The idea behind a "Fail Forward" methodology is that issues should always be resolved in a forward manner that leverages automation. The cumulative effect of years of 'roll back' operational methodology is that most infrastructures are horribly behind in various areas (software/firmware versions, best practice config, security patches, etc.). A Fail Forward methodology allows us to Continuously Improve and Continually deliver innovation to the market.
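As a rough illustration of the idea, the sketch below keeps an append-only history of a Service definition where a fix is always a new revision on top of the current one and earlier revisions are never restored. RevisionLog, fail_forward and the example keys are hypothetical; a real implementation would sit behind whatever Source of Truth the Deployment Model uses.

```python
class RevisionLog:
    """Append-only history of a Service definition (illustrative only)."""

    def __init__(self, initial: dict):
        self._revisions = [dict(initial)]

    @property
    def current(self) -> dict:
        return dict(self._revisions[-1])

    @property
    def version(self) -> int:
        return len(self._revisions)

    def fail_forward(self, fix: dict) -> dict:
        """Apply a fix as a new revision; older revisions are never re-applied."""
        new_revision = {**self._revisions[-1], **fix}
        self._revisions.append(new_revision)
        return dict(new_revision)


# Hypothetical example: a new image needs a longer timeout.
log = RevisionLog({"image": "v2", "timeout": 5})
log.fail_forward({"timeout": 30})
```

The newer image stays in place and only the faulty setting is corrected, so the environment keeps the improvement that prompted the change in the first place.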

Analytics & Visibility

We covered most of the details in our Truths. This attribute serves as a reminder that without Analytics & Visibility we cannot effectively implement Continuous Improvement for our Model and the overall Architecture.

Operational Model - F5 Expression

This final slide shows an example of how to implement all the Attributes of the Operational Model using F5 technology. As we've discussed, it's not required to implement every attribute in the first iteration.

The slide references some F5 specific technology such as iApps, iWorkflow (iWf), etc. For context here are links to more documentation for each tool:

iApps: https://devcentral.f5.com/s/iapps
iWorkflow: https://devcentral.f5.com/s/iworkflow
App Services iApp: https://devcentral.f5.com/s/wiki/iapp.appsvcsiapp_index.ashx
Splunk iApp: https://devcentral.f5.com/s/articles/f5-analytics-iapp

Conclusion

We've laid a good foundation in this article series. Where do we go from here? Well, first, I would recommend taking some time to look at what you're trying to accomplish and fitting it into our various models. The best way to do this is to start with a blank slate. Take a look at our attribute slides and fill them in with what works for your problem set. Then take those attributes and validate them with our Architectural and Model Truths.

After a couple of iterations a path forward should appear. At that point reach out to your F5 account team and ask for an F5 Systems Engineer that specializes in Cloud. We've trained a global team of 150 SEs on this same material (using DevOps methodologies of course) and we are ready to help you move forward and leverage Automation to:

Deliver YOUR innovation to the market

Keep an eye on DevCentral in the coming weeks. We will be publishing articles that take this series one step further by showing Environment specific Implementations with technology partners such as OpenStack, Cisco ACI, VMware ESX, Amazon AWS, Microsoft Azure and Google Cloud Platform.

Thank you all for taking the time to read this series and we'll see you next time!

Published Jun 27, 2017

Version 1.0