on 21-Feb-2011 01:42
Recognizing the relationship between and subsequently addressing the three core operational risks in the data center will result in a stronger operational posture.
Risk is not a synonym for lack of security. Neither is managing risk a euphemism for information security. Risk – especially operational risk – compromises a lot more than just security.
In operational terms, the chance of loss is not just about data/information, but of availability. Of performance. Of customer perception. Of critical business functions. Of productivity. Operational risk is not just about security, it’s about the potential damage incurred by a loss of availability or performance as measured by the business. Downtime costs the business; both hard and soft costs are associated with downtime and the numbers can be staggering depending on the particular vertical industry in which a business may operate. But in all cases, regardless of industry, the end-result is the same: downtime and poor performance are risks that directly impact the bottom line.
Operational risk comprises concerns regarding:
These three concerns are intimately bound up in one another. For example, a denial of service attack left unaddressed and able to penetrate to the database tier in the data center can degrade performance which may impact availability – whether by directly causing an outage or through deterioration of performance such that systems are no longer able to meet service level agreements mandating specific response times. The danger in assuming operational risk is all about security is that it leads to a tunnel-vision view through which other factors that directly impact operational reliability may be obscured. The notion of operational risk is most often discussed as it relates to cloud computing , but it is only that cloud computing raises the components of operational risk to a visible level that puts the two hand-in-hand.
It’s the processes – the configuration and policy deployment – involving the underlying network and application network infrastructure that we seek to make repeatable, to avoid the inevitable introduction of errors and subsequently downtime due to human error. This is not to say that security is not part of that repeatable process because it is. It’s to say that it is only one piece of a much larger set of processes that must be orchestrated in such a way as to provide for consistent repetition of successful deployments that alleviates operational risk associated with the deployment of applications.
Human error by contractor Northrop Grumman Corp. was to blame for a computer system crash that idled many state government agencies for days in August, according to an external audit completed at Gov. Bob McDonnell's request.
The audit, by technology consulting firm Agilysis and released Tuesday, found that Northrop Grumman had not planned for an event such as the failure of a memory board, aggravating the failure. It also found that the data loss and the delay in restoration resulted from a failure to follow industry best practices.
At least two dozen agencies were affected by the late-August statewide crash of the Virginia Information Technologies Agency. The crash paralyzed the departments of Taxation and Motor Vehicles, leaving people unable to renew drivers licenses. The disruption also affected 13 percent of Virginia's executive branch file servers.
-- Audit: Contractor, Human Error Caused Va Outage (ABC News, February 2011)
There are myriad points along the application deployment path at which an error might be introduced. Failure to add the application node to the appropriate load balancing pool; failure to properly monitor the application for health and performance; failure to apply the appropriate security and/or network routing policies. A misstep or misconfiguration at any point in this process can result in downtime or poor performance, both of which are also operational risks. Virtualization and cloud computing can complexify this process by adding another layer of configuration and policies that need to be addressed, but even without these technologies the risk remains.
There are two sides to operational efficiency – the deployment/configuration side and the run-time side. During deployment it is configuration and integration that is the focus of efforts to improve efficiency. Leveraging devops and automation as a means to create a repeatable infrastructure deployment process is critical to achieving operational efficiency during deployment.
Achieving run-time operational efficiency often utilizes a subset of operational deployment processes, addressing the critical need to dynamically modify security policies and resource availability based on demand. Many of the same processes that enable a successful deployment can be – and should be – reused as a means to address changes in demand. Successfully leveraging repeatable sub-processes at run-time, dynamically, requires that operational folks – devops – takes a development-oriented approach to abstracting processes into discrete, repeatable functions. It requires recognition that some portions of the process are repeated both at deployment and run-time and then specifically ensuring that the sub-process is able to execute on its own such that it can be invoked as a separate, stand-alone process.
This efficiency allows IT to address operational risks associated with performance and availability by allowing IT to react more quickly to changes in demand that may impact performance or availability as well as failures internal to the architecture that may otherwise cause outages or poor performance which, in business stakeholder speak, can be interpreted as downtime.
RISK FACTOR: Repeatable deployment processes address operational risk by reducing possibility of downtime due to human error.
These operational risks are often addressed on a per-incident basis, with reactive solutions rather than proactive policies and processes. A proactive approach combines repeatable deployment processes to enable appropriate auto-scaling policies to combat the “flash crowd” syndrome that so often overwhelms unprepared sites along with a dynamic application delivery infrastructure capable of automatically adjusting delivery policies based on context to maintain consistent performance levels.
Downtime and slowdown can and will happen to all websites. However, sometimes the timing can be very bad, and a flower website having problems during business hours on Valentine’s Day, or even the days leading up to Valentine’s Day, is a prime example of bad timing.
In most cases this could likely have been avoided if the websites had been better prepared to handle the additional traffic. Instead, some of these sites have ended up losing sales and goodwill (slow websites tend to be quite a frustrating experience).
At run-time this includes not only auto-scaling, but appropriate load balancing and application request routing algorithms that leverage intelligent and context-aware health-monitoring implementations that enable a balance between availability and performance to be struck. This balance results in consistent performance and the maintaining of availability even as new resources are added and removed from the available “pool” from which responses are served. Whether these additional resources are culled from a cloud computing provider or an internal array of virtualized applications is not important; what is important is that the resources can be added and removed dynamically, on-demand, and their “health” monitored during usage to ensure the proper operational balance between performance and availability. By leveraging a context-aware application delivery infrastructure, organizations can address the operational risk of degrading performance or outright downtime by codifying operational policies that allow components to determine how to apply network and protocol-layer optimizations to meet expected operational goals.
A proactive approach has “side effect” benefits of shifting the burden of policy management from people to technology, resulting in a more efficient operational posture.
RISK FACTOR: Dynamically applying policies and making request routing decisions based on context addresses operational risk by improving performance and assuring availability.
Operational risk comprises much more than simply security and it’s important to remember that because all three primary components of operational risk – performance, availability and security – are very much bound up and tied together, like the three strands that come together to form a braid. And for the same reasons a braid is stronger than its composite strands, an operational strategy that addresses all three factors will be far superior to one in which each individual concern is treated as a stand-alone issue.