To Err is Human
#devops

Automating incomplete or ineffective processes will only enable you to make mistakes faster – and more often.

Most folks probably remember the play on the "to err is human…" proverb that emerged when computers first began to take over, well, everything. The saying was only partially tongue-in-cheek, because as we've long since learned, computers allow us to make mistakes faster, more often, and with greater reach.

One of the statistics used to justify a devops initiative is the rate at which human error contributes to a variety of operational badness: downtime, performance degradation, and deployment life-cycle time. Human error is a non-trivial cause of downtime and other operational interruptions. A recent Paragon Software survey found that human error was cited as a cause of downtime by 13.2% of respondents. Other surveys have indicated much higher rates. Gartner analysts Ronni J. Colville and George Spafford, in "Configuration Management for Virtual and Cloud Infrastructures," predict that as much as 80% of outages impacting mission-critical services through 2015 will be caused by "people and process" issues.

Regardless of the actual rates at which human error causes downtime or other operational disruptions, the reality is that it is a factor. One of the ways in which we hope to remediate the problem is through automation and devops. While certainly an appropriate course of action, adopters need to exercise caution when embarking on such an initiative, lest they codify incomplete or inefficient processes that simply propagate errors faster and more often.

DISCOVER, REMEDIATE, REFINE, DEPLOY

Something that all too often seems to fall by the wayside is the relationship between agile development and agile operations. Agile isn't just about fast(er) development cycles; it's about employing a rapid, iterative process throughout the development cycle.
Similarly, operations must remember that it is unlikely they will "get it right" the first time and, following agile methodology, are not expected to. Process iteration assists in discovering errors, missing steps, and other potential sources of misconfiguration that are ultimately the source of outages or operational disruption. An organization that has experienced outages due to human error is practically assured of codifying those errors into automation frameworks if it does not take the time to iteratively execute those processes and find out where errors or missing steps may lie.

It is process that drives continuous delivery in development, and process that must drive continuous delivery in devops – process that must be perfected first through practice, through the application of iterative models of development to devops automation and orchestration. What may appear to be tedious repetition is also an opportunity to refine the process: to discover and eliminate inefficiencies, streamlining the deployment process and enabling faster time to market. Such inefficiencies are generally only discovered when someone takes the time to clearly document all steps in the process – from beginning (build) to end (production). Cross-functional responsibilities are often the source of such inefficiencies because of the overlap between development, operations, and administration.

The outage of Microsoft's cloud service for some customers in Western Europe on 26 July happened because the company's engineers had expanded capacity of one compute cluster but forgot to make all the necessary configuration adjustments in the network infrastructure.
-- Microsoft: error during capacity expansion led to Azure cloud outage

Applying an agile methodology to the process of defining and refining devops processes around continuous delivery automation enables discovery of the errors, missing steps, and duplicated tasks that bog down or disrupt the entire chain of deployment tasks. We all know that automation is a boon for operations, particularly in organizations employing virtualization and cloud computing to enable elasticity and improved provisioning. But we need to remember that if that automation simply encodes poor processes or errors, then automation just enables us to make mistakes a whole lot faster. Take care to pay attention to process, and to test early, test often.

How to Prevent Cascading Error From Causing Outage Storm In Your Cloud Environment?
DOWNTIME, OUTAGES AND FAILURES - UNDERSTANDING THEIR TRUE COSTS
Devops Proverb: Process Practice Makes Perfect
1024 Words: The Devops Butterfly Effect
Devops is a Verb
BMC DevOps Leadership Series

Load Balancing on the Inside
Business critical internal processing systems often require high availability and fault tolerance, too.

Load balancing and application delivery is almost always associated with scaling out interactive, web-based applications. Rarely does anyone think about load balancing and application delivery in batch processing systems, even when those systems are critical to the business they support. But scaling out non-interactive processing systems and providing high availability to such critical systems is just as easily accomplished with an application delivery controller (ADC) as scaling out an interactive web-based application. Maybe easier. When such a system also requires a bit more intelligence than simple load balancing, it makes a lot of sense to look closer at a context-aware solution that can support all the requirements in a single system.

THE SCENARIO

A batch document processing system uses a document ID to match all related documents to the same "case." The first time a document ID is encountered, it creates a new "case," and subsequent documents bearing that ID are attached to the original case. To ensure processing around the clock, a redundant set of application servers is configured to process the documents, and the vendor's application server clustering solution is used to load balance documents (in simple round-robin fashion) across the two instances.

A load test is conducted, ramping up to 2500 documents per hour (about 41 per minute, fewer than 1 per second). During the test it is discovered that in some situations two documents with the same ID will arrive at the clustering solution in rapid succession. They will each be load balanced to separate instances. There is no existing "case" for this document ID. Because of processing times and load on the servers, both documents result in the creation of separate "cases." The test is considered a failure.
The system, while managing the load fine from a network perspective, executed incorrectly under load from a process perspective. The solution? Reconfigure the clustering solution to an active-standby configuration, thus introducing the process latency needed to ensure that the scenario does not occur. Retest. Success. The result? The investment in the second instance of the application server – hardware, software licenses, management, maintenance – is wasted. It is a "failover" node only, and it reduces the overall capacity – and ultimately performance at higher load levels – of the system.

WHEN CONTEXT MATTERS

This scenario is real; it was described to me by a program manager at a Fortune 500 company with a great deal of frustration, as it seemed, to her anyway, that the architects could not come up with a working solution other than wasting a perfectly good set of resources. Instinctively she described a solution that leveraged persistence to force all documents with the same ID to the same server, as it had been proven repeatedly that if all documents with the same ID were processed by the same application server, the system processed them correctly and associated them with the right "case" in all situations. But the application server clustering solution, which can provide server affinity (persistence) based on a few variables, was for some reason not able to support affinity based on the document ID. After a few questions regarding the overall system and processing times, it became clear that a context-aware application delivery controller could indeed solve this problem. The solution is fairly simple, actually, and based on existing persistence-based load balancing techniques. It is a given that documents with the same ID are batch processed within minutes of each other.
Thus, a persistence table with a life of an hour or even thirty minutes would provide the proper context in which documents could be processed and directed to the "right" application server. This requires context; it requires that the load balancing solution, the application delivery controller, be aware of not only what it is processing but what it has processed already, and where it was sent.

Document ID Based Persistence Logic

1. Extract the document ID from the document.
2. Check the persistence table for the document ID.
3. If the document ID already exists, route the document to the same server as the previous document(s) with that ID.
4. If the document ID does not exist, decide which server the document will be sent to for processing and create an entry in the persistence table.
5. Wash. Rinse. Repeat.

This problem is really about process-level execution; about enforcing a business requirement on the technological implementation. In order to achieve compliance with business process expectations, it is necessary to view each request in the context of that process rather than as an individual request to be executed in isolation. Thus each touch point in the architecture that needs to manipulate, transform, or perform some task with or on the request needs to take the process into consideration; it needs to be context-aware so that its decisions are made within the context of the entire process and not just the individual request. Layer 7 switching, application load balancing, application delivery – whatever you want to call it, it is the way in which load balancing becomes context-aware and collaborative. It enables business requirements to be not only taken into consideration but enforced, while ensuring that CapEx and OpEx investments in additional systems are not left to sit idle and wasted. It improves capacity without introducing artificial process latency into the equation.
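The persistence logic above can be sketched in a few lines. This is an illustrative model, not any vendor's API: the class name, server labels, and the round-robin choice for new IDs are all assumptions standing in for whatever algorithm the ADC is configured to use.

```python
import itertools
import time

class DocumentIDPersistence:
    """Sketch of document-ID-based persistence: documents sharing an ID
    are routed to the same server for the lifetime of the table entry.
    Illustrative only; a real ADC keeps this table in the data plane."""

    def __init__(self, servers, ttl_seconds=1800):
        self.servers = servers
        self.ttl = ttl_seconds               # e.g. a 30-minute table life
        self.table = {}                      # doc_id -> (server, timestamp)
        self._rr = itertools.cycle(servers)  # stand-in policy for new IDs

    def route(self, doc_id):
        entry = self.table.get(doc_id)
        if entry and time.time() - entry[1] < self.ttl:
            server = entry[0]        # existing "case": reuse the same server
        else:
            server = next(self._rr)  # new "case": pick a server, record it
        self.table[doc_id] = (server, time.time())
        return server
```

Routing two documents with ID "doc-123" through this sketch returns the same server both times, which is exactly the behavior that prevents duplicate "cases" in the scenario above.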
By forcing the process to follow a particular path, the application delivery controller helps the technological implementation meet the goals of the business. In other words, it aligns IT with the business. Sometimes the marketing fluff is more solid than it appears.

To Boldly Go Where No Production Application Has Gone Before
WILS: Network Load Balancing versus Application Load Balancing
Sessions and Cookies and Persistence, oh my!
Persistent and Persistence, What's the Difference?
If Load Balancers Are Dead Why Do We Keep Talking About Them?
A new era in application delivery
Infrastructure 2.0: The Diseconomy of Scale Virus
The Politics of Load Balancing
Business-Layer Load Balancing
Not all application requests are created equal

When (Micro)Seconds Matter
In cloud computing environments the clock literally starts ticking the moment an application instance is launched. How long should that take?

The term "on-demand" implies right now. In the past, we used the term "real-time" even though what we really meant in most cases was "near time," or "almost real-time." The term "elastic," associated with scalability in cloud computing definitions, implies on-demand. One would think, then, that spinning up a new instance of an application with the intent to scale a cloud-deployed application and increase capacity would be a fairly quick-executing task. That doesn't seem to be the case, however.

Dealing with unexpected load is now nothing more than a 10 minute exercise in easy, seamlessly integrating both cloud and data center services.
-- Cloud computing, load balancing, and extending the data center into a cloud, The Server Room

A Twitter straw poll on this subject (completely unscientific) indicated an expectation that this process should (and for many does) take approximately two minutes in many cloud environments. Minutes, not seconds. Granted, even that is still a huge improvement over the time it has taken in the past. Even if the underlying hardware resources are available, there are still all the organizational IT processes that need to be walked through – requests, approvals, allocation, deployment, testing, and finally the actual act of integrating the application with its supporting network and application delivery network infrastructure. It's a time-consuming process, and one of the reasons for all the predictions of business users avoiding IT to deploy applications in "the cloud."

IT capacity planning strategy has been to anticipate the need for additional capacity early enough that the resources are available when the need arises. This has typically resulted in over-provisioning, because it's based on the anticipation of need, not actual demand.
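The cost of a slow spin-up can be made concrete: if a new instance takes T seconds to become ready and load is growing, the scaling decision must be made while current capacity still exceeds the load projected at time now + T. A minimal sketch of that reasoning follows; the function name, numbers, and linear growth model are all illustrative assumptions, not any provider's autoscaling API.

```python
def scale_out_trigger(current_load, capacity_per_instance, instances,
                      growth_rate, spinup_seconds):
    """Decide whether to launch a new instance now.

    Projects load forward by the spin-up time: if demand would exceed
    current capacity before a new instance could come online, trigger now.
    Real autoscalers use richer signals; this shows only the timing math.
    """
    projected_load = current_load + growth_rate * spinup_seconds
    current_capacity = capacity_per_instance * instances
    return projected_load > current_capacity

# With a 2-minute (120 s) spin-up and load growing 5 req/s every second,
# a cluster of 4 instances at 500 req/s each must trigger at 1400 req/s,
# well below its nominal 2000 req/s ceiling.
assert scale_out_trigger(1401, 500, 4, 5, 120) is True
assert scale_out_trigger(1300, 500, 4, 5, 120) is False
```

The longer the spin-up, the earlier (and therefore more speculatively) the trigger must fire, which is precisely why slow provisioning pushes organizations back toward over-provisioning.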
It's based on historical trends that, while likely accurate, may over- or under-estimate the amount of capacity required to meet historical spikes in demand.

Data Center Feng Shui: Process Equally Important as Preparation
Like Subway, too often we fail to recognize that ingredients are only half of a successful recipe. Process is the other half.

The response from sufferers of Celiac Disease (and similar conditions) to Subway's announcement that it was trying out a new, gluten-free version of some of its sandwiches was heavily weighted toward excitement. One of the most frustrating effects of suffering from Celiac's is, of course, a lack of fast and tasty options for mealtime. We simply can't run out to Subway or any other traditional "fast food" restaurant for a bite because, well, most of the menu is laden with gluten, which is a no-no. So for a national chain like Subway to roll out a gluten-free version of its offerings, well, it was like manna from heaven.

Or was it? What most (non-suffering) folks miss about Celiac's (and almost every article in the mainstream press about it) is that the ingredients, the diet, are only half the picture. The other half – the far more difficult half – is the environment in which those ingredients are thrown together and ultimately delivered for consumption. Cross-contamination is probably the primary source of Celiac reactions when eating out. Sufferers know what they can and cannot eat, and restaurants are increasingly aware of such restrictions and quick to provide the information diners need to make decisions about what they order. But what diners have no control over, and what preparers often fail to pay proper attention to, is the environment. A tiny bit of gluten – fragments smaller than you can see, such as the residue from wheat flour that hangs in the air nearly 24 hours after use – can cause a reaction. Reactions, over time, cause permanent damage that can, well, let's just say the prognosis is poor for many of the conditions resulting from that damage.
So while I was happy to see Subway's announcement, I remain unconvinced of its ability to deliver gluten-free food amidst the gluten-infested kitchens that a Subway store necessarily is. Face it: Subway is about subs, bread, which traditionally means wheat – and flour. There are crumbs everywhere, and residue; I've been to Subway, and all you need to do is watch a sandwich being pushed down the line, crumbs flying into everything, to know that someone with Celiac's should simply just say no.

IT organizations, too, often fail to consider the way in which the process by which application delivery ingredients are combined will impact the end-user experience. Ensuring applications are fast, secure, and available requires following a strategic process that applies the proper controls and enhancements at the right time and in the right place to keep the application performing well.

PROCESS, PROCESS, PROCESS

Delivering an application used to be a matter of simply responding to an HTTP request. While in its simplest form that is still true, there are myriad touch points on the path from server back to client that may be required to interact with – or even modify – the data that makes up the response. The trick is to ensure that no touch point along that path impedes performance, security, or availability. Adding value to the delivery of a response should not introduce significant latency, nor impede the ability of other touch points along the path to perform their tasks. The goal should always be to improve the security or performance of the application and its data without contaminating it and causing problems along the way. That's easier said than done, especially in an environment over which you have very little control (such as a public cloud computing or hosted environment) or in which multiple solutions are used to implement a variety of security- and performance-related delivery options.
The order of operations here is important but all too often overlooked. Applying SSL at the web server, for example, instead of at a more strategic point closer to the client, has a significant impact on the delivery of the application. First and foremost, it's inefficient. General-purpose compute is just that – general purpose – and cryptographic computations are resource intensive, benefiting greatly from specialized hardware designed to accelerate exactly those computations. Secondly, encrypting data at the web server means that any other task in the delivery path that needs to inspect and/or act upon that data – such as data leak prevention services – must decrypt, perform its task, and then re-encrypt the data. That not only adds latency, it also requires the server's certificate and key, which requires additional management and adds another potential point at which such sensitive corporate assets might be stolen. If that's not possible, the only other option is to skip that task, because encryption makes the data "invisible," for all intents and purposes, through the remainder of the delivery path.

It's not just security that's impacted. Many acceleration technologies, such as caching and data compression, can be applied at the web server tier or at any of several points along the delivery path. Whether they are beneficial or not is highly dependent on context (network, end-user, and data center conditions) that is not available to the web server. It is often the case that compression and caching are treated as separate entities, deployed individually in disparate components along the path, which also results in a loss of the context necessary to apply such technologies efficiently and adds latency that may eventually offset the performance gains they were intended to supply. The order of operations – the process – turns out to be as important as the ingredients (components) used.
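One concrete instance of this ordering problem: compression must happen before encryption on the delivery path, because encrypted output is statistically random and no longer compresses. A small sketch using Python's standard zlib, with os.urandom standing in for ciphertext (an assumption for illustration; real ciphertext behaves the same way):

```python
import os
import zlib

# A repetitive HTTP-like payload: highly compressible in the clear.
payload = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 100

# Compress-then-encrypt order: deflate finds the redundancy and shrinks it.
compressed_plain = zlib.compress(payload)
assert len(compressed_plain) < len(payload)

# Encrypt-then-compress order: random bytes stand in for ciphertext;
# deflate finds no redundancy and only adds framing overhead.
pseudo_ciphertext = os.urandom(len(payload))
compressed_cipher = zlib.compress(pseudo_ciphertext)
assert len(compressed_cipher) >= len(pseudo_ciphertext)
```

The same logic is why terminating SSL at a strategic point closer to the client, rather than at each web server, lets intermediate services compress, cache, and inspect traffic without repeated decrypt/re-encrypt cycles.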
Just as adherents to the philosophy of Feng Shui believe that where is as important as what, so too can this philosophy be applied to data center architectural strategy. Where you deploy the tools you leverage for security, storage, and application delivery is as important to the health of the data center as which tools you choose. The right process can improve security, enhance performance, and assure availability. The wrong process can negate security, degrade performance, and do nothing at all to help maintain a reliable application.

A RECIPE FOR … SUCCESS

Adherence to process is not just a requirement for "safe" eating for sufferers of Celiac's and other food-related allergies. Any good cook will tell you that ingredients are only part of a delicious meal; the process – the order – by which those ingredients are tossed, basted, mixed, and combined can have a significant impact on the outcome of the dish. This is just as true for technology, especially technology that relies heavily on process over product. Cloud computing is more about process than it is about products; it's about the integration and collaboration that enable automation of operational processes to deliver more consistent results, reduced time to deploy, and decreased administration costs. The difference between a highly virtualized data center and a cloud is, in its simplest form, process. Virtualization platforms are products, but what makes a cloud is the processes – the implementation of automation and ultimately orchestration – that liberate the data center from checklist upon checklist of manual tasks that must be accomplished to deploy, scale, and manage application deployments. The same is true for application delivery. Having the right ingredients is great, but the process by which they are applied and subsequently interact with one another is just as important to ensuring the successful delivery of applications.
Every time an application delivery or data service "contaminates" the application data with latency, it impacts the consumer, the end-user. It makes applications slower and less reliable, and it degrades the end-user experience. The next time you have a problem with an application and decide an application delivery service is the right solution to provide relief, take that opportunity to examine the entire process and ensure that the way in which those ingredients are delivered is not negating the benefits they are designed to provide.

You can learn more about Celiac's Disease (also commonly called Celiac Sprue) by visiting the Celiac Sprue Association.

The Gluten-free Application Network
Knowing is Half the Battle
Putting the Cloud Before the Horse
If You Focus on Products You'll Miss the Cloud
The Order of (Network) Operations
The Zero-Product Property of IT
Like Load Balancing WAN Optimization is a Feature of Application Delivery
F5 Friday: Application Access Control - Code, Agent, or Proxy?
What is a Strategic Point of Control Anyway?
Top-to-Bottom is the New End-to-End