reliability
Building an elastic environment requires elastic infrastructure
One of the reasons behind some folks pushing for infrastructure as virtual appliances is the on-demand nature of a virtualized environment. When network and application delivery infrastructure hits capacity in terms of throughput - regardless of the layer of the application stack at which it happens - it's frustrating to think you might need to upgrade the hardware rather than just add more compute power via a virtual image.

The truth is that this makes sense. The infrastructure supporting a virtualized environment should be elastic. It should be able to dynamically expand without requiring a new network architecture, a higher performing platform, or new configuration. You should be able to just add more compute resources and walk away. The good news is that this is possible today. It just requires that you consider carefully your choices in network and application network infrastructure when you build out your virtualized infrastructure.

ELASTIC APPLICATION DELIVERY INFRASTRUCTURE

Last year F5 introduced VIPRION, an elastic, dynamic application delivery networking platform capable of expanding capacity without requiring any changes to the infrastructure. VIPRION is a chassis-based, bladed application delivery controller, and its bladed system behaves much in the same way that a virtualized equivalent would behave. Say you start with one blade in the system, and soon after you discover you need more throughput and more processing power. Rather than bring online a new virtual image of such an appliance to increase capacity, you add a blade to the system and voila! VIPRION immediately recognizes the blade and simply adds it to its pools of processing power and capacity. There's no need to reconfigure anything; VIPRION essentially treats each blade like a virtual image and distributes requests and traffic across the network and application delivery capacity available on the blade automatically. Just like a virtual appliance model would, but without the concerns about platform reliability and security that virtual appliances raise.

Traditional application delivery controllers can also be scaled out horizontally to provide similar functionality and behavior. By deploying additional application delivery controllers in what is often called an active-active model, you can rapidly deploy and synchronize configuration of the master system to add more throughput and capacity. Meshed deployments comprising more than a pair of application delivery controllers can also provide additional network compute resources beyond what is offered by a single system.

The latter option (the traditional scaling model) requires more work to deploy than the former (VIPRION) simply because it requires additional hardware and all the overhead required of such a solution. The elastic option with bladed, chassis-based hardware is really the best option in terms of elasticity and the ability to grow on-demand as your infrastructure needs increase over time.
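The behavior described above (add a blade or a peer unit and traffic is simply distributed across whatever capacity is present) can be pictured with a small sketch. This is not VIPRION's internal logic or any F5 API, just a minimal, hypothetical round-robin distributor whose capacity grows the moment a new member is registered:

class ElasticPool:
    """Distributes requests across however many members are currently registered."""

    def __init__(self):
        self.members = []     # processing units: blades, peers, or virtual instances
        self._cursor = 0

    def add_member(self, name):
        # Adding capacity requires no reconfiguration of the existing members.
        self.members.append(name)

    def next_member(self):
        if not self.members:
            raise RuntimeError("no capacity available")
        member = self.members[self._cursor % len(self.members)]
        self._cursor += 1
        return member

pool = ElasticPool()
pool.add_member("blade-1")
print([pool.next_member() for _ in range(3)])   # all work lands on blade-1

pool.add_member("blade-2")                      # capacity added on demand
print([pool.next_member() for _ in range(4)])   # traffic now spreads across both blades

The point of the chassis-based model is that the equivalent of add_member happens automatically when a blade is inserted; in the scaled-out ADC model it happens through configuration synchronization across units.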
ELASTIC STORAGE INFRASTRUCTURE

Often overlooked in the network diagrams detailing virtualized infrastructures is the storage layer. The increase in storage needs in a virtualized environment can be overwhelming, as there is a need to standardize the storage access layer such that virtual images of applications can be deployed in a common, unified way regardless of which server they might need to be executing on at any given time. This means a shared, unified storage layer on which to store images that are necessarily large. This unified storage layer must also be expandable. As the number of applications and associated images grows, storage needs increase. What's needed is a system in which additional storage can be added in a non-disruptive manner. If you have to modify the automation and orchestration systems driving your virtualized environment when additional storage is added, you've lost some of the benefits of a virtualized storage infrastructure.

F5's ARX series of storage virtualization provides that layer of unified storage infrastructure. By normalizing the namespaces through which files (images) are accessed, the systems driving a virtualized environment can be assured that images are available via the same access method regardless of where the file or image is physically located. Virtualized storage infrastructure systems are dynamic; additional storage can be added to the infrastructure and "plugged in" to the global namespace to increase the storage available in a non-disruptive manner. An intelligent virtualized storage infrastructure can further make more efficient use of the storage available by tiering the storage. Images and files accessed more frequently can be stored on fast, tier one storage so they are loaded and executed more quickly, while less frequently accessed files and images can be moved to less expensive and perhaps less performant storage systems.

By deploying elastic application delivery network infrastructure instead of virtual appliances you maintain stability, reliability, security, and performance across your virtualized environment. Elastic application delivery network infrastructure is already dynamic, and offers a variety of options for integration into automation and orchestration systems via standards-based control planes, many of which are nearly turn-key solutions. The reasons why some folks might desire a virtual appliance model for their application delivery network infrastructure are valid. But the reality is that the elasticity and on-demand capacity offered by a virtual appliance is already available in proven, reliable hardware solutions today that do not require sacrificing performance, security, or flexibility.

Related articles by Zemanta:
How to instrument your Java EE applications for a virtualized environment
Storage Virtualization Fundamentals
Automating scalability and high availability services
Building a Cloudbursting Capable Infrastructure
EMC unveils Atmos cloud offering
Are you (and your infrastructure) ready for virtualization?
Applying 'Centralized Control, Decentralized Execution' to Network Architecture

#SDN brings to the fore some critical differences between concepts of control and execution.

While most discussions with respect to SDN are focused on a variety of architectural questions (me included) and technical capabilities, there's another very important concept that needs to be considered: control and execution. SDN definitions include the notion of centralized control through a single point of control in the network, a controller. It is through the controller that all important decisions are made regarding the flow of traffic through the network, i.e. execution. This is not feasible, at least not in very large (or even just large) networks. Nor is it feasible beyond simple L2/3 routing and forwarding.

HERE COMES the SCIENCE (of WAR)

There is very little more dynamic than combat operations. People, vehicles, supplies - all are distributed across what can be very disparate locations. One of the lessons the military has learned over time (sometimes quite painfully through experience) is the difference between control and execution. This has led to decisions to employ what is called "Centralized Control, Decentralized Execution."

Joint Publication (JP) 1-02, Department of Defense Dictionary of Military and Associated Terms, defines centralized control as follows: "In joint air operations, placing within one commander the responsibility and authority for planning, directing, and coordinating a military operation or group/category of operations." JP 1-02 defines decentralized execution as "delegation of execution authority to subordinate commanders." Decentralized execution is the preferred mode of operation for dynamic combat operations. Commanders who clearly communicate their guidance and intent through broad mission-based or effects-based orders rather than through narrowly defined tasks maximize that type of execution. Mission-based or effects-based guidance allows subordinates the initiative to exploit opportunities in rapidly changing, fluid situations.
-- Defining Decentralized Execution in Order to Recognize Centralized Execution, Lt Col Woody W. Parramore, USAF, Retired

Applying this to IT network operations means a single point of control is contradictory to the "mission" and actually interferes with the ability of subordinates (strategic points of control) to dynamically adapt to rapidly changing, fluid situations such as those experienced in virtual and cloud computing environments. Not only does a single, centralized point of control (which in the SDN scenario implies control over execution through dynamically configured but rigidly executed policies) abrogate responsibility for adapting to "rapidly changing, fluid situations," but it also becomes the weakest link.

Clausewitz, in the highly read and respected "On War", defines a center of gravity as "the hub of all power and movement, on which everything depends. That is the point against which all our energies should be directed." Most military scholars and strategists logically infer from the notion of a Clausewitzian center of gravity the existence of a critical weak link. If the "controller" in an SDN is the center of gravity, then it follows that it is likely a critical, weak link. This does not mean the model is broken, or poorly conceived of, or a bad idea. What it means is that this issue needs to be addressed. The modern strategy of "Centralized Control, Decentralized Execution" does just that.
Centralized Control, Decentralized Execution in the Network

The major issue with the notion of a centralized controller is the same one air combat operations experienced in the latter part of the 20th century: agility, or more appropriately, lack thereof. Imagine a large network fully adopting an SDN as defined today. A single controller is responsible for managing the direction of traffic at L2-3 across the vast expanse of the data center. Imagine that a node, behind a load balancer, deep in the application infrastructure, fails. The controller must respond and instruct both the load balancing service and the core network how to react, but first it must be notified.

It's simply impossible to recover from a node or link failure in 50 milliseconds (a typical requirement in networks handling voice traffic) when it takes longer to get a reply from the central controller. There's also the "slight" problem of network devices losing connectivity with the central controller if the primary uplink fails.
-- OpenFlow/SDN Is Not A Silver Bullet For Network Scalability, Ivan Pepelnjak (CCIE#1354 Emeritus), Chief Technology Advisor at NIL Data Communications

The controller, the center of network gravity, becomes the weak link, slowing down responses and inhibiting the network (and IT) from responding in a rapid manner to evolving situations. This does not mean the model is a failure. It means the model must evolve to take into consideration the need to react more quickly. This is where decentralized execution comes in, and why predictions that SDN will evolve into an overarching management system rather than an operational one are likely correct.

There exist today, within the network, strategic points of control: locations within the data center architecture at which traffic (data) is aggregated, through which all data is forced to traverse, and from which control over traffic and data is maintained. These locations are where decentralized execution can fulfill the "mission-based guidance" offered through centralized control. Certainly it is advantageous to both business and operations to centrally define and codify the operating parameters and goals of data center networking components (from L2 through L7), but it is neither efficient nor practical to assume that a single, centralized controller can achieve both management and execution of those goals.

What the military learned in its early attempts at air combat operations was that by relying on a single entity to make operational decisions in real time regarding the state of the mission on the ground, missions failed. Airmen, unable to dynamically adjust their actions based on current conditions, were forced to watch situations deteriorate rapidly while waiting for central command (controller) to receive updates and issue new orders. Thus, central command (controller) has moved to issuing mission or effects-based objectives and allowing the airmen (strategic points of control) to execute in a way that achieves those objectives, in whatever way (given a set of constraints) they deem necessary based on current conditions.

This model is highly preferable (and much more feasible given today's technology) to the one proffered today by SDN. It may be that such an extended model can easily be implemented by distributing a number of controllers throughout the network and federating them with a policy-driven control system that defines the mission, but leaves execution up to the distributed control points - the strategic control points.
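One way to picture the distinction is as a sketch: the controller publishes an objective and a set of constraints (mission-based guidance), while each strategic point of control decides locally, based on current conditions, how to meet it. This is illustrative Python only, not any particular SDN controller's API; the objective fields and action names are invented for the example:

# Centralized control: the objective and the constraints are defined once, centrally.
objective = {
    "max_response_time_ms": 50,                                        # what must be achieved
    "allowed_actions": ["reroute", "retry", "shed_low_priority"],      # the constraints
}

class StrategicPointOfControl:
    """A local enforcement point that executes within the published constraints."""

    def __init__(self, name):
        self.name = name

    def execute(self, objective, observed_response_time_ms, primary_path_up):
        # Decentralized execution: decisions are made from local, current conditions,
        # without a round trip to the controller.
        if not primary_path_up and "reroute" in objective["allowed_actions"]:
            return f"{self.name}: rerouting around failed path"
        if (observed_response_time_ms > objective["max_response_time_ms"]
                and "shed_low_priority" in objective["allowed_actions"]):
            return f"{self.name}: shedding low-priority traffic"
        return f"{self.name}: no action needed"

edge = StrategicPointOfControl("adc-1")
print(edge.execute(objective, observed_response_time_ms=80, primary_path_up=True))
print(edge.execute(objective, observed_response_time_ms=30, primary_path_up=False))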
SDN is new, it's exciting, and it's got the potential to be the "next big thing." Like all nascent technologies and models, it will go through some evolutionary massaging as we dig into it and figure out where and why and how it can be used to its greatest potential and organizations' greatest advantage. One thing we don't want to do is replicate erroneous strategies of the past. No network model abrogating all control over execution has ever really worked. All successful models have been distributed, federated models in which control may be centralized, but execution is decentralized.

Can we improve upon that? I think SDN does in its recognition that static configuration is holding us back. But its decision to rein in all control while addressing that issue may very well give rise to new issues that will need resolution before SDN can become a widely adopted model of networking.

QoS without Context: Good for the Network, Not So Good for the End user
Cyclomatic Complexity of OpenFlow-Based SDN May Drive Market Innovation
SDN, OpenFlow, and Infrastructure 2.0
OpenFlow/SDN Is Not A Silver Bullet For Network Scalability
Prediction: OpenFlow Is Dead by 2014; SDN Reborn in Network Management
OpenFlow and Software Defined Networking: Is It Routing or Switching?
Cloud Security: It's All About (Extreme Elastic) Control
Ecosystems are Always in Flux
The Full-Proxy Data Center Architecture
Performance vs Reliability

Our own Deb Allen posted a great tech tip yesterday on conditional logic using HTTP::retry with iRules. This spawned a rather heated debate between Don and myself on the importance of performance versus reliability in application delivery, specifically with BIG-IP.

Performance is certainly one of the reasons for implementing an application delivery network with an application delivery controller like BIG-IP as its foundation. As an application server becomes burdened by increasing requests and concurrent users, it can slow down as it tries to balance connection management with execution of business logic, parsing data, executing queries against a database, and making calls to external services. Application servers, I think, have got to be the hardest working software in IT land. As capacity is reached, performance is certainly the first metric to slide down the tubes, closely followed by reliability. First the site is just slow, then it becomes unpredictable - one never knows if a response will be forthcoming or not. While slow is bad, unreliable is even worse.

At this point it's generally the case that a second server is brought in and an application delivery solution implemented in order to distribute the load and improve both performance and reliability. This is where F5 comes in, by the way. One of the things we at F5 are really good at doing is ensuring reliability while maintaining performance. One of the ways we can do that is through the use of iRules and specifically the HTTP::retry call. Using this mechanism we can ensure that even if one of the application servers fails to respond or responds with the dreaded 500 Internal Server Error, we can retry the request to another server and obtain a response for the user.

As Deb points out (and Don pointed out offline vehemently as well) this mechanism [HTTP::retry] can lead to increased latency. In layman's terms, it slows down the response to the customer because we had to make another call to a second server to get a response on top of the first call. But assuming the second call successfully returns a valid response, we have at a minimum maintained reliability and served the user.

So the argument is: which is more important, reliability or speed? Well, would you rather it take a bit longer to drive to your destination or run into the ditch really fast? The answer is up to you, of course, and it depends entirely on your business and what it is you are trying to do. Optimally, of course, we would want the best of both worlds: we want fast and reliable.

Using an iRule gives you the best of both worlds. In most cases, that second call isn't going to happen because it's likely BIG-IP has already determined the state of the servers and won't send a request to a server it knows is responding either poorly or not at all. That's why we implement advanced health checking features. But in the event that the server began to fail (in the short interval between health checks) and BIG-IP sent the request on before noticing the server was down, the iRule lets you ensure that the request can be resent to another server with minimal effort. Think of this particular iRule as a kind of disaster recovery plan - it will only be invoked if a specific sequence of events that rarely happens actually happens. It's reliability insurance. The mere existence of the iRule doesn't impact performance, and latency is only added if the retry is necessary, which should be rarely, if ever.
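The actual mechanism is a Tcl iRule on BIG-IP (the HTTP::retry usage Deb's tech tip covers); the sketch below simply models the decision logic in Python to show where the extra latency comes from: a second call happens only when the first response is missing or is a 500. The function names and the two pool members are hypothetical:

def fetch(server, request):
    """Stand-in for sending a request to a pool member; returns (status, body) or None on no response."""
    return server(request)

def reliable_fetch(request, primary, alternates):
    # First attempt: the normal, fast path.
    response = fetch(primary, request)
    if response is not None and response[0] != 500:
        return response                      # no retry, no added latency

    # Retry path: only taken when the first answer is missing or a 500,
    # trading a little latency for a successful response.
    for server in alternates:
        response = fetch(server, request)
        if response is not None and response[0] != 500:
            return response
    return response                          # every attempt failed; return the last result

failing = lambda req: (500, "Internal Server Error")    # hypothetical unhealthy member
healthy = lambda req: (200, "OK")                       # hypothetical healthy member

print(reliable_fetch("GET /", primary=failing, alternates=[healthy]))   # (200, 'OK'), one retry
print(reliable_fetch("GET /", primary=healthy, alternates=[failing]))   # (200, 'OK'), no retry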
So the trade-off between performance and reliability isn't quite as cut and dried as it sounds. You can have both; that's the point of including an application delivery controller as part of your infrastructure. Performance and reliability are not at all at odds, at least not in BIG-IP land, and they should never be conflicting goals regardless of your chosen application delivery solution.

Imbibing: Pink Lemonade

Technorati tags: performance, MacVittie, F5, BIG-IP, iRule, reliability, application delivery
Integration Topologies and SDN

The scalability issue with the #SDN model today isn't that the data plane won't scale... it's whether the control plane will.

Reading "OpenFlow/SDN Is Not A Silver Bullet For Network Scalability" brought to light an important note that must be made regarding scalability and networks, especially when we start talking about the control plane. It isn't that the network itself won't scale well with SDN; the concern is - or should be - on the control side, on whether or not the integration of the control plane with the data plane will scale.

A core characteristic of SDN is not only the separation of the control and data planes, but that the control plane is centralized. There can be only one. The third characteristic that is important to SDN is the integration of these decoupled data plane devices with the control plane via APIs (Mike Fratto does an excellent job of discussing the importance of API support as well as making the very important distinction between API and SDK in his recent blog, "Three Signs of SDN Support to Watch for from Vendors", so I won't belabor this point right now).

The convergence of these three characteristics results in what Enterprise Application Integration (EAI) has long known as a "hub and spoke" integration pattern. A hub - in the case of SDN, the controller - sits in the middle of a set of systems - in the case of SDN, the data plane devices - and is the center of the universe. The problem with this pattern, and why bus topologies rose to take its place, is that it doesn't scale well. There is always only one central node, and it must necessarily manage and communicate with every other node in the integration. While hub-and-spoke, which grows linearly, isn't nearly as difficult to scale as its predecessor the spaghetti (mesh) pattern, which grows exponentially, in a network even linear growth is going to be problematic for some value of n (where n is the number of edges, i.e. nodes, in the network).

At some value of n, it becomes apparent that the controller (hub) must be able to scale, too. Scaling up would require expansion of the system upon which the controller is deployed, which may require replacement. You can imagine the reluctance of operations to essentially shut down the entire network while that occurs. The other option is to scale out, via the traditional methods used to scale other systems: a load balancing service and duplicate instances. This implies a shared-something architecture, usually described as being the database or repository for policy from which nodes are "programmed" by the controller. This appears to be the response in existing implementations, with "clusters of controllers" providing the scale and resiliency required. So scaling out the controller, then, becomes an exercise in traditional scalability methods used to scale out client-server architectures.
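The growth rates are easy to see with a little arithmetic: a hub-and-spoke integration needs one link per node, while a fully meshed one needs a link between every pair of nodes. A quick sketch, with purely illustrative node counts:

def hub_and_spoke_links(n):
    # One link from each data plane element to the central controller (hub).
    return n

def full_mesh_links(n):
    # One link between every pair of elements.
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(f"{n:>5} nodes: hub-and-spoke = {hub_and_spoke_links(n):>6}   full mesh = {full_mesh_links(n):>7}")

Even though the hub-and-spoke count stays small, every one of those links terminates on the same hub, which is why the controller itself eventually becomes the scaling problem.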
So the Control Plane of an SDN Can Scale. What's the Problem then?

As pointed out by Ivan Pepelnjak in "OpenFlow/SDN Is Not A Silver Bullet For Network Scalability", the problem with this model appears to be response time. Failure in a node cannot be addressed fast enough by a centralized software system, particularly not one that relies on a database (which has its own scalability issues, of course). There are several questions that must be answered in order to even deal with failure, and they pose some interesting performance and scaling challenges.

How does the controller know that an element node is "down"? Is it polling, which introduces an entirely new concern regarding the level of monitoring noise on the network interfering with business-related traffic? Is it monitoring a persistent control-channel connection between the controller and the node? Certainly this would indicate nearly instantaneously the status of the node, but it introduces scaling challenges, as maintaining even a one-to-one control-channel connection per element node in the network would consume large quantities of memory (and ultimately have a negative impact on performance, requiring scale out much sooner than may be otherwise necessary). Does a neighbor or upstream element tattle on the downed node when it doesn't respond? There are a variety of mechanisms that could be used to monitor the network such that the controller is informed of a failure, but each brings with it new challenges and has a different responsiveness profile. Polling is tricky, as any load balancing provider will tell you, because it's based on a timed interval. Persistent connections, as noted earlier, bring scalability challenges back to the table. Tattle-tale methodologies are unreliable, requiring that a neighbor or upstream element have a need to "talk to" the downed node before notification can occur, leaving open the possibility of a downed node going unnoticed until it's too late.

How does the controller respond to a downed element node? Obviously the controller needs to "route around" or "detour" traffic until a replacement can be deployed (virtually or physically). This no doubt requires some calculations to determine the best route (OSPF anyone?) if done in real time. Some have suggested alternative routes in tables be available on each node in the event of a failure, a model more closely related to today's existing routing practices and one that would certainly respond much better to failure in the network than would a system in which the controller must discover and reconfigure the network to adjust to failures.

What happens to existing flows when an element node fails? Ah, the age-old stateful failure challenge. This is one that is (almost) solved with redundant architectures that mirror sessions (flows) to a secondary device. The problem is that these models work best, i.e. have the highest levels of success, for full-proxy devices, particularly when the flow supports stateful/connection-oriented protocols.

These questions are nothing new to experienced EAI practitioners who've had to suffer through a hub-and-spoke based integration effort. Failure in a node or of the controller gives rise to painful fire-drill exercises, the likes of which no one really enjoys because they are highly disruptive. They're also not really new questions for those with a long history in load balancing and high availability architectures. Still, these are questions which need to be answered in the context of the network, which has somewhat different uptime and performance requirements than even applications. Ultimately the answer is going to lie in architecture, and it's unlikely that what results will be a single, centrally controlled one.
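The trade-offs among those detection mechanisms can be roughed out numerically. The figures below are invented assumptions, not measurements: polling detection time is bounded by the polling interval plus a timeout and its noise grows with the number of elements, while persistent connections detect failures quickly but hold per-node state on the controller.

def polling_profile(nodes, interval_s, timeout_s, probe_bytes=200):
    # Worst case: the node fails immediately after answering a probe.
    worst_case_detection_s = interval_s + timeout_s
    probes_per_second = nodes / interval_s
    monitoring_bps = probes_per_second * probe_bytes * 8
    return worst_case_detection_s, monitoring_bps

def persistent_connection_profile(nodes, per_connection_kb=40):
    # Detection is near-immediate (connection reset), but state is held per node.
    controller_memory_mb = nodes * per_connection_kb / 1024
    return "sub-second", controller_memory_mb

nodes = 5000                                   # hypothetical element count
detect_s, noise_bps = polling_profile(nodes, interval_s=5, timeout_s=2)
print(f"polling: worst-case detection ~{detect_s}s, ~{noise_bps / 1e6:.1f} Mbps of probe traffic")

detect, mem_mb = persistent_connection_profile(nodes)
print(f"persistent connections: detection {detect}, ~{mem_mb:.0f} MB of controller connection state")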
QoS without Context: Good for the Network, Not So Good for the End user
Applying 'Centralized Control, Decentralized Execution' to Network Architecture
F5 Friday: ADN = SDN at Layer 4-7
SDN, OpenFlow, and Infrastructure 2.0
Cyclomatic Complexity of OpenFlow-Based SDN May Drive Market Innovation
Five needs driving SDNs

When Black Boxes Fail: Amazon, Cloud and the Need to Know

If Amazon's Availability Zone strategy had worked as advertised, its outage would have been non-news. But then again, no one really knows what was advertised...

There's been a lot said about the Amazon outage, and most of it had to do with cloud and, as details came to light, with EBS (Elastic Block Storage). But very little mention was made of what should be obvious: most customers didn't - and still don't - know how Availability Zones really work and, more importantly, what triggers a fail over. What's worse, what triggers a fail back? Amazon's documentation is light. Very light. Like cloud light.

Availability Zones are distinct locations that are designed to be insulated from failures in other zones. This allows you to protect your applications from possible failures in a single location. Availability Zones also provide inexpensive, low latency network connectivity to other Availability Zones in the same Region.
-- Announcement: New Amazon EC2 Availability Zone in US East

Now, it's been argued that part of the reason Amazon's outage was so impactful was a lack of meaningful communication on the part of Amazon regarding the outage. And Amazon has admitted to failing in that regard and has promised to do better in the future. But just as impactful is likely the lack of communication before the outage; before deployment. After all, the aforementioned description of Availability Zones, upon which many of the impacted customers were relying to maintain availability, does not provide enough information to understand how Availability Zones work, nor how they are architected.

TRIGGERING FAIL-OVER

Scrounging around the Internet turns up very little on how Availability Zones work, other than isolating instances from one another and being physically independent for power. In the end, what was promised was a level of isolation that would mitigate the impact of an outage in one zone. Turns out that didn't work out so well, but more disconcerting is that there is still no explanation regarding what kind of failure - or conditions - result in a fail over from one Availability Zone to another.

Global Application Delivery Services (the technology formerly known as Global Server Load Balancing) is a similar implementation generally found in largish organizations. Global application delivery can be configured to "fail over" in a variety of ways, based on operational or business requirements, with the definition of "failure" being anything from "OMG the data center roof fell in and crushed the whole rack!" to "Hey, this application is running a bit slower than it should, can you redirect folks on the east coast to another location? Kthanxbai." It allows failover in the event of failure and redirection based on failure to meet operational and business goals. It's flexible and allows the organization to define "failure" based on its needs and on its terms.
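To make "define failure on your own terms" concrete, here is a minimal sketch of what organization-defined triggers might look like. The thresholds and the site structure are invented for illustration and say nothing about how Amazon actually implements Availability Zone failover:

def should_fail_over(site):
    """Returns (decision, reason) based on operationally defined triggers."""
    # Trigger 1: hard failure; the site is simply down.
    if not site["reachable"]:
        return True, "site unreachable"
    # Trigger 2: performance degradation beyond the business threshold.
    if site["response_time_ms"] > 2000:
        return True, "response time exceeds 2000 ms"
    # Trigger 3: capacity exhaustion, e.g. less than 20% of resources left.
    if site["available_capacity_pct"] < 20:
        return True, "less than 20% capacity remaining"
    return False, "healthy"

primary = {"reachable": True, "response_time_ms": 2400, "available_capacity_pct": 55}
decision, reason = should_fail_over(primary)
if decision:
    print(f"redirecting traffic to the secondary site: {reason}")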
But no such information appears to exist for Amazon's Availability Zones, and we are left to surmise that it's likely based on something more akin to the former rather than the latter, given their rudimentary ELB (Elastic Load Balancing) capabilities. When organizations architect and subsequently implement disaster recovery or high-availability initiatives - and make no mistake, that's what using multiple Availability Zones on Amazon is really about - they understand the underlying mechanism and triggers that cause a "failover" from one location to another. This is the critical piece of information, of knowledge, that's missing.

In an enterprise-grade high-availability architecture it is often the case that such triggers are specified by both business and operational requirements and may include performance degradation as well as complete failure. Such triggers may be based on a percentage of available resources, or other similar resource capacity constraints. Within an Amazon Availability Zone, apparently, the trigger is a "failure", but a failure of what is left to our imagination.

But also apparently missing was a critical step in any disaster recovery/high availability plan: testing. Enterprise IT not only designs and implements architectures based on reacting to a failure, but actually tests that plan. It's often the case that such plans are tested on a yearly basis, just to ensure all the moving parts still work as expected. Relying on Amazon - or any cloud computing environment in which resources are shared - makes it very difficult to test such a plan. After all, in order to test failover from one Availability Zone to another, Amazon would have to forcibly bring down an Availability Zone - and every application deployed within it. Consider how disruptive that process might be if customers started demanding such tests on their schedule. Obviously this is not conducive to Amazon maintaining its end of the uptime bargain for those customers not deployed in multiple Availability Zones.

AVAILABILITY and RELIABILITY: CRITICAL CONCERNS BORNE OUT by AMAZON FAILURE

Without the ability to test such plans, we return to the core issue - trust. Organizations relying wholly on cloud computing providers must trust the provider explicitly. And that generally means someone in the organization must understand how things work. Black boxes should not be invisible boxes, and the innate lack of visibility into the processes and architecture of cloud computing providers will eventually become as big a negative as security was once perceived to be. Interestingly enough, a 2011 IDG report on global cloud computing adoption shows that high performance (availability and reliability) is the most important concern - 5% higher than security. Amazon's epic failure will certainly do little to alleviate concerns that public cloud computing is not ready for mission critical applications.

Now, it's been posited that customers were at fault for trusting Amazon in the first place. Others laid the blame solely on the shoulders of Amazon. Both are to blame, but Amazon gets to shoulder a higher share of that blame, if for no other reason than it failed to provide the information necessary for customers to make an informed choice regarding whether or not to trust their implementation. This has been a failing of Amazon's since it first began offering cloud computing services - it has consistently failed to offer the details necessary for customers to understand how basic processes are implemented within its environment. And with no ability to test failover across Availability Zones, organizations are at fault for trusting a technology without understanding how it works. What's worse, many simply didn't even care - until last week.

Now it may be the case that Amazon is more than willing to detail such information to customers; that it has adopted a "need to know" policy regarding its architecture and implementation and its definition of "failure." If that is the case, then it behooves customers to ask before signing on the dotted line.
Because customers do need to know the details to ensure they are comfortable with the level of reliability and high-availability being offered. If that's not the case, or customers are not satisfied with the answers, then it behooves them to - as has been suggested quite harshly by many cloud pundits - implement alternative plans that involve more than one provider.

A massive failure on the part of a public cloud computing provider was bound to happen eventually. If not Amazon then Azure, and if not Azure then it would have been someone else. What's important now is to take stock and learn from the experience - both providers and customers - such that a similar event does not result in the same epic failure again. Two key points stand out:

1. The need to understand how services work when they are provided by a cloud computing provider. Black box mentality is great marketing (hey, no worries!) but in reality it's dangerous to the health and well-being of applications deployed in such environments because you have very little visibility into what's really going on. The failure to understand how Amazon's Availability Zones actually worked - and exactly what constituted "isolation" aside from separate power sources, as well as what constitutes a "failure" - lies strictly on the customer. Someone within the organization needs to understand how such systems work from the bottom to the top to ensure that such measures meet requirements.

2. The need to implement a proper disaster recovery / high availability architecture and test it. Proper disaster recovery / high availability architectures are driven by operational goals, which are driven by business requirements. A requirement for 100% uptime will likely never be satisfied by a single site, regardless of provider claims. And if you can't test it but need to guarantee high uptime, and can't get the details necessary to understand - and trust - the implementation to provide that uptime, perhaps it's not the right choice in the first place. Or perhaps you just need to demand an explanation.

Bye, Bye My Clustered AMIs… A Cloud Tribute to Don McLean
Amazon Web Services apologizes, promises better communication, transparency
Cloud Failure, FUD, and The Whole AWS Oatage…
They're Called Black Boxes Not Invisible Boxes
Dynamic Infrastructure: The Cloud within the Cloud
How to Earn Your Data Center Merit Badge
Disaster Recovery: Not Just for Data Centers Anymore
Cloud Control Does Not Always Mean 'Do it yourself'
Control, choice, and cost: The conflict in the cloud
F5 Friday: Ops First Rule

#cloud #microsoft #iam "An application is only as reliable as its least reliable component"

It's unlikely there's anyone in IT today who doesn't understand the role of load balancing in scalability. Whether cloud or not, load balancing is the key mechanism through which load is distributed to ensure horizontal scale of applications. It's also unlikely there's anyone in IT who doesn't understand the relationship between load balancing and high availability (reliability). High-availability (HA) architectures are almost always implemented using load balancing services to ensure seamless transition from one service instance to another in the event of a failure.

What's often overlooked is that scalability and HA aren't important just for applications. Services - whether application or network-focused - must also be reliable. It's the old "only as strong as the weakest link in the chain" argument. An application is only as reliable as its least reliable component - and that includes services and infrastructure upon which that application relies. It is - or should be - the ops first rule; the rule that guides the design of data center architectures.

This requirement becomes more and more obvious as emerging architectures combining the data center and cloud computing are implemented, particularly when federating identity and access services. That's because it is desirable to maintain control over the identity and access management processes that authenticate and authorize use of applications no matter where they may be deployed. Such an architecture relies heavily on the corporate identity store as the authoritative source of both credentials and permissions. This makes the corporate identity store a critical component in the application dependency chain, one that must necessarily be made as reliable as possible. Which means you need load balancing. A good example of how this architecture can be achieved is found in BIG-IP load balancing support for Microsoft's Active Directory Federation Services (AD FS).

AD FS and F5 Load Balancing

Microsoft's Active Directory Federation Services (AD FS) server role is an identity access solution that extends the single sign-on (SSO) experience for directory-authenticated clients (typically provided on the intranet via Kerberos) to resources outside of the organization's boundaries, such as cloud computing environments. To ensure high availability, performance, and scalability, the F5 BIG-IP Local Traffic Manager (LTM) can be deployed to load balance an AD FS server farm. There are several scenarios in which BIG-IP can load balance AD FS services.

1. To enable reliability of AD FS for internal clients accessing external resources, such as those hosted in Microsoft Office 365. This is the simplest of architectures and the most restrictive in terms of access for end-users, as it is limited to only internal clients.

2. To enable reliability of AD FS and AD FS proxy servers, which provide external end-user SSO access to both internal federation-enabled resources as well as partner resources like Microsoft Office 365. This is a more flexible option, as it serves both internal and external clients.

3. BIG-IP Access Policy Manager (APM) can replace the need for the AD FS proxy servers required for external end-user SSO access, which eliminates another tier and enables pre-authentication at the perimeter, offering both the flexibility required (supporting both internal and external access) as well as a more secure deployment.
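Whichever scenario is chosen, the underlying reliability mechanism is the same: a health monitor marks farm members up or down, and the load balancing service only hands connections to members that are currently up. The sketch below is a generic illustration of that behavior with made-up hostnames; it is not BIG-IP configuration syntax:

class MonitoredPool:
    """Selects only members the health monitor currently considers available."""

    def __init__(self, members):
        self.status = {m: True for m in members}   # True = passing its health monitor
        self._cursor = 0

    def mark(self, member, healthy):
        self.status[member] = healthy

    def select(self):
        healthy = [m for m, up in self.status.items() if up]
        if not healthy:
            raise RuntimeError("no AD FS members available")
        member = healthy[self._cursor % len(healthy)]
        self._cursor += 1
        return member

farm = MonitoredPool(["adfs1.example.local", "adfs2.example.local", "adfs3.example.local"])
farm.mark("adfs2.example.local", healthy=False)    # the monitor detects a failed member

# Authentication requests continue to flow, transparently skipping the failed member.
print([farm.select() for _ in range(4)])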
In all three scenarios, F5 BIG-IP serves as a strategic point of control in the architecture, assuring reliability and performance of the services upon which applications are dependent, particularly those of authentication and authorization. Using BIG-IP APM instead of AD FS proxy servers both simplifies the architecture and makes it more agile. This is because BIG-IP APM is inherently more programmable and flexible in terms of policy creation. BIG-IP APM, being deployed on the BIG-IP platform, can take full advantage of the context in which requests are made, ensuring that identity and access control go beyond simple credentials and take into consideration device, location, and other contextual clues that enable a more secure system of authentication and authorization.

High availability - and ultimately scalability - is preserved for all services by leveraging the core load balancing and HA functionality of the BIG-IP platform. All components in the chain are endowed with HA capabilities, making the entire application more resilient and able to withstand minor and major failures.

Using BIG-IP LTM for load balancing AD FS serves as an adaptable and extensible architectural foundation for a phased deployment approach. As a pilot phase, rolling out AD FS services for internal clients only makes sense, and is the simplest in terms of its implementation. Using BIG-IP as the foundation for such an architecture enables further expansion in subsequent phases, such as introducing BIG-IP APM in a phase two implementation that brings flexibility of access location to the table. Further enhancements can then be made regarding access when context is included, enabling more complex and business-focused access policies to be implemented. Time-based restrictions on clients or location can be deployed and enforced, as is desired or needed by operations or business requirements.

Reliability is a Least Common Factor Problem

Reliability must be enabled throughout the application delivery chain to ultimately ensure the reliability of each application. Scalability is further paramount for those dependent services, such as identity and access management, that are intended to be shared across multiple applications. While certainly there are many other load balancing services that could be used to enable reliability of these services, an extensible and highly scalable platform such as BIG-IP is required to ensure both reliability and scalability of shared services upon which many applications rely. The advantage of a BIG-IP-based application delivery tier is that its core reliability and scalability services extend to any of the many services that can be deployed. By simplifying the architecture through application delivery service consolidation, organizations further enjoy the benefits of operational consistency that keep management and maintenance costs down. Reliability is a least common factor problem, and the Ops First Rule should be applied when designing a deployment architecture to assure that all services in the delivery chain are as reliable as they can be.

F5 Friday: BIG-IP Solutions for Microsoft Private Cloud
BYOD - The Hottest Trend or Just the Hottest Term
The Four V's of Big Data
Hybrid Architectures Do Not Require Private Cloud
The Cost of Ignoring 'Non-Human' Visitors
Complexity Drives Consolidation
What Does Mobile Mean, Anyway?
At the Intersection of Cloud and Control…
Cloud Bursting: Gateway Drug for Hybrid Cloud
Identity Gone Wild! Cloud Edition
The Colonial Data Center and Virtualization
No, not colonial as in Battlestar Galactica or the British Empire; colonial as in corals and weeds and virtual machines.

I was out pulling weeds this summer - Canada thistle to be exact - and was struck by how much its root system reminded me of Cnidaria (soft corals, to those of you whose experience with aquaria remains relegated to suicidal goldfish). Canada thistle is difficult to control because of its extensive root system. Pulling a larger specimen you often find yourself pulling up its root, only to find it connected to three, four or more other specimens. Cnidaria reproduce in a similar fashion, sharing a "root" system that enables them to share resources. Unlike thistles, however, Cnidaria have several different growth forms. There's a traditional colonial form that resembles thistles - a single, shared long root with various specimens popping up along the path - and one that may be familiar to folks who've seen Finding Nemo: a tree formation in which the root branches not only horizontally but vertically, with individual specimens forming upwards along the branch in what gives it a tree-like appearance.

Cnidaria produce a variety of colonial forms, each of which is one organism but consists of polyp-like zooids. The simplest is a connecting tunnel that runs over the substrate (rock or seabed) and from which single zooids sprout. In some cases the tunnels form visible webs, and in others they are enclosed in a fleshy mat. More complex forms are also based on connecting tunnels but produce "tree-like" groups of zooids. The "trees" may be formed either by a central zooid that functions as a "trunk" with later zooids growing to the sides as "branches", or in a zig-zag shape as a succession of zooids, each of which grows to full size and then produces a single bud at an angle to itself. In many cases the connecting tunnels and the "stems" are covered in periderm, a protective layer of chitin. [6] Some colonial forms have other specialized types of zooid, for example, to pump water through their tunnels. [12]
-- Wikipedia, Cnidaria

Of course, the notion of colonial inter-dependence shared by both thistle and Cnidaria is one that's shared by the data center, too. Virtual machines deployed on the same physical host replicate in many ways the advantages and disadvantages of a Cnidarian tree formation. The close proximity of the 15.6 average VMs per host (according to Vkernel VMI 2012) allows them to share the "local" (virtual) network, which eliminates many of the physical sources of network latency that occur naturally in the data center. But it also means that a failure in the physical network connecting them to the network backbone is catastrophic for all VMs on a given host.

Which is why you want to pay careful attention to placement of VMs in a dynamic data center. The concept of pulling compute resources from anywhere in the data center to support scalability on-demand is a tantalizing one, but doing so can have disastrous results in the event of a catastrophic failure in the network. Architecture and careful planning are necessary to ensure that resources do not end up grouped in such a way that a failure in one negatively impacts the entire application. Proximity must be considered as part of a fault isolation strategy, which is a requirement when resources are loosely - if at all - coupled to specific locations within the data center.
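A simple placement check captures the proximity concern: before adding another instance of an application to a host, verify that the application's instances are not all ending up behind the same physical failure domain. The sketch below is illustrative only (real orchestration systems express this as anti-affinity rules), and the host names and threshold are invented:

from collections import Counter

def violates_fault_isolation(placements, app, candidate_host, max_share=0.5):
    """True if placing another instance of `app` on `candidate_host` would put more
    than `max_share` of that app's instances in a single failure domain."""
    hosts = placements.get(app, []) + [candidate_host]
    _, count = Counter(hosts).most_common(1)[0]
    return count / len(hosts) > max_share

# Hypothetical current placements: two of app1's instances already share host-a.
placements = {"app1": ["host-a", "host-a", "host-b"]}

print(violates_fault_isolation(placements, "app1", "host-a"))  # True: too many eggs in one basket
print(violates_fault_isolation(placements, "app1", "host-c"))  # False: spreads the risk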
Referenced blogs & articles:
Wikipedia, Cnidaria
Virtualization Management Index: Issues 1 and 2
Back to Basics: Load balancing Virtualized Applications
Digital is Different
The Cost of Ignoring 'Non-Human' Visitors
Cloud Bursting: Gateway Drug for Hybrid Cloud
The HTTP 2.0 War has Just Begun
Why Layer 7 Load Balancing Doesn't Suck
Network versus Application Layer Prioritization
Complexity Drives Consolidation
Performance in the Cloud: Business Jitter is Bad
The Number of the Counting Shall be Three (Rules of Thumb for Application Availability)

Three shall be the number thou shalt count, and the number of the counting shall be three. If you're concerned about maintaining application availability, then these three rules of thumb shall be the number of the counting. Any less and you're asking for trouble.

I like to glue animals to rocks and put disturbing amounts of electricity and saltwater NEXT TO EACH OTHER

Last week I was checking out my saltwater reef when I noticed water lapping at the upper edges of the tank. Yeah, it was about to overflow. Somewhere in the system something had failed. Not entirely, but enough to cause the flow in the external sump system to slow to a crawl and the water levels in the tank to slowly rise. Troubleshooting that was nearly as painful as troubleshooting the cause of application downtime. As with a data center, there are ingress ports and egress ports and inline devices (protein skimmers) that have their own flow rates (bandwidth) and gallons-per-hour processing capabilities (capacity) and filtering (security). When any one of these pieces of the system fails to perform optimally, well, the entire system becomes unreliable, unstable, and scary as hell. Imagine a hundred or so gallons of saltwater (and all the animals inside) floating around on the floor. Near electricity.

The challenges to maintaining availability in a marine reef system are similar to those in an application architecture. There are three areas you really need to focus on, and you must focus on all three, because failing to address any one of them can cause an imbalance that may very well lead to an epic fail.

RELIABILITY

Reliability is the cornerstone of assuring application availability. If the underlying infrastructure - the hardware and software - fails, the application is down. Period. Any single point of failure in the delivery chain - from end to end - can cause availability issues. The trick to maintaining availability, then, is redundancy. It is this facet of availability where virtualization most often comes into play, at least from the application platform / host perspective. You need at least two instances of an application, just in case. Now, one might think that as long as you have the capability to magically create a secondary instance and redirect application traffic to it if the primary application host fails, you're fine. You're not. Creation, boot, load time... all impact downtime, and in some cases every second counts. The same is true of infrastructure. It may seem that as long as you could create, power up, and redirect traffic to a virtual instance of a network component, availability would be sustained, but the same timing issues that plague applications will plague the network as well. There really is no substitute for redundancy as a means to ensure the reliability necessary to maintain application availability. Unless you find prescient, psychic components (or operators) capable of predicting an outage at least 5-10 minutes before it happens. Then you've got it made.

Several components are often overlooked when it comes to redundancy and reliability. In particular, internet connectivity is often ignored as a potential point of failure or, more often the case, it is viewed as one of those "things beyond our control" in the data center that might cause an outage. Multiple internet connections are expensive, understood. That's why leveraging a solution like link load balancing makes sense.
If you've got multiple connections, why not use them both, and use them intelligently - to assist in efforts to maintain or improve application performance, or to prioritize application traffic in and out of the data center? Doing so allows you to assure availability in the event that one connection fails, yet the connections never sit idle when things are all hunky dory in the data center.

The rule of thumb for reliability is this: Like Sith lords, there should always be two of everything, with automatic failover to the secondary if the primary fails (or is cut down by a Jedi knight).

CAPACITY

The most common cause of downtime is probably a lack of capacity. Whether it's due to a spike in usage (legitimate or not) or simply unanticipated growth over time, a lack of compute resources available across the application infrastructure tiers is usually the cause of unexpected downtime. This is certainly one of the drivers for cloud computing and rapid provisioning models - external and internal - as it addresses the immediacy of the need for capacity when availability fails. This is particularly true in cases where you actually have the capacity - it just happens to reside physically on another host. Virtualization and cloud computing models allow you to co-opt that idle capacity and give it to the applications that need it, on demand. That's the theory, anyway. The reality is that there are also timing issues around provisioning that must be addressed, but these are far less complicated and require fewer psychic powers than predicting the total failure of a component.

Capacity planning is as much art as science, but it is primarily based on real numbers that can be used to indicate when an application is nearing capacity. Because of this predictive power of monitoring and data, provisioning of additional capacity can be achieved before it's actually needed. Even without automated systems for provisioning, this method of addressing capacity can be leveraged - the equations for when provisioning needs to begin simply change based on the amount of time needed to manually provision the resources and integrate them with the scalability solution (i.e., the load balancer, the application delivery controller).

The rule of thumb for capacity is this: Like interviews and special events, unless you're five minutes early provisioning capacity, you're late.
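The "five minutes early" rule can be written down: start provisioning when the utilization you project at now-plus-lead-time crosses your threshold, where the lead time is however long it takes to provision the resource and integrate it with the load balancer. The numbers below are invented for illustration:

def should_provision_now(current_util_pct, growth_pct_per_min, lead_time_min, threshold_pct=80):
    # Project utilization at the moment the new capacity could actually be in service.
    projected = current_util_pct + growth_pct_per_min * lead_time_min
    return projected >= threshold_pct

# Automated provisioning: roughly 5 minutes from trigger to serving traffic.
print(should_provision_now(current_util_pct=60, growth_pct_per_min=2.5, lead_time_min=5))    # False: still time
print(should_provision_now(current_util_pct=70, growth_pct_per_min=2.5, lead_time_min=5))    # True: start now

# Manual provisioning: roughly 45 minutes, so the trigger has to fire far earlier.
print(should_provision_now(current_util_pct=70, growth_pct_per_min=2.5, lead_time_min=45))   # True: already late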
SECURITY

Security - or lack thereof - is likely the most overlooked root cause of availability issues, especially in today's hyper-connected environments. Denial of service attacks are just that, an attempt to deny service to legitimate users, and they are getting much harder to detect because they've been slowly working their way up the stack. Layer 7 DDoS attacks are particularly difficult to ferret out, as they don't necessarily have to even be "fast", they just have to chew up resources. Consider the latest twist on the SlowLoris attack: the attack takes the form of legitimate POST requests that s-l-o-w-l-y feed data to the server, in a way that consumes resources but doesn't necessarily set off any alarm bells because it's a completely legitimate request. You don't even need a lot of them, just enough to consume all the resources on web/application servers such that no one else can utilize them.

Leveraging a full-proxy intermediary should go quite a way toward mitigating this situation, because the request is being fed to the intermediary, not the web/application servers, and the intermediary generally has more resources and is already well versed in dealing with very slow clients. Resources are not consumed on the actual servers, and it would take a lot (generally hundreds of thousands to millions) of such requests to consume the resources on the intermediary. The reason such an attack works is that the miscreants aren't using many connections, so to take out a site front-ended by such an intermediary they would likely need enough connections to trigger an alert or notification. Disclaimer: I have not tested such a potential solution, so YMMV. In theory, based on how the attack works, the natural offload capabilities of ADCs should help mitigate this attack.

But I digress; the point is that security is one of the most important facets of maintaining availability. It isn't just about denial of service attacks, either, or even consuming resources. A well-targeted injection attack or defacement can cripple the database or compromise the web/application behavior such that the application no longer behaves as expected. It may respond to requests, but what it responds with is just as vital to "availability" as responding at all. As such, ensuring the integrity of application data and the applications themselves is paramount to preserving application availability.

The rule of thumb for security is this: If you build your security house out of sticks, a big bad wolf will eventually blow it down.

Assuring application availability is a much more complex task than just making sure the application is running. It's about ensuring enough capacity exists at the right time to scale on demand; it's about ensuring that if any single component fails, another is in place to take over; and it's absolutely about ensuring that a lackluster security policy doesn't result in a compromise that leads to failure. These three components are critical to the success of availability initiatives, and failing to address any one of them can cause the entire system to fail.

Related blogs & articles:
Some Services are More Equal than Others
Why Virtualization is a Requirement for Private Cloud Computing
What is Network-based Application Virtualization and Why Do You Need It?
Out, Damn'd Bot! Out, I Say!
The Application Delivery Spell Book: Detect Invisible (Application) Stalkers
WILS: How can a load balancer keep a single server site available?
The New Distribution of The 3-Tiered Architecture Changes Everything
What is a Strategic Point of Control Anyway?
Layer 4 vs Layer 7 DoS Attack
Putting a Price on Uptime
Single Point of Failure Is Not Just a Hardware Issue

As those of you close to me know, I have an interest in historical military firearms. As a veteran, I of course have an interest in the M-16, since I used it, but otherwise my tastes run more to World War II era weapons of all countries. Interestingly, in the United States a law change in 1986 made it rather difficult to own a fully automatic weapon. A special form of license allows collectors like myself to buy and own them under certain conditions, but the limitation that after 1986 no automatic weapon can be produced in the US or imported except to be sold to military and law-enforcement agencies makes the cost of pre-1986 automatic weapons prohibitive. The sense or nonsense of this series of laws is not the point of this blog, and you can guess, as a collector, where I fall on the issue. The real point is that this whole vast body of law stopped the production of M-16s for civilian use in the US... But it did not stop the creation of automatic M-16 style weapons for civilian use.

Yes, you read that right. The civilian version of the M-16 is the AR-15. The primary difference between the two weapons is that the AR-15 is semi-automatic only. You can fire it as fast as you can pull the trigger, but you can't empty a clip with a single pull. The AR-15 is still produced in some quantity in the US, and retails for $600 to $2000 USD depending upon brand and model. Shortly after the M-16 and other full-auto weapons were banned in the US, a simple workaround was created. Buy an AR-15, replace select parts with non-banned M-16 parts, and drop a thing called an auto-sear in. Nearly instant full-auto AR-15. The laws evolved, and at this point owning a drop-in auto sear AND an AR-15 is considered felony possession of an automatic weapon, and legal drop-in auto sears sell for many times the cost of an AR-15 because the auto-sear itself is now classified as a machine gun. Weird solution, but it solved the "problem" of people buying them and converting AR-15s.

Typical drop-in auto sear - quarter-bore.com

And the moral of the story? It's the part you weren't thinking about when you designed a solution that will get you every time. When you're going through your "no single point of failure" routine, make sure you account for every bit of software that the system relies upon. Most of the time "single point of failure" is seen as the hardware that provides ingress to the application, the core application itself, and the database. But most applications these days have external links - we'll call them auto-sears - to web services, SOA-enabled legacy apps, external file systems, ADS... The list goes on and on.

This post is just a friendly reminder to diagram all dependencies and make certain that your "no single point of failure" architecture doesn't have an auto-sear loophole in the software architecture side. The law had the time to evolve when the auto-sear loophole became obvious; you won't have that luxury, because a single point of failure you don't cover in your architectural planning will come back to bite you in emergency mode. There are tools to help you figure out what your dependencies are; even a simple link-checker will give you a list of URLs that your site includes, though going through them to find critical paths that could bring your site down is a manual job... It's better than not figuring this stuff out ahead of time.
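A rough way to start that dependency inventory is exactly what the paragraph above suggests: walk the application's pages and collect every external host they reference, then review the list for single points of failure. A minimal sketch using only the standard library; the URL is a placeholder for your own application:

import re
from urllib.parse import urlparse
from urllib.request import urlopen

def external_dependencies(page_url):
    """Collects hostnames referenced by a page that differ from the page's own host."""
    own_host = urlparse(page_url).netloc
    html = urlopen(page_url, timeout=10).read().decode("utf-8", errors="replace")
    # Grab anything that looks like an absolute http(s) URL in the markup.
    referenced = re.findall(r'https?://[^\s"\'<>]+', html)
    return sorted({urlparse(u).netloc for u in referenced} - {own_host})

for host in external_dependencies("https://app.example.com/"):
    print(host)

Each hostname in that list is a candidate auto-sear: a component the application quietly depends on that may never have made it onto the no-single-point-of-failure diagram.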
But most organizations have at-risk applications that have not had their entire software architecture reviewed. Either there was a belief at the time the application was deployed that it shouldn't operate without critical piece X (which may or may not have been correct), or there wasn't time to go through everything because we're constantly doing more with less." Either way, don't wait until it is an emergency; go find those auto-sears and close the loopholes now.

As virtualization and cloud computing continue to make inroads in the enterprise, you'll have the perfect opportunity to identify some of these issues when you move the applications. If they fail because access to some required resource fails, check and make certain the resource is not a single point of failure, then fix the original problem.

And no, I don't own either an AR-15 or an M-16. I've looked at AR-15s, but my preferred style is one that's not common (the A1 variant, for those in the know), so I've not yet found one I was willing to purchase. I have been bitten by the software single-point-of-failure issue, though. I wish I could say "just once".
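For the "simple link-checker" mentioned above, here is a minimal sketch of the idea: pull a single page and list every external host it references, as a first pass at the dependency inventory. The starting URL is a placeholder, and this only sees what's in the HTML; the web services, SOA-enabled legacy apps, file systems, and directory services an application depends on behind the scenes still have to be cataloged by hand.

    # Minimal dependency "link-checker" sketch: lists the external hosts a page
    # pulls from, as a starting point for a single-point-of-failure review.
    # Standard library only; START_URL is a hypothetical placeholder, and real
    # sites will need auth, JS rendering, and back-end inventories this can't see.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    START_URL = "http://www.example.com/"   # hypothetical starting page

    class RefCollector(HTMLParser):
        """Collect every URL-bearing attribute: links, scripts, images, frames, forms."""
        URL_ATTRS = {"href", "src", "action", "data-src"}

        def __init__(self):
            super().__init__()
            self.refs = set()

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in self.URL_ATTRS and value:
                    self.refs.add(value)

    def external_dependencies(url):
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        parser = RefCollector()
        parser.feed(html)
        base_host = urlparse(url).netloc
        hosts = {urlparse(urljoin(url, ref)).netloc for ref in parser.refs}
        return sorted(h for h in hosts if h and h != base_host)

    if __name__ == "__main__":
        for host in external_dependencies(START_URL):
            print("depends on:", host)   # each one is a question: what happens if it's down?

Each host the script prints is a candidate auto-sear: something the page quietly assumes will always be there.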
Damned if you do, damned if you don't

There has been much fervor around the outages of cloud computing providers of late, which seems to be leading to an increased and perhaps unwarranted emphasis on SLAs the likes of which we haven't seen since... well, the last time IT saw outsourcing of any kind reach the hype level cloud computing enjoys today. Consider this snippet of goodness for a moment, and pay careful attention to the last paragraph.

From Five Key Challenges of Enterprise Cloud Computing:

    I won't beat the dead "Gmail down, EC2 down, etc down" horse here. But the truth of the matter is enterprises today cannot reasonably rely on the cloud infrastructures/platforms to run their business. There's almost no SLAs provided by the cloud providers today. Even Jeff Barr from Amazon said that AWS only provides SLA for their S3 service. [...] Can you imagine enterprises signing up cloud computing contracts without SLAs clearly defined? It's like going to host their business critical infrastructure in a data center that doesn't have clearly defined SLA. We all know that SLAs really doesn't buy you much. In most cases, enterprises get refunded for the amount of time that the network was down. No SLA will cover business loss. However, as one of the CSOs I met said, it's about risk transfer. As long as there's a defined SLA on paper, when the network/site goes down, they can go after somebody. If there's no SLA, it will be the CIO/CSO's head that's on the chopping block.

Let's look at this rationally for a moment. SLAs really don't buy you much. True. True of cloud computing providers, true of the enterprise. No SLA covers business loss. True. True of cloud computing providers, true of the enterprise.

What I find amusing about this article is that the author asks if we can imagine "signing up cloud computing contracts without SLAs clearly defined?" Well, why not? Businesses do it every day when IT deploys the latest "Business App v4.5.3.2a". Microsoft Office 2007 relies heavily on on-line components, but we don't demand an SLA from Microsoft for it. Likewise, the anti-phishing capabilities of IE7 don't necessarily come with an SLA, and businesses don't shy away from making IE7 their corporate standard anyway. In fact, I'd argue that most cloudware today comes with an anti-SLA: use at your own risk, we don't guarantee anything.

The CIO/CSO's head is on the chopping block if he does have an SLA, because there's no guarantee that IT can meet it. Oh, usually IT does, because the SLA is broadly defined for all of IT in terms of "we'll have 5 9's of availability for the network" and "applications will have less than an X second response time" and so on (the quick math at the end of this post shows what a nines-based promise actually commits you to). But it isn't as if IT and the business sit down and negotiate SLAs for every single application they deploy into the enterprise data center. If they do, they're the exception, not the rule. And the applications for which they do are so time-sensitive and mission critical that it's unlikely responsibility for them will ever be outsourced. Financial services and brokerages are a good example of this. Outsourced? Unlikely. The IT folks responsible for the applications and networks in those industries are probably laughing uproariously at the idea.

The argument that an SLA is simply there to place a target on someone's head regarding responsibility for the uptime and performance of applications is largely true.
But that would seem to indicate that if you're a CIO/CSO who can wrangle any SLA at all out of a cloud computing provider, you should immediately use that provider for everything, because you can pass the mantle of responsibility for failing to meet SLAs to the provider instead of shouldering it yourself.

This isn't a cloud computing problem; it's a problem of responsibility and of managing expectations. It's a problem of expecting that a million moving parts, hundreds of connections, routers, switches, intermediaries, servers, operating systems, libraries, and applications will somehow always manage to be available. Unpossible, I say, and unrealistic regardless of whether we're talking cloud computing or enterprise infrastructure.

Basically, the CIO/CSO is damned if he has an SLA, because chances are IT is going to fail to meet it at some point, and he's damned if he doesn't have an SLA, because that means he's solely responsible for the reliability and performance of all of IT. And people wonder why C-level execs command the compensation levels they do. It's to make sure they can afford the steady stream of antacids they need just to get through the day.
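For context on the "5 9's" figure tossed around in those broadly defined SLAs, here is the quick math promised above: a small Python sketch that converts an availability percentage into allowed downtime. The thresholds and the 30-day month are illustrative assumptions, since every SLA defines its own measurement windows, exclusions, and remedies, and those details matter far more than the raw percentage.

    # Allowed downtime per year and per 30-day month for "N nines" of availability.
    # Illustrative only; real SLAs hinge on how and over what window they measure.
    MINUTES_PER_YEAR = 365.25 * 24 * 60
    MINUTES_PER_MONTH = 30 * 24 * 60

    for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
        down_year = (1 - availability) * MINUTES_PER_YEAR
        down_month = (1 - availability) * MINUTES_PER_MONTH
        print(f"{nines} nines ({availability:.5%}): "
              f"{down_year:8.1f} min/year, {down_month:6.1f} min/month allowed")

    # Five nines works out to roughly 5.3 minutes of downtime per year, about 26
    # seconds a month, which is why nobody should write that number down casually.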