reliability
16 TopicsThe Colonial Data Center and Virtualization
No, not colonial as in Battlestar Gallactica or the British Empire, colonial as in corals and weeds and virtual machines I was out pulling weeds this summer – Canada thistle to be exact – and was struck by how much its root system reminded me of Cnidaria (soft corals to those of you whose experience with aquaria remains relegated to suicidal goldfish). Canada thistle is difficult to control because of its extensive root system. Pulling a larger specimen you often find yourself pulling up its root, only to find it connected to three, four or more other specimens. Cnidaria reproduce in a similar fashion, sharing a “root” system that enables them to share resources. Unlike thistles, however, Cnidaria has several different growth forms. There’s a traditional colonial form that resembles thistles – a single, shared long root with various specimens popping up along the path – and one that may be familiar to folks who’ve seen Finding Nemo: a tree formation in which the root branches not only horizontally but vertically, with individual specimens forming upwards along the branch in what gives it a tree-like appearance. Cnidaria produce a variety of colonial forms, each of which is one organism but consists of polyp-like zooids. The simplest is a connecting tunnel that runs over the substrate (rock or seabed) and from which single zooids sprout. In some cases the tunnels form visible webs, and in others they are enclosed in a fleshy mat. More complex forms are also based on connecting tunnels but produce "tree-like" groups of zooids. The "trees" may be formed either by a central zooid that functions as a "trunk" with later zooids growing to the sides as "branches", or in a zig-zag shape as a succession of zooids, each of which grows to full size and then produces a single bud at an angle to itself. In many cases the connecting tunnels and the "stems" are covered in periderm, a protective layer of chitin. [6] Some colonial forms have other specialized types of zooid, for example, to pump water through their tunnels. [12] -- Wikipedia, Cnidaria Of course, both thistle and Cnidaria and the notion of colonial inter-dependence is one that’s shared by the data center. Virtual machines deployed on the same physical host replicate in many ways the advantages and disadvantages of a Cnidarian tree-formation. The close proximity of the 15.6 average VMs per host (according to Vkernel VMI 2012) allows them to share the “local” (virtual) network, which eliminates many of the physical sources of network latency that occur naturally in the data center. But it also means that a failure in the physical network connecting them to the network backbone is catastrophic for all VMs on a given host. Which is why you want to pay careful attention to placement of VMs in a dynamic data center. The concept of pulling compute resources from anywhere in the data center to support scalability on-demand is a tantalizing one, but doing so can have disastrous results in the event of a catastrophic failure in the network. Architecture and careful planning is necessary to ensure that resources do not end up grouped in such a way that a failure in one negatively impacts the entire application. Proximity must be considered as part of a fault isolation strategy, which is a requirement when resources are loosely – if at all – coupled to specific locations within the data center. Referenced blogs & articles: Wikipedia, Cnidaria Virtualization Management Index: Issues 1 and 2 Back to Basics: Load balancing Virtualized Applications Digital is Different The Cost of Ignoring ‘Non-Human’ Visitors Cloud Bursting: Gateway Drug for Hybrid Cloud The HTTP 2.0 War has Just Begun Why Layer 7 Load Balancing Doesn’t Suck Network versus Application Layer Prioritization Complexity Drives Consolidation Performance in the Cloud: Business Jitter is Bad214Views0likes0CommentsIntegration Topologies and SDN
The scalability issue with the #SDN model today isn’t that the data plane that won’t scale …it’s issues with the control plane. Reading “OpenFlow/SDN Is Not A Silver Bullet For Network Scalability” brought to light an important note that must be made regarding scalability and networks, especially when we start talking about the control plane. It isn’t that the network itself won’t scale well with SDN, the concern is – or should be – on the control side, on whether or not the integration of the control plane with the data plane will scale. A core characteristic of SDN is not only the separation of the control and data planes, but that that the control plane is centralized. There can be only one. The third characteristic that is important to SDN is the integration of these decoupled data plane devices with the control plane via APIs (Mike Fratto does an excellent job of discussing the importance of API support as well as making the very important distinction between API and SDK in his recent blog, “Three Signs of SDN Support to Watch for from Vendors”, so I won’t belabor this point right now). The convergence of these three characteristics results in what Enterprise Application Integration (EAI) has long known as a “hub and spoke” integration pattern. A hub – in the case of SDN, the controller – sits in the middle of a set of systems – in the case of SDN, the data plane devices – and is the center of the universe. The problem with this pattern, and why bus topologies rose to take its place, is that it doesn’t scale well. There is always only one central node, and it must necessarily manage and communicate with every other node in the integration. While hub-and-spoke, which grows linearly, isn’t nearly as difficult to scale as its predecessor the spaghetti (mesh) pattern, which grows exponentially, in a network growing even linearly is going to be problematic for some value of n (where n is the number of edges, i.e. nodes, in the network). At some value of n, it becomes apparent that the controller (hub) must be able to scale, too. Scaling up would require expansion of the system upon which the controller is deployed, which may require replacement. You can imagine the reluctance of operations to essentially shut down the entire network while that occurs. The other option is to scale out, vis à vis traditional methods of scaling other systems, via a load balancing service and duplicate instances. This implies a shared-something architecture, usually describes as being the database or repository for policy from which nodes are “programmed” by the controller. This appears to be the response in existing implementations, with “clusters of controllers” providing the scale and resiliency required. So scaling out the controller, then becomes an exercise in traditional scalability methods used to scale out client-server architectures. So the Control Plane of an SDN Can Scale. What’s the Problem then? As pointed out by Ivan Pepelnjak in “OpenFlow/SDN Is Not A Silver Bullet For Network Scalability”, the problem with this model appears to be response time. Failure in a node cannot be addressed fast enough by a centralized software system, particularly not one that relies on a database (which has its own scalability issues, of course). There are several questions that must be answered in order to even deal with failure that pose some interesting performance and scaling challenges. How does the controller know that an element node is “down”? Is it polling, which introduces an entirely new concern regarding the level of monitoring noise on the network interfering with business-related traffic? Is it monitoring a persistent control-channel connection between the controller and the node? Certainly this would indicate nearly instantaneously the status of the node, but introduces scaling challenges as maintaining even a one-to-one control-channel connection per element node in the network would consume large quantities of memory (and ultimately have a negative impact on performance, requiring scale out much sooner than may be otherwise necessary). Does a neighbor or upstream element tattle on the downed node when it doesn’t respond? There are a variety of mechanisms that could be used to monitor the network such that the controller is informed of a failure, but each brings with it new challenges and has different responsiveness profiles. Polling is tricky, as any load balancing provider will tell you, because it’s based on a timed interval. Persistent connections, as noted earlier, bring scalability challenges back to the table. Tattle-tale methodologies are unreliable, requiring that a neighbor or upstream element have the need to “talk to” the downed down before notification can occur, leaving open the possibility of a downed node going unnoticed until it’s too late. How does the controller respond to a downed element node? Obviously the controller needs to “route around” or “detour” traffic until a replacement can be deployed (virtually or physically). This no doubt requires some calculations to determine the best route (OSPF anyone?) if done in real-time. Some have suggested alternative routes in tables be available on each node in the event of a failure, a model more closely related to today’s existing routing practices and one that would certainly respond much better to failure in the network than would a system in which the controller must discover and reconfigure the network to adjust to failures. What happens to existing flows when an element node fails? Ah, the age old stateful failure challenge. This is one that is (almost) solved with redundant architectures that mirror sessions (flows) to a secondary device. The problem is that these models work best, i.e. have the highest levels of success, for full-proxy devices, particularly when the flow supports stateful/connection-oriented protocols. These questions are nothing new to experienced EAI practitioners who’ve had to suffer through a hub-and-spoke based integration effort. Failure in a node or of the controller give rise to painful fire-drill exercises, the likes of which no one really enjoys because they are highly disruptive. They’re also not really new questions for those with a long history in load balancing and high availability architectures. Still, these are questions which need to be answered in the context of the network, which has somewhat different uptime and performance requirements than even applications. Ultimately the answer is going to lie in architecture, and it’s unlikely that what results will be a single, centrally controlled one. QoS without Context: Good for the Network, Not So Good for the End user Applying ‘Centralized Control, Decentralized Execution’ to Network Architecture F5 Friday: ADN = SDN at Layer 4-7 SDN, OpenFlow, and Infrastructure 2.0 Cyclomatic Complexity of OpenFlow-Based SDN May Drive Market Innovation Five needs driving SDNs251Views0likes0CommentsApplying ‘Centralized Control, Decentralized Execution’ to Network Architecture
#SDN brings to the fore some critical differences between concepts of control and execution While most discussions with respect to SDN are focused on a variety of architectural questions (me included) and technical capabilities, there’s another very important concept that need to be considered: control and execution. SDN definitions include the notion of centralized control through a single point of control in the network, a controller. It is through the controller all important decisions are made regarding the flow of traffic through the network, i.e. execution. This is not feasible, at least not in very large (or even just large) networks. Nor is it feasible beyond simple L2/3 routing and forwarding. HERE COMES the SCIENCE (of WAR) There is very little more dynamic than combat operations. People, vehicles, supplies – all are distributed across what can be very disparate locations. One of the lessons the military has learned over time (sometimes quite painfully through experience) is the difference between control and execution. This has led to decisions to employ what is called, “Centralized Control, Decentralized Execution.” Joint Publication (JP) 1-02, Department of Defense Dictionary of Military and Associated Terms, defines centralized control as follows: “In joint air operations, placing within one commander the responsibility and authority for planning, directing, and coordinating a military operation or group/category of operations.” JP 1-02 defines decentralized execution as “delegation of execution authority to subordinate commanders.” Decentralized execution is the preferred mode of operation for dynamic combat operations. Commanders who clearly communicate their guidance and intent through broad mission-based or effects-based orders rather than through narrowly defined tasks maximize that type of execution. Mission-based or effects-based guidance allows subordinates the initiative to exploit opportunities in rapidly changing, fluid situations. -- Defining Decentralized Execution in Order to Recognize Centralized Execution * Lt Col Woody W. Parramore, USAF, Retired Applying this to IT network operations means a single point of control is contradictory to the “mission” and actually interferes with the ability of subordinates (strategic points of control) to dynamically adapt to rapidly changing, fluid situations such as those experienced in virtual and cloud computing environments. Not only does a single, centralized point of control (which in the SDN scenario implies control over execution through admittedly dynamically configured but rigidly executed) abrogate responsibility for adapting to “rapidly changing, fluid situations” but it also becomes the weakest link. Clausewitz, in the highly read and respected “On War”, defines a center of gravity as "the hub of all power and movement, on which everything depends. That is the point against which all our energies should be directed." Most military scholars and strategists logically imply from the notion of a Clausewitzian center of gravity is the existence of a critical weak link. If the “controller” in an SDN is the center of gravity, then it follows it is likely a critical, weak link. This does not mean the model is broken, or poorly conceived of, or a bad idea. What it means is that this issue needs to be addressed. The modern strategy of “Centralized Control, Decentralized Execution” does just that. Centralized Control, Decentralized Execution in the Network The major issue with the notion of a centralized controller is the same one air combat operations experienced in the latter part of the 20th century: agility, or more appropriately, lack thereof. Imagine a large network adopting fully an SDN as defined today. A single controller is responsible for managing the direction of traffic at L2-3 across the vast expanse of the data center. Imagine a node, behind a Load balancer, deep in the application infrastructure, fails. The controller must respond and instruct both the load balancing service and the core network how to react, but first it must be notified. It’s simply impossible to recover from a node or link failure in 50 milliseconds (a typical requirement in networks handling voice traffic) when it takes longer to get a reply from the central controller. There’s also the “slight” problem of network devices losing connectivity with the central controller if the primary uplink fails. -- OpenFlow/SDN Is Not A Silver Bullet For Network Scalability, Ivan Pepelnjak (CCIE#1354 Emeritus) Chief Technology Advisor at NIL Data Communications The controller, the center of network gravity, becomes the weak link, slowing down responses and inhibiting the network (and IT) from responding in a rapid manner to evolving situations. This does not mean the model is a failure. It means the model must adapt to take into consideration the need to adapt more quickly. This is where decentralized execution comes in, and why predictions that SDN will evolve into an overarching management system rather than an operational one are likely correct. There exist today, within the network, strategic points of control; locations within the data center architecture at which traffic (data) is aggregated, forcing all data to traverse, from which control over traffic and data is maintained. These locations are where decentralized execution can fulfill the “mission-based guidance” offered through centralized control. Certainly it is advantageous to both business and operations to centrally define and codify the operating parameters and goals of data center networking components (from L2 through L7), but it is neither efficient nor practical to assume that a single, centralized controller can achieve both managing and executing on the goals. What the military learned in its early attempts at air combat operations was that by relying on a single entity to make operational decisions in real time regarding the state of the mission on the ground, missions failed. Airmen, unable to dynamically adjust their actions based on current conditions, were forced to watch situations deteriorate rapidly while waiting for central command (controller) to receive updates and issue new orders. Thus, central command (controller) has moved to issuing mission or effects-based objectives and allowing the airmen (strategic points of control) to execute in a way that achieves those objectives, in whatever way (given a set of constraints) they deem necessary based on current conditions. This model is highly preferable (and much more feasible given today’s technology) than the one proffered today by SDN. It may be that such an extended model can easily be implemented by distributing a number of controllers throughout the network and federating them with a policy-driven control system that defines the mission, but leaves execution up to the distributed control points – the strategic control points. SDN is new, it’s exciting, it’s got potential to be the “next big thing.” Like all nascent technology and models, it will go through some evolutionary massaging as we dig into it and figure out where and why and how it can be used to its greatest potential and organizations’ greatest advantage. One thing we don’t want to do is replicate erroneous strategies of the past. No network model abrogating all control over execution has every really worked. All successful models have been a distributed, federated model in which control may be centralized, but execution is decentralized. Can we improve upon that? I think SDN does in its recognition that static configuration is holding us back. But it’s decision to reign in all control while addressing that issue may very well give rise to new issues that will need resolution before SDN can become a widely adopted model of networking. QoS without Context: Good for the Network, Not So Good for the End user Cyclomatic Complexity of OpenFlow-Based SDN May Drive Market Innovation SDN, OpenFlow, and Infrastructure 2.0 OpenFlow/SDN Is Not A Silver Bullet For Network Scalability Prediction: OpenFlow Is Dead by 2014; SDN Reborn in Network Management OpenFlow and Software Defined Networking: Is It Routing or Switching ? Cloud Security: It’s All About (Extreme Elastic) Control Ecosystems are Always in Flux The Full-Proxy Data Center Architecture378Views0likes0CommentsCyclomatic Complexity of OpenFlow-Based SDN May Drive Market Innovation
#openflow #sdn Programmability and reliability rarely go hand in hand, especially when complexity and size increase, which creates opportunity for vendors to differentiate with a vetted ecosystem I’m reading (a lot) on SDN these days. That means reading on OpenFlow, as the two are often tied together at the hip. In an ONF white paper on the topic, there were two “substantial” benefits with respect to an OpenFlow-based SDN that caught my eye as they appear to be in direct conflict with one another. 1. Programmability by operators, enterprises, independent software vendors, and users (not just equipment manufacturers) using common programming environments, which gives all parties new opportunities to drive revenue and differentiation. 2. Increased network reliability and security as a result of centralized and automated management of network devices, uniform policy enforcement, and fewer configuration errors. I am not sure it’s possible to claim both increased network reliability and security as well as programmability as being complementary benefits. Intuitively, network operators know that “messing with” routing and switching stability through programmability is a no-no. Several network vendors have discovered this in the past when programmable core network infrastructure was introduced. The notion of programmability for management purposes is acceptable, the notion of programmability for modification of function is not. Most network operators cannot articulate their unease with such notions. This is generally because there is a gap between developers and network operators’ core foci. Those developers who’ve decided to plunge into graduate school can – or should be able – to articulate exactly from where this unease comes, and why. It is the reason why points 1 and 2 conflict, and why I continue to agree with pundits who predict SDN will become the method of management for dynamic data centers but will not become the method of implementing new functions in core routing and switching a la via OpenFlow. Code Complexity and Error Rates Most folks understand higher complexity incurs higher risk. This is not only an intuitive understanding that transcends code outwards all the way to the data center architecture but it also been proven through studies. According to Steve McConnell in “Code Complete” (a staple of many developers), a study at IBM found “the most error-prone routines were those that were larger than 500 lines of code.” McConnell also notes a study by Lind and Vairavan that “code needed to be changed least when routines averaged 100 to 150 lines of code.” But it is not just the number of lines of code that contribute to the error rate within source code. Cyclomatic complexity increases the potential for errors; that is, the more conditional paths possible in logic the higher the cyclomatic complexity. The cyclomatic complexity of a section of source code is the count of the number of linearly independent paths through the source code. For instance, if the source code contained no decision points such as IF statements or FOR loops, the complexity would be 1, since there is only a single path through the code. If the code had a single IF statement containing a single condition there would be two paths through the code, one path where the IF statement is evaluated as TRUE and one path where the IF statement is evaluated as FALSE. -- Wikipedia, Cyclomatic complexity An example of this is seen in the OpenFlow wiki tutorial, illustrating some (very basic) pseudocode: Sample Pseudocode Your learning switch should learn the port of hosts from packets it receive. This is summarized by the following sequence, run when a packet is received:179Views0likes0CommentsF5 Friday: Ops First Rule
#cloud #microsoft #iam “An application is only as reliable as its least reliable component” It’s unlikely there’s anyone in IT today that doesn’t understand the role of load balancing to scale. Whether cloud or not, load balancing is the key mechanism through which load is distributed to ensure horizontal scale of applications. It’s also unlikely there’s anyone in IT that doesn’t understand the relationship between load balancing and high-availability (reliability). High-Availability (HA) architectures are almost always implemented using load balancing services to ensure seamless transition from one service instance to another in the event of a failure. What’s often overlooked is that scalability and HA isn’t important just for applications. Services – whether application or network-focused – must also be reliable. It’s the old “only as strong as the weakest link in the chain” argument. An application is only as reliable as its least reliable component – and that includes services and infrastructure upon which that application relies. It is – or should be – ops first rule; the rule that guides design of data center architectures. This requirement becomes more and more obvious as emerging architectures combining the data center and cloud computing are implemented, particularly when federating identity and access services. That’s because it is desirable to maintain control over the identity and access management processes that authenticate and authorize use of applications no matter where they may be deployed. Such an architecture relies heavily on the corporate identity store as the authoritative source of both credentials and permissions. This makes the corporate identity store a critical component in the application dependency chain, one that must necessarily be made as reliable as possible. Which means you need load balancing. A good example of how this architecture can be achieved is found in BIG-IP load balancing support for Microsoft’s Active Directory Federation Services (AD FS). AD FS and F5 Load Balancing Microsoft’s Active Directory Federation Services, (AD FS) sever role is an identity access solution that extends the single sign-on, (SSO) experience for directory-authenticated clients, (typically provided on the Intranet via Kerberos), to resources outside of the organization’s boundaries, such as cloud computing environments. To ensure high-availability, performance, and scalability the F5 BIG-IP Local Traffic Manager (LTM) can be deployed to load balance an AD FS server farm. There are several scenarios in which BIG-IP can load balance AD FS services. 1. To enable reliability of AD FS for internal clients accessing external resources, such as those hosted in Microsoft Office 365. This is the simplest of architectures and the most restrictive in terms of access for end-users as it is limited to only internal clients. 2. To enable reliability of AD FS and AD FS proxy servers, which provide external end-user SSO access to both internal federation-enabled resources as well as partner resources like Microsoft Office 365. This is a more flexible option as it serves both internal and external clients. 3. BIG-IP Access Policy Manager (APM) can replace the need for AD FS proxy servers required for external end-user SSO access, which eliminates another tier and enables pre-authentication at the perimeter, offering both the flexibility required (supporting both internal and external access) as well as a more secure deployment. In all three scenarios, F5 BIG-IP serves as a strategic point of control in the architecture, assuring reliability and performance of services upon which applications are dependent, particularly those of authentication and authorization. Using BIG-IP APM instead of AD FS proxy servers both simplifies and makes more agile the architecture. This is because BIG-IP APM is inherently more programmable and flexible in terms of policy creation. BIG-IP APM, being deployed on the BIG-IP platform, can take full advantage of the context in which requests are made, ensuring that identity and access control go beyond simple credentials and take into consideration device, location, and other contextual-clues that enable a more secure system of authentication and authorization. High-availability – and ultimately scalability - is preserved for all services by leveraging the core load balancing and HA functionality of the BIG-IP platform. All components in the chain are endowed with HA capabilities, making the entire application more resilient and able to withstand minor and major failures. Using BIG-IP LTM for load balancing AD FS serves as an adaptable and extensible architectural foundation for a phased deployment approach. As a pilot phase, rolling out AD FS services for internal clients only makes sense, and is the simplest in terms of its implementation. Using BIG-IP as the foundation for such an architecture enables further expansion in subsequent phases, such as introducing BIG-IP APM in a phase two implementation that brings flexibility of access location to the table. Further enhancements can then be made regarding access when context is included, enabling more complex and business-focused access policies to be implemented. Time-based restrictions on clients or location can be deployed and enforced, as is desired or needed by operations or business requirements. Reliability is a Least Common Factor Problem Reliability must be enabled throughout the application delivery chain to ultimately ensure reliability of each application. Scalability is further paramount for those dependent services, such as identity and access management, that are intended to be shared across multiple applications. While certainly there are many other load balancing services that could be used to enable reliability of these services, an extensible and highly scalable platform such as BIG-IP is required to ensure both reliability and scalability of shared services upon which many applications rely. The advantage of a BIG-IP-based application delivery tier is that its core reliability and scalability services extend to any of the many services that can be deployed. By simplifying the architecture through application delivery service consolidation, organizations further enjoy the benefits of operational consistency that keeps management and maintenance costs reduced. Reliability is a least common factor problem, and Ops First Rule should be applied when designing a deployment architecture to assure that all services in the delivery chain are as reliable as they can be. F5 Friday: BIG-IP Solutions for Microsoft Private Cloud BYOD–The Hottest Trend or Just the Hottest Term The Four V’s of Big Data Hybrid Architectures Do Not Require Private Cloud The Cost of Ignoring ‘Non-Human’ Visitors Complexity Drives Consolidation What Does Mobile Mean, Anyway? At the Intersection of Cloud and Control… Cloud Bursting: Gateway Drug for Hybrid Cloud Identity Gone Wild! Cloud Edition223Views0likes0CommentsF5 Friday: Killing Two Birds with One (Solid State) Stone
Sometimes mitigating operational risk is all about the hardware. MTBF. Mean Time Between Failure. An important piece of this often-used but rarely examined acronym is the definition of “mean”: The quotient of the sum of several quantities and their number; an average An average. That means just as many folks experienced failure later than the value as did earlier. And it is the earlier that is particularly troublesome when it comes to the data center. Customers replace disk drives at rates far higher than those suggested by the estimated mean time between failure (MTBF) supplied by drive vendors, according to a study of about 100,000 drives conducted by Carnegie Mellon University. (PCWorld, “Study: Hard Drive Failure Rates Much Higher Than Makers Estimate) An eWeek commentary referencing the same study, which tracked drives heavily used for storage and web servers, found “annual disk replacements rates were more in the range of 2 to 4 percent and were as high as 13 percent for some sites.” It is likely the case that when evaluating MTBF rates for disk drives, the more intense, volatile access associated with storage and web servers increases the strain on components that lead to earlier and more frequent failure rates. Fast forward to today’s use of virtualization, particularly for shared services such as storage and compute, and volatile access certainly fits the bill. So it makes sense that network components relying on disk drives that process highly volatile data – such as caches and WAN optimization controllers - would also be more likely to experience higher rates of failure than if the same drive were in, say, your mom’s PC. If you’re the geek of the family, then a failure in mom’s drive is not exactly a pleasant experience, but if you’re an admin in an IT organization and the drive in a network component fails, well, I think we can all agree that’s an even less pleasant situation. But failure rates isn’t the only issue with network components and disk drives. Performance, too, is a factor. The slowest functions today on any computing machine is I/O. Whether that I/O is for graphics or storage makes little difference (unless you happen to have a graphics accelerator card, then storage is almost certainly your worst performing component). The latency for reading and writing data is often dismissed by consumers as negligible, but that’s because latency doesn’t cost them anything. Performance for IT organizations and their businesses is critical, with mere seconds incurring losses for some industries at a rate of thousands or more per second. If a network component in the data flow is causing performance problems – especially those for which the value proposition is performance-enhancement – then it’s a net negative for IT and the business and the component may not survive a strategy audit. So sometimes it really is about the hardware. Now, F5 is almost always strategically in the data path. Whether it’s BIG-IP LTM (Local Traffic Manager) providing load balancing, BIG-IP WebAccelerator performing dynamic caching duties, or BIG-IP WOM (WAN Optimization Manager) optimizing and accelerating data transfers over WAN links, we are in the data path. Anything We can do to increase reliability and performance is a Very Good Thing™ both for us and for our customers. Because We own the hardware, We can make choices with respect to which components We want to leverage to ensure the fastest, most reliable platform We can. So We’ve made some of those decisions, lately, and the result is a new hardware platform – the 11000. The primary benefit of the new platform? You got it, SSD (Solid State Disk) as an alternative to traditional hard-drives. The 11000 handles up to 20 Gbps of LAN-side throughput, 16 Gbps of hardware compression, and optional solid-state drives, which greatly reduces risk of failure (availability/reliability) while simultaneously improving performance. That’s two of the three components of operational risk. The third, security, is not directly addressed by SSDs, although the performance improvement when encrypting data at rest could be a definite plus. I could go on, but my cohort Don MacVittie has already posted not one, but two excellent overviews of the new platform and, as a bonus just for you, he’s always penned a post on a related announcement around our new FIPS-compliant platforms. F5 11000 Platform Resources: Security, not HSMs, in Droves Speed, Volume, and F5. The Need For Speed. SSDs in the Enterprise. F5 BIG-IP Platform Security BIG-IP Hardware Datasheet BIG-IP Hardware Updates F5 Introduces High-Performance Platforms to Help Organizations Optimize Application Delivery and Reduce Costs Related blogs & articles: Operational Risk Comprises More Than Just Security All F5 Friday posts F5 Friday: Latency, Logging, and Sprawl F5 Friday: Performance, Throughput and DPS Data Center Optimization is Like NASCAR without the Beer184Views0likes0CommentsWhen Black Boxes Fail: Amazon, Cloud and the Need to Know
If Amazon’s Availability Zone strategy had worked as advertised its outage would have been non-news. But then again, no one really knows what was advertised… There’s been a lot said about the Amazon outage and most of it had to do with cloud and, as details came to light, about EBS (Elastic Block Storage). But very little mention was made of what should be obvious: most customers didn’t – and still don’t - know how Availability Zones really work and, more importantly, what triggers a fail over. What’s worse, what triggers a fail back? Amazon’s documentation is light. Very light. Like cloud light. Availability Zones are distinct locations that are designed to be insulated from failures in other zones. This allows you to protect your applications from possible failures in a single location. Availability Zones also provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. -- Announcement: New Amazon EC2 Availability Zone in US East Now, it’s been argued that part of the reason Amazon’s outage was so impactful was a lack of meaningful communication on the part of Amazon regarding the outage. And Amazon has admitted to failing in that regard and has promised to do better in the future. But just as impactful is likely the lack of communication before the outage; before deployment. After all, the aforementioned description of Availability Zones, upon which many of the impacted customers were relying to maintain availability, does not provide enough information to understand how Availability Zones work, nor how they are architected. TRIGGERING FAIL-OVER Scrounging around the Internet turns up very little on how Availability Zones work, other than isolating instances from one another and being physically independent for power. In the end, what was promised was a level of isolation that would mitigate the impact of an outage in one zone. Turns out that didn’t work out so well but more disconcerting is that there is still no explanation regarding what kind of failure – or conditions – result in a fail over from one Availability Zone to another. Global Application Delivery Services (the technology formerly known as Global Server load balancing) is a similar implementation generally found in largish organizations. Global application delivery can be configured to “fail over” in a variety of ways, based on operational or business requirements, with the definition of “failure” being anything from “OMG the data center roof fell in and crushed the whole rack!” to “Hey, this application is running a bit slower than it should, can you redirect folks on the east coast to another location? Kthanxbai.” It allows failover in the event of failure and redirection based on failure to meet operational and business goals. It’s flexible and allows the organization to define “failure” based on its needs and on its terms. But no such information appears to exist for Amazon’s Availability Zones and we are left to surmise that it’s like based on something more akin to the former rather than the latter given their rudimentary ELB (Elastic Load Balancing) capabilities. When organizations architect and subsequently implement disaster recovery or high-availability initiatives – and make no mistake, that’s what using multiple Availability Zones on Amazon is really about – they understand the underlying mechanism and triggers that cause a “failover” from one location to another. This is the critical piece of information, of knowledge, that’s missing. In an enterprise-grade high-availability architecture it is often the case that such triggers are specified by both business and operational requirements and may include performance degradation as well as complete failure. Such triggers may be based on a percentage of available resources, or other similar resource capacity constraints. Within an Amazon Availability Zone apparently the trigger is a “failure”, but a failure of what is left to our imagination. But also apparently missing was a critical step in any disaster recovery/high availability plan: testing. Enterprise IT not only designs and implements architectures based on reacting to a failure, but actually tests that plan. It’s often the case that such plans are tested on a yearly basis, just to ensure all the moving parts still work as expected. Relying on Amazon – or any cloud computing environment in which resources are shared – makes it very difficult to test such a plan. After all, in order to test failover from one Availability Zone to another Amazon would have to forcibly bring down an Availability Zone – and every application deployed within it. Consider how disruptive that process might be if customers started demanding such tests on their schedule. Obviously this is not conducive to Amazon maintaining its end of the uptime bargain for those customers not deployed in multiple Availability Zones. AVAILBILITY and RELIABILITY: CRITICAL CONCERNS BORNE OUT by AMAZON FAILURE Without the ability to test such plans, we return to the core issue – trust. Organizations relying wholly on cloud computing providers must trust the provider explicitly. And that generally means someone in the organization must understand how things work. Black boxes should not be invisible boxes, and the innate lack of visibility into the processes and architecture of cloud computing providers will eventually become as big a negative as security was once perceived to be. Interestingly enough, a 2011 IDG report on global cloud computing adoption shows that high performance (availability and reliability) are the most important – 5% higher than security. Amazon’s epic failure will certainly do little to alleviate concerns that public cloud computing is not ready for mission critical applications. Now, it’s been posited that customers were at fault for trusting Amazon in the first place. Others laid the blame solely on the shoulders of Amazon. Both are to blame, but Amazon gets to shoulder a higher share of that blame if for no other reason than it failed to provide the information necessary for customers to make an informed choice regarding whether or not to trust their implementation. This has been a failing of Amazon’s since it first began offering cloud computing services – it has consistently failed to offer the details necessary for customers to understand how basic processes are implemented within its environment. And with no ability to test failover across Availability Zones, organizations are at fault for trusting a technology without understanding how it works. What’s worse, many simply didn’t even care – until last week. Now it may be the case that Amazon is more than willing to detail such information to customers; that it has adopted a “need to know” policy regarding its architecture and implementation and its definition of “failure.” If that is the case, then it behooves customers to ask before signing on the dotted line. Because customers do need to know the details to ensure they are comfortable with the level of reliability and high-availability being offered. If that’s not the case or customers are not satisfied with the answers, then it behooves them to – as has been suggested quite harshly by many cloud pundits – implement alternative plans that involve more than one provider. A massive failure on the part of a public cloud computing provider was bound to happen eventually. If not Amazon than Azure, if not Azure then it would have been someone else. What’s important now is take stock and learn from the experience – both providers and customers – such that a similar event does not result in the same epic failure again. Two key points stand out: The need to understand how services work when they are provided by a cloud computing provider. Black box mentality is great marketing (hey, no worries!) but in reality it’s dangerous to the health and well-being of applications deployed in such environments because you have very little visibility into what’s really going on. The failure to understand how Amazon’s Availability Zones actually worked – and exactly what constituted “isolation” aside from separate power sources as well as what constitutes a “failure” – lies strictly on the customer. Someone within the organization needs to understand how such systems work from the bottom to the top to ensure that such measures meet requirements. The need to implement a proper disaster recovery / high availability architecture and test it. Proper disaster recovery / high availability architectures are driven by operational goals which are driven by business requirements. A requirement for 100% uptime will likely never be satisfied by a single site, regardless of provider claims. And if you can’t test it but need to guarantee high uptime and can’t get the details necessary to understand – and trust – the implementation to provide that uptime, perhaps it’s not the right choice in the first place. Or perhaps you just need to demand an explanation. Bye, Bye My Clustered AMIs…A Cloud Tribute to Don McLean Amazon Web Services apologizes, promises better communication, transparency Cloud Failure, FUD, and The Whole AWS Oatage… They're Called Black Boxes Not Invisible Boxes Dynamic Infrastructure: The Cloud within the Cloud How to Earn Your Data Center Merit Badge Disaster Recovery: Not Just for Data Centers Anymore Cloud Control Does Not Always Mean ‘Do it yourself’ Control, choice, and cost: The conflict in the cloud238Views0likes0CommentsReliability? We've got your reliability right here...
When talking about IT performance and rating "must haves", data center reliability is often right near the top of the list, and for good reason. Performance and scalability , features and functionality don't matter much unless the application is up and available. We here at F5 tend to hold availability in pretty high regard, and recent info from Netcraft seems to show that this effort has not gone in vain. Netcraft likes to study and analyze many things, among which is the reliability of different hosting companies. The way they do this is by polling around forty different hosting providers' websites at 15 minute intervals from different locations around the net, then crunching those numbers into something meaningful. Often near the top of the list of the most reliable hosting companies is Rackspace. I hear what you're asking, "As cool as they are, what does Rackspace have to do with F5, and why are you yammering on about them?". Pictures, as they say, are worth quite a few words, so feast your eyes on this: Source: http://news.netcraft.com/archives/2011/05/02/most-reliable-hosting-company-sites-in-april-2011.html Still don't see it? Of special interest, to me at least, is the "OS" listed for the Rackspace entry. While F5 BIG-IP might not technically be an OS (it's oh so much more!), it's still wicked fun to see it at the top of a reliability list. So thanks, Rackspace, for maintaining a highly available architecture and using F5 gear to help do it. Keep up the good work. #Colin173Views0likes0CommentsData Center Feng Shui: Reliability is not the Absence of Failure
But rather it is the ability to compensate for it. Redundancy. It’s standard operating procedure for everyone who deals with technology – even consumers. Within IT we’re a bit more stringent about how much redundancy we build into the data center. Before commoditization and the advent of cheap computing (a.k.a. cloud computing ) we worried about redundant power supplies and network connections. We leveraged fail-over as a means to ensure that when the inevitable happened, a second, minty-fresh server/application/switch was ready to take over without dropping so much as a single packet on the data center floor. Notice I said “inevitable.” That’s important, because we know with near-absolute certainty that inevitably hardware and software fails. Interestingly, it is only hardware that comes with an MTBF (Mean Time Between Failures) rating. It is nearly as inevitable that software (applications) will also experience some sort of failure – whether due to error or overload or because of a dependency on some other piece of software that will, too, inevitably fail because of hardware or internal issues. Failure happens. That doesn’t mean, however, that an application or an architecture or a network is unreliable. Reliability is not an absence of failure in the data center, it’s a measure of how quickly such failures can be compensated for. Being able to rely upon an application does not mean it never fails, it simply means that such failures as do occur are corrected for or otherwise addressed quickly, before they have an impact on the availability of the application. And that means the entire application delivery chain. ARCHITECTURAL RELIABILITY We’ve been building application delivery architectures for a while now with a key design goal in mind: built to fail. We assume that any given piece of the architecture might go belly up at any time, and architect a solution that takes that into consideration to ensure that availability is never (or as minimally as possible) impacted. Hardware. Software. Any given piece of a critical system should be able to fail without negatively impacting availability. Performance may degrade, but availability itself is maintained. This often takes the form of “standby” systems; duplicates of a given infrastructure or application service that, in the event of a failure, are ready to stand in for the primary and continue doing what needs to be done. They’re the second-stringers, the bench warmers, the idle resources that are the devil’s playground in the data center. And we’re getting rid of them. As we optimize the data center for cost and efficiency, we’re eliminating the redundant duplication (see what I did there?) within the architecture and replacing it with something more aligned with the business goals of maximizing the return on investment associated with all that hardware and software that makes the business go. We’re automating fail-over processes that no longer assume a secondary exists: instead, we automatically provision a new primary in the event of a failure from the much larger pool of resources that were once reserved. We’re modifying the notion of architectural reliability to mean we don’t need to fail-over, we’ll just fail-through instead. And that works, except when it doesn’t. SINGLE-POINTS of FAILURE The danger here is two-fold: first, that we will run short of resources and be unable to handle any failure and second, that we can guarantee that the provisioning process can occur nearly simultaneously with the failure. It can’t. At least not yet. And while we’re getting quite good at leveraging intelligent health monitoring and collaborative infrastructure architectures, we still haven’t figured out how to predict a failure. Auto-scaling works because it does not account for failure. It assumes infinite resources and consistent availability. We can tell when an application is nearing capacity and adjust the resources accordingly before it becomes necessary. And it is exactly that “before” that is important in maintaining availability and thus providing a reliable application. But we can’t predict a failure, and thus we can’t know when to begin provisioning until it’s essentially too late. There are only two viable solutions: pre-provisioning, which defeats the purpose of such real-time automation and scalability services in the first place and reserved resources, which can have a deleterious affect on efficiency and costs – you’re purposefully creating a pool of idle resources again. Both tactics have the same effect: idle resources waiting to be needed which runs contrary to one of the desired intents of implementing a virtualized or cloud computing-based infrastructure in the first place. Thus, the definition of reliability as it pertains to our new, agile and cloud-based applications is directly related to the longest time required to either replace or provision any comprised component. Single points of failure, you see, are very bad for reliability. Especially when they are virtualized and it may the case that there are no resources available that can be used to “replace” the failed component. This is particularly important to note as we start to virtualize the infrastructure. It sounds like a good deal: virtual network appliances dramatically decrease the CapEx associated with such investments, but operationally you still have the same challenges to address. You still need a redundant system and they must reside on physically separate systems, in case the hardware upon which the virtual network appliance is deployed itself fails. That’s true as well for applications; redundancy must be system-wide which means two instances of the same application on the same physical device invites unreliability. And when you realize that you’re going to need a physical system for every instance of a virtual network appliance, you might start wondering why it was that you virtualized them in the first place. Especially when you consider you exchanged nearly instantaneous serial-based fail-over for pretty fast network-based failure and a largely reduced capacity per instance. And of course any gains provided by purpose-built hardware acceleration that cannot easily (or cheaply) be duplicated in a virtualized environment. Oh, and let’s not forget the potential of creating a single point of failure where there was none by eliminating the fail-to-wire option of so many infrastructure components. Almost every proxy-based network component “fails to wire” in the event of a failure resulting in a loss of functionality but not the ability to pass data, which means availability of the application is not compromised in the event a failure, although security or other functionality might be.Yes, you gained architectural multi-tenancy and simplified provisioning, but the need for such an implementation is quickly being erased by the rush by vendors to provide true multi-tenancy for network-based infrastructure and many of the gains in provisioning can be achieved using the same infrastructure 2.0 capable methods (APIs, SDKs) that are used to integrate virtual form factors. The ability to react quickly, for agile operations, depends heavily on underlying architectural decisions. And it is the ability to react nearly instantaneously to failures throughout the entire infrastructure that enables a reliable, consistent application. Consider carefully the pros and cons of virtualization in every aspect of a deployment as it relates specifically to reliability with an eye toward aligning architectural decisions with business and operational requirements. This includes the business making decisions regarding “mission critical” applications. Not every application is mission critical, and understanding which applications are truly vital will go a long way toward cutting costs in infrastructure and management. A mission-critical application reliability requirement of 100% will likely remove some components from the virtualization list and potentially impact decisions regarding resource allocation/reservation systems. Single points of failure must be eliminated in critical application delivery chains to ensure reliability. Failure will happen, eventually, and a reliable infrastructure takes that into account and ensures a timely response as a means to avoid downtime and its associated costs. Data Center Feng Shui Operational Risk Comprises More Than Just Security The Number of the Counting Shall be Three (Rules of Thumb for Application Availability) Data Center Feng Shui: Fault Tolerance and Fault Isolation All Data Center Feng Shui posts on DevCentral Architectural Multi-tenancy The Question Shouldn’t Be Where are the Network Virtual Appliances but Where is the Architecture? I CAN HAS DEFINISHUN of SoftADC and vADC? The Devil is in the Details VM Sprawl is Bad but Network Sprawl is Badder173Views0likes0CommentsThe Number of the Counting Shall be Three (Rules of Thumb for Application Availability)
Three shall be the number thou shalt count, and the number of the counting shall be three. If you’re concerned about maintaining application availability, then these three rules of thumb shall be the number of the counting. Any less and you’re asking for trouble. I like to glue animals to rocks and put disturbing amounts of electricity and saltwater NEXT TO EACH OTHER Last week I was checking out my saltwater reef when I noticed water lapping at the upper edges of the tank. Yeah, it was about to overflow. Somewhere in the system something had failed. Not entirely, but enough to cause the flow in the external sump system to slow to a crawl and the water levels in the tank to slowly rise. Troubleshooting that was nearly as painful as troubleshooting the cause of application downtime. As with a data center, there are ingress ports and egress ports and inline devices (protein skimmers) that have their own flow rates (bandwidth) and gallons per hour processing capabilities (capacity) and filtering (security). When any one of these pieces of the system fails to perform optimally, well, the entire system becomes unreliable, instable, and scary as hell. Imagine a hundred or so gallons of saltwater (and all the animals inside) floating around on the floor. Near electricity. The challenges to maintaining availability in a marine reef system are similar to those in an application architecture. There are three areas you really need to focus on, and you must focus on all three because failing to address any one of them can cause an imbalance that may very well lead to an epic fail. RELIABILITY Reliability is the cornerstone of assuring application availability. If the underlying infrastructure – the hardware and software – fails, the application is down. Period. Any single point of failure in the delivery chain – from end-to-end – can cause availability issues. The trick to maintaining availability, then, is redundancy. It is this facet of availability where virtualization most often comes into play, at least from the application platform / host perspective. You need at least two instances of an application, just in case. Now, one might think that as long as you have the capability to magically create a secondary instance and redirect application traffic to it if the primary application host fails that you’re fine. You’re not. Creation, boot, load time…all impact downtime and in some cases, every second counts. The same is true of infrastructure. It may seem that as long as you could create, power up, and redirect traffic to a virtual instance of a network component that availability would be sustained, but the same timing issues that plague applications will plague the network, as well. There really is no substitute for redundancy as a means to ensure the reliability necessary to maintain application availability. Unless you find prescient, psychic components (or operators) capable of predicting an outage at least 5-10 minutes before it happens. Then you’ve got it made. Several components are often overlooked when it comes to redundancy and reliability. In particular, internet connectivity is often ignored as a potential point of failure or, more often the case, it is viewed as one of those “things beyond our control” in the data center that might cause an outage. Multiple internet connections are expensive, understood. That’s why leveraging a solution like link load balancing makes sense. If you’ve got multiple connections, why not use them both and use them intelligently – to assist in efforts to maintain/improve application performance or prioritize application traffic in and out of the data center. Doing so allows you to assure availability in the event that one connection fails, yet the connection never sits idle when things are all hunky dory in the data center. The rule of thumb for reliability is this: Like Sith lords, there should always be two of everything with automatic failover to the secondary if the primary fails (or is cut down by a Jedi knight). CAPACITY The most common cause of downtime is probably a lack of capacity. Whether it’s due to a spike in usage (legitimate or not) or simply unanticipated growth over time, a lack of compute resources available across the application infrastructure tiers is usually the cause of unexpected downtime. This is certainly one of the drivers for cloud computing and rapid provisioning models – external and internal – as it addresses the immediacy of need for capacity upon availability failures. This is particularly true in cases where you actually have the capacity – it just happens to reside physically on another host. Virtualization and cloud computing models allow you to co-opt that idle capacity and give it to the applications that need it, on-demand. That’s the theory, anyway. Reality is that there are also timing issues around provisioning that must be addressed but these are far less complicated and require fewer psychic powers than predicting total failure of a component. Capacity planning is as much art as science, but it is primarily based on real numbers that can be used to indicate when an application is nearing capacity. Because of this predictive power of monitoring and data, provisioning of additional capacity can be achieved before it’s actually needed. Even without automated systems for provisioning, this method of addressing capacity can be leveraged – the equations for when provisioning needs to begin simply change based on the amount of time needed to manually provision the resources and integrate it with the scalability solution (i.e. the Load balancer, the application delivery controller). The rule of thumb for capacity is this: Like interviews and special events, unless you’re five minutes early provisioning capacity you’re late. SECURITY Security – or lack thereof - is likely the most overlooked root cause of availability issues, especially in today’s hyper-connected environments. Denial of service attacks are just that, an attempt to deny service to legitimate users, and they are getting much harder to detect because they’ve been slowly working their way up the stack. Layer 7 DDoS attacks are particularly difficult to ferret out as they don’t necessarily have to even be “fast”, they just have to chew up resources. Consider the latest twist on the SlowLoris attack; the attack takes the form of legitimate POST requests that s-l-o-w-l-y feed data to the server, in a way that consumes resources but doesn’t necessarily set off any alarm bells because it’s a completely legitimate request. You don’t even need a lot of them, just enough to consume all the resources on web/application servers such that no one else can utilize them. Leveraging a full proxy intermediary should go quite a ways to mitigate this situation because the request is being fed to the intermediary, not the web/application servers, and the intermediary generally has more resources and is already well versed in dealing with very slow clients. Resources are not consumed on the actual servers and it would take a lot (generally hundreds of thousands to millions) of such requests to consume the resources on the intermediary. The reason such an attack works is because the miscreants aren’t using many connections, so it’s likely that in order to take out a site front-ended by such an intermediary enough connections to trigger an alert/notification would be necessary. Disclaimer:I have not tested such a potential solution so YMMV. In theory, based on how the attack works, the natural offload capabilities of ADCs should help mitigate this attack. But I digress, the point is that security is one of the most important facets of maintaining availability. It isn’t just about denial of service attacks, either, or even consuming resources. A well-targeted injection attack or defacement can cripple the database or compromise the web/application behavior such that the application no longer behaves as expected. It may respond to requests, but what it responds with is just as vital to “availability” as responding at all. As such, ensuring the integrity of application data and applications themselves is paramount to preserving application availability. The rule of thumb for security is this: If you build your security house out of sticks, a big bad wolf will eventually blow it down. Assuring application availability is a much more complex task than just making sure the application is running. It’s about ensuring enough capacity exists at the right time to scale on demand; it’s about ensuring that if any single component fails another is in place to take over, and it’s absolutely about ensuring that a lackluster security policy doesn’t result in a compromise that leads to failure. These three components are critical to the success of availability initiatives and failing to address any one of them can cause the entire system to fail. Related blogs & articles: Some Services are More Equal than Others Why Virtualization is a Requirement for Private Cloud Computing What is Network-based Application Virtualization and Why Do You Need It? Out, Damn’d Bot! Out, I Say! The Application Delivery Spell Book: Detect Invisible (Application) Stalkers WILS: How can a load balancer keep a single server site available? The New Distribution of The 3-Tiered Architecture Changes Everything What is a Strategic Point of Control Anyway? Layer 4 vs Layer 7 DoS Attack Putting a Price on Uptime211Views0likes0Comments