virtual network appliance
Data Center Feng Shui: Fault Tolerance and Fault Isolation
Like most architectural decisions, the two goals do not require mutually exclusive choices. The difference between fault isolation and fault tolerance is not necessarily intuitive. The differences, though subtle, are profound and have a substantial impact on data center architecture.

Fault tolerance is an attribute of systems and architectures that allows them to continue performing their tasks in the event of a component failure. Fault tolerance of servers, for example, is achieved through the use of redundancy in power supplies, in hard drives, and in network cards. In an architecture, fault tolerance is also achieved through redundancy by deploying two of everything: two servers, two load balancers, two switches, two firewalls, two Internet connections. The fault-tolerant architecture includes no single point of failure; no component that can fail and cause a disruption in service. Load balancing, for example, is a fault-tolerance strategy that leverages multiple application instances to ensure that the failure of one instance does not impact the availability of the application.

Fault isolation, on the other hand, is an attribute of systems and architectures that isolates the impact of a failure such that only a single system, application, or component is impacted. Fault isolation accepts that a component may fail as long as that failure does not impact the overall system. That sounds like a paradox, but it’s not. Many intermediary devices employ a “fail open” strategy as a method of fault isolation. When a network device is required to intercept data in order to perform its task – a common web application firewall configuration – it becomes a single point of failure in the data path. To mitigate the potential failure of the device, if something should fail and cause the system to crash, it “fails open” and acts like a simple network bridge, forwarding packets on to the next device in the chain without performing any processing. If the same component were deployed in a fault-tolerant architecture, two devices would be deployed, ideally leveraging non-network-based failover mechanisms. Similarly, application infrastructure components are often isolated through a contained deployment model (like sandboxes) that prevents a failure – whether an outright crash or sudden massive consumption of resources – from impacting other applications.

Fault isolation is of increasing interest as it relates to cloud computing environments as part of a strategy to minimize the perceived negative impact of shared network, application delivery network, and server infrastructure.
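To make the fail-open behavior concrete, here is a minimal, hypothetical sketch of the pattern (not any particular vendor's implementation): an inline inspection step that degrades to simple forwarding when the inspection engine itself breaks.

```python
# Minimal sketch of the "fail open" pattern: an inline device that loses its
# inspection function but keeps passing traffic so availability is preserved.
# All names here are illustrative, not a real product API.

def inspect(payload: bytes) -> bytes:
    """Pretend WAF-style inspection; may raise if the inspection engine crashes."""
    if b"<script>" in payload:
        raise ValueError("blocked content")          # normal security decision
    return payload

def forward(payload: bytes) -> None:
    print(f"forwarding {len(payload)} bytes to the next hop")

def handle(payload: bytes, fail_open: bool = True) -> None:
    try:
        forward(inspect(payload))                    # normal path: inspect, then forward
    except ValueError:
        pass                                         # explicit block: drop the request
    except Exception:
        if fail_open:
            forward(payload)                         # engine failure: act like a bridge
        else:
            raise                                    # fail closed: availability suffers

handle(b"GET /index.html HTTP/1.1\r\n\r\n")
```

The trade-off is exactly the one described above: the fail-open path preserves availability at the cost of whatever security or optimization the device normally provides.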
Data Center Feng Shui

The right form-factor in the right location at the right time will maximize the benefits associated with cloud computing and virtualization. Feng Shui, simply defined, is the art of knowing where to place things to maximize benefits. There are many styles of Feng Shui but the goal of all forms is to create the most beneficial environment in which one can live, work, play, etc… based on the individual’s goals.

Historically, feng shui was widely used to orient buildings—often spiritually significant structures such as tombs, but also dwellings and other structures—in an auspicious manner. Depending on the particular style of feng shui being used, an auspicious site could be determined by reference to local features such as bodies of water, stars, or a compass. Feng shui was suppressed in China during the cultural revolution in the 1960s, but has since seen an increase in popularity, particularly in the United States. -- Feng Shui, Wikipedia

In the US, at least, Feng Shui has gained popularity primarily as it relates to interior design – the art of placing your furniture in the right places based on relationship to water, stars, and compass directions. Applying the art of Feng Shui to your data center architecture is not nearly as difficult as it may sound because essentially you’re doing the same thing: determining the best location (on or off-premise? virtual or physical? VNA or hardware?) for each network, application delivery network, and security component in the data center based on a set of organizational (business and operational) needs or goals. The underlying theory of Feng Shui is that location matters, and it is certainly true in the data center that location and form-factor matter to its harmony. The architectural decisions regarding a hybrid cloud computing infrastructure (a mix of virtual network appliances, hardware, and software) have an impact on many facets of operational and business goals.

Data Center Feng Shui: Reliability is not the Absence of Failure
But rather it is the ability to compensate for it. Redundancy. It’s standard operating procedure for everyone who deals with technology – even consumers. Within IT we’re a bit more stringent about how much redundancy we build into the data center. Before commoditization and the advent of cheap computing (a.k.a. cloud computing ) we worried about redundant power supplies and network connections. We leveraged fail-over as a means to ensure that when the inevitable happened, a second, minty-fresh server/application/switch was ready to take over without dropping so much as a single packet on the data center floor. Notice I said “inevitable.” That’s important, because we know with near-absolute certainty that inevitably hardware and software fails. Interestingly, it is only hardware that comes with an MTBF (Mean Time Between Failures) rating. It is nearly as inevitable that software (applications) will also experience some sort of failure – whether due to error or overload or because of a dependency on some other piece of software that will, too, inevitably fail because of hardware or internal issues. Failure happens. That doesn’t mean, however, that an application or an architecture or a network is unreliable. Reliability is not an absence of failure in the data center, it’s a measure of how quickly such failures can be compensated for. Being able to rely upon an application does not mean it never fails, it simply means that such failures as do occur are corrected for or otherwise addressed quickly, before they have an impact on the availability of the application. And that means the entire application delivery chain. ARCHITECTURAL RELIABILITY We’ve been building application delivery architectures for a while now with a key design goal in mind: built to fail. We assume that any given piece of the architecture might go belly up at any time, and architect a solution that takes that into consideration to ensure that availability is never (or as minimally as possible) impacted. Hardware. Software. Any given piece of a critical system should be able to fail without negatively impacting availability. Performance may degrade, but availability itself is maintained. This often takes the form of “standby” systems; duplicates of a given infrastructure or application service that, in the event of a failure, are ready to stand in for the primary and continue doing what needs to be done. They’re the second-stringers, the bench warmers, the idle resources that are the devil’s playground in the data center. And we’re getting rid of them. As we optimize the data center for cost and efficiency, we’re eliminating the redundant duplication (see what I did there?) within the architecture and replacing it with something more aligned with the business goals of maximizing the return on investment associated with all that hardware and software that makes the business go. We’re automating fail-over processes that no longer assume a secondary exists: instead, we automatically provision a new primary in the event of a failure from the much larger pool of resources that were once reserved. We’re modifying the notion of architectural reliability to mean we don’t need to fail-over, we’ll just fail-through instead. And that works, except when it doesn’t. SINGLE-POINTS of FAILURE The danger here is two-fold: first, that we will run short of resources and be unable to handle any failure and second, that we can guarantee that the provisioning process can occur nearly simultaneously with the failure. It can’t. At least not yet. 
And while we’re getting quite good at leveraging intelligent health monitoring and collaborative infrastructure architectures, we still haven’t figured out how to predict a failure. Auto-scaling works because it does not account for failure; it assumes infinite resources and consistent availability. We can tell when an application is nearing capacity and adjust the resources accordingly before it becomes necessary. And it is exactly that “before” that is important in maintaining availability and thus providing a reliable application. But we can’t predict a failure, and thus we can’t know when to begin provisioning until it’s essentially too late. There are only two viable solutions: pre-provisioning, which defeats the purpose of such real-time automation and scalability services in the first place, and reserved resources, which can have a deleterious effect on efficiency and costs – you’re purposefully creating a pool of idle resources again. Both tactics have the same effect: idle resources waiting to be needed, which runs contrary to one of the desired intents of implementing a virtualized or cloud computing-based infrastructure in the first place.

Thus, the definition of reliability as it pertains to our new, agile and cloud-based applications is directly related to the longest time required to either replace or provision any compromised component. Single points of failure, you see, are very bad for reliability – especially when they are virtualized and it may be the case that there are no resources available that can be used to “replace” the failed component.

This is particularly important to note as we start to virtualize the infrastructure. It sounds like a good deal: virtual network appliances dramatically decrease the CapEx associated with such investments, but operationally you still have the same challenges to address. You still need a redundant system, and the pair must reside on physically separate systems in case the hardware upon which the virtual network appliance is deployed itself fails. That’s true as well for applications; redundancy must be system-wide, which means two instances of the same application on the same physical device invites unreliability. And when you realize that you’re going to need a physical system for every instance of a virtual network appliance, you might start wondering why it was that you virtualized them in the first place – especially when you consider you exchanged nearly instantaneous serial-based fail-over for pretty fast network-based failover and a largely reduced capacity per instance. And, of course, you gave up any gains provided by purpose-built hardware acceleration, which cannot easily (or cheaply) be duplicated in a virtualized environment. Oh, and let’s not forget the potential of creating a single point of failure where there was none by eliminating the fail-to-wire option of so many infrastructure components.
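Before coming back to that fail-to-wire point, here is a rough, hypothetical illustration of why the “longest time required to replace or provision” figure above drives reliability. The timings are assumptions chosen for the sketch, not measurements.

```python
# Hypothetical comparison of recovery strategies. The durations are illustrative
# assumptions only; real values depend on the platform and the monitoring interval.

DETECTION_INTERVAL = 5.0      # seconds between health checks (assumed)

STRATEGIES = {
    "hot standby (stateful failover)": 0.5,    # promote an already-running peer
    "fail-through to remaining pool": 0.0,     # traffic simply shifts to survivors
    "provision new instance on demand": 180.0, # boot + configure a replacement (assumed)
}

def worst_case_downtime(switch_time: float) -> float:
    # Downtime is bounded by how long it takes to notice the failure
    # plus how long it takes to put a replacement in the data path.
    return DETECTION_INTERVAL + switch_time

for name, switch_time in STRATEGIES.items():
    print(f"{name:34s} ~{worst_case_downtime(switch_time):6.1f}s worst case")
```

The point of the arithmetic is simply that eliminating standby resources shifts the recovery time from "promote" to "provision," and that difference is what the business experiences as (un)reliability.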
Almost every proxy-based network component “fails to wire” in the event of a failure, resulting in a loss of functionality but not the ability to pass data, which means availability of the application is not compromised in the event of a failure, although security or other functionality might be. Yes, you gained architectural multi-tenancy and simplified provisioning, but the need for such an implementation is quickly being erased by the rush by vendors to provide true multi-tenancy for network-based infrastructure, and many of the gains in provisioning can be achieved using the same infrastructure 2.0 capable methods (APIs, SDKs) that are used to integrate virtual form factors.

The ability to react quickly, for agile operations, depends heavily on underlying architectural decisions. And it is the ability to react nearly instantaneously to failures throughout the entire infrastructure that enables a reliable, consistent application. Consider carefully the pros and cons of virtualization in every aspect of a deployment as it relates specifically to reliability, with an eye toward aligning architectural decisions with business and operational requirements. This includes the business making decisions regarding “mission critical” applications. Not every application is mission critical, and understanding which applications are truly vital will go a long way toward cutting costs in infrastructure and management. A mission-critical application reliability requirement of 100% will likely remove some components from the virtualization list and potentially impact decisions regarding resource allocation/reservation systems. Single points of failure must be eliminated in critical application delivery chains to ensure reliability. Failure will happen, eventually, and a reliable infrastructure takes that into account and ensures a timely response as a means to avoid downtime and its associated costs.

Related Posts
Data Center Feng Shui
Operational Risk Comprises More Than Just Security
The Number of the Counting Shall be Three (Rules of Thumb for Application Availability)
Data Center Feng Shui: Fault Tolerance and Fault Isolation
All Data Center Feng Shui posts on DevCentral
Architectural Multi-tenancy
The Question Shouldn’t Be Where are the Network Virtual Appliances but Where is the Architecture?
I CAN HAS DEFINISHUN of SoftADC and vADC?
The Devil is in the Details
VM Sprawl is Bad but Network Sprawl is Badder

Quarantine First to Mitigate Risk of VM App Stores
Internal processes may be the best answer to mitigating risks associated with third-party virtual appliances The enterprise data center is, in most cases, what aquarists would call a “closed system.” This is to say that from a systems and application perspective, the enterprise has control over what goes in. The problem is, of course, those pesky parasites (viruses, trojans, worms) that find their way in. This is the result of allowing external data or systems to enter the data center without proper security measures. For web applications we talk about things like data scrubbing and web application firewalls, about proper input validation codified by developers, and even anti-virus scans of incoming e-mail. But when we start looking at virtual appliances, at virtual machines, being hosted in “vm stores” much in the same manner as mobile applications are hosted in “app stores” today, the process becomes a little more complicated. Consider Stuxnet as a good example of the difficulty in completely removing some of these nasty contagions. Now imagine public AMIs or other virtual appliances downloaded from a “virtual appliance store”. Hoff first raised this as a potential threat vector a while back, and reintroduced it when it was tangentially raised by Google’s announcement it had “pulled 21 popular free apps from the Android Market” because “the apps are malware aimed at getting root access to the user’s device.” Hoff continues to say: This is going to be a big problem in the mobile space and potentially just as impacting in cloud/virtual datacenters as people routinely download and put into production virtual machines/virtual appliances, the provenance and integrity of which are questionable. Who’s going to police these stores? -- Christofer Hoff, “App Stores: From Mobile Platforms To VMs – Ripe For Abuse” Even if someone polices these stores, are you going to run the risk, ever so slight as it may be, that a dangerous pathogen may be lurking in that appliance? We had some similar scares back in the early days of open source, when a miscreant introduced a trojan into a popular open source daemon that was subsequently downloaded, compiled, and installed by a lot of people. It’s not a concept with which the enterprise is unfamiliar. THE DATA CENTER QUARANTINE (TANK) I cannot count the number of desperate pleas for professional advice and help with regards to “sick fish” that start with: I did not use a quarantine tank. A quarantine tank (QT) in the fish keeping hobby is a completely separate (isolated) tank maintained with the same water parameters as the display tank (DT). The QT provides a transitory stop for fish destined for the display tank that offers a chance for the fish to become acclimated to the water and light parameters of the system while simultaneously allowing the hobbyist to observe the fish for possible signs of infection. Interestingly, the QT is used before an infection is discovered, not just afterwards as is the case with people infected with highly contagious diseases. The reason fish are placed into quarantine even though they may be free of disease or parasites is because they will ultimately be placed into a closed system and it is nearly impossible to eradicate disease and parasites in a closed system without shutting it all down first. To avoid that catastrophic event, fish go into QT first and then, when it’s clear they are healthy, they can join their new friends in the display tank. Now, the data center is very similar to a closed system. 
Once a contagion gets into its systems, it can be very difficult to eradicate it. While there are many solutions to preventing contagion, one of the best solutions is to use a quarantine “tank” to ensure health of any virtual appliance prior to deployment. Virtualization affords organizations the ability to create a walled-garden, an isolated network environment, that is suitable for a variety of uses. Replicating production environments for testing and validation of topology and architecture is often proposed as the driver for such environments, but use as a quarantine facility is also an option. Quarantine is vital to evaluating the “health” of any virtual network appliance because you aren’t looking just for the obvious – worms and trojans that are detectable using vulnerability scans – but you’re looking for the stealth infection. The one that only shows itself at certain times of the day or week and which isn’t necessarily as interested in propagating itself throughout your network but is instead focused on “phoning home” for purposes of preparing for a future attack. It’s necessary to fire up that appliance in a constrained environment and then watch it. Monitor its network and application activity over time to determine whether or not it’s been infected with some piece of malware that only rears its ugly head when it thinks you aren’t looking. Within the confines of a quarantined environment, within the ‘turn it off and start it over clean’ architecture comprised of virtual machines, you have the luxury of being able to better evaluate the health of any third-party virtual machine (or application for that matter) before turning it loose in your data center. QUARANTINE in the DATA CENTER is not NEW The idea of quarantine in the data center is not new. We’ve used it for some time as an assist in dealing with similar situations; particularly end-users infected with some malware detectable by end-user inspection solutions. Generally we’ve used that information to quarantine the end-user on a specific network with limited access to data center resources – usually just enough to clean their environment or install the proper software necessary to protect them. We’ve used a style of quarantine to aid in the application lifecycle progression from development to deployment in production in the QA or ‘test’ phase wherein applications are deployed into an environment closely resembling the production environment as a means to ensure that configurations, dependencies and integrations are properly implemented and the application works as expected. So the concept is not new, it’s more the need to recognize the benefits of a ‘quarantine first’ policy and subsequently implementing such a process in the data center to support the use of third-party virtual network appliances. As with many cloud and virtualization-related challenges, part of the solution almost always involves process. It is in recognizing the challenges and applying the right mix of process, product and people to mitigate operational risks associated with the deployment of new technology and architectures. 
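For the “watch it in quarantine” step, here is a minimal, hypothetical sketch of the kind of review you might script against an outbound-connection log exported from the isolated environment. The log format, vendor hostnames, and function names are illustrative assumptions, not a real product interface.

```python
# Hypothetical sketch: flag "phone home" behavior from a quarantined virtual
# appliance by comparing observed outbound destinations to what the vendor
# documents as expected (update/licensing hosts, NTP, and so on).

EXPECTED_DESTINATIONS = {
    "updates.vendor.example",   # assumed, taken from vendor documentation
    "license.vendor.example",
    "pool.ntp.org",
}

def review_outbound(log_lines):
    """Each line is assumed to look like: '2011-03-02T03:14:00 conn dst=host.example port=443'."""
    suspicious = []
    for line in log_lines:
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        dst = fields.get("dst")
        if dst and dst not in EXPECTED_DESTINATIONS:
            suspicious.append((fields.get("port", "?"), dst))
    return suspicious

sample = [
    "2011-03-02T03:14:00 conn dst=updates.vendor.example port=443",
    "2011-03-02T03:14:09 conn dst=198.51.100.23 port=6667",   # unexpected, worth a look
]
for port, dst in review_outbound(sample):
    print(f"unexpected outbound connection to {dst} on port {port}")
```

The value is not in the script itself but in running it (or something like it) over days or weeks, so that behavior which only shows itself occasionally has a chance to surface before the appliance leaves quarantine.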
Related Posts
Cloud Control Does Not Always Mean ‘Do it yourself’
App Stores: From Mobile Platforms To VMs – Ripe For Abuse
Operational Risk Comprises More Than Just Security
The Strategy Not Taken: Broken Doesn’t Mean What You Think It Means
Cloud Chemistry 101
More Users, More Access, More Clients, Less Control
Get Your Money for Nothing and Your Bots for Free
Control, choice, and cost: The Conflict in the Cloud
The Corollary to Hoff’s Law

The Goldfish Effect
When you combine virtualization with auto-scaling without implementing proper controls, you run the risk of scaling yourself silly or, worse, broke. You virtualized your applications. You set up an architecture that supports auto-scaling (on-demand) to free up your operators. All is going well, until the end of the month. Applications are failing. Not just one, but all of them. After hours of digging into operational dashboards and logs and monitoring consoles you find the problem: one of the applications, which experiences extremely heavy processing demands at the end of the month, has scaled itself out too far and too fast for its environment. One goldfish has gobbled up the food and has grown too large for its bowl.

It’s not as crazy an idea as it might sound at first. If you haven’t implemented the right policies in the right places in your shiny new on-demand architecture, you might just be allowing for such a scenario to occur. Whether it’s due to unforeseen legitimate demand or a DoS-style attack, without the right limitations (policies) in place to ensure that an application has scaling boundaries, you might inadvertently cause a denial of service and outages for other applications by consuming the resources they need.

Automating provisioning and scalability is a Good Thing. It shifts the burden from people to technology, and it is often only through the codification of the processes IT follows to scale an application in a more static, manual network that inefficiencies can be discovered and subsequently eliminated. But an easily missed variable in this equation is the set of limitations that were once imposed by physical containment. An application can only be scaled out as far as its physical containers, and no further. Virtualization breaks an application free from its physical limitations and allows it to ostensibly scale out across a larger pool of compute resources located in various physical nooks and crannies across the data center.

But when you virtualize resources you will need to perform capacity planning in a new way. Capacity planning becomes less about physical resources and more about costs and priorities for processing. It becomes a concerted effort to strike a balance between applications in such a way that resources are efficiently used based on prioritization and criticality to the business rather than what’s physically available. It becomes a matter of metering and budgets and factoring costs into the auto-scaling process.
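A minimal sketch of the kind of guardrail this argues for: an auto-scaling decision that respects a per-application instance ceiling and a shared capacity budget. The application names, thresholds, and headroom rule are illustrative assumptions, not recommendations.

```python
# Illustrative auto-scaling guardrails: scale out only while the application is
# under its own ceiling AND the shared resource pool can spare the capacity.
# All numbers are assumptions for the sketch.

POOL_TOTAL_INSTANCES = 40          # shared capacity across all applications

APP_POLICY = {
    "billing":  {"max_instances": 12, "priority": 1},   # the month-end goldfish
    "intranet": {"max_instances": 4,  "priority": 3},
}

def can_scale_out(app: str, current: dict) -> bool:
    policy = APP_POLICY[app]
    in_use = sum(current.values())
    if current[app] >= policy["max_instances"]:
        return False                                # per-app ceiling reached
    if in_use >= POOL_TOTAL_INSTANCES:
        return False                                # shared pool exhausted
    # reserve a little headroom for higher-priority applications (assumed rule)
    headroom = sum(2 for _, p in APP_POLICY.items()
                   if p["priority"] < policy["priority"])
    return in_use + 1 <= POOL_TOTAL_INSTANCES - headroom

current = {"billing": 12, "intranet": 2}
print(can_scale_out("billing", current))   # False: its ceiling has been reached
print(can_scale_out("intranet", current))  # True: under ceiling, pool has room
```

Whether the policy lives in the orchestration system, the application delivery controller, or both, the point is the same: the limits that physical containment once imposed now have to be expressed explicitly.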
Sessions, Sessions Everywhere

If you’re replicating session state across application servers you probably need to rethink your strategy. There are other options – more efficient options – than wasting RAM and, ultimately, money.

Although the discussion of Oracle’s “cloud in a box” announcement at OpenWorld dominated much of the tweet-stream this week, there were other discussions going on that proved to be not only interesting but a good reminder of how cloud computing has brought to the fore the importance of architecture. Foremost in my mind was what started as a lamentation on the fact that Amazon EC2 does not support multicasting and evolved into a discussion on why that would cause grief for those deploying applications in the environment. Remember that multicast is essentially spraying the same data to a group of endpoints and is usually leveraged for streaming media topologies:

In computer networking, multicast is the delivery of a message or information to a group of destination computers simultaneously in a single transmission from the source creating copies automatically in other network elements, such as routers, only when the topology of the network requires it. -- Wikipedia, multicast

As it turns out, a primary reason behind the need for multicasting in the application architecture revolves around the mirroring of session state across a pool of application servers. Yeah, you heard that right – mirroring session state across a pool of application servers. The first question has to be: why? What is it about an application that requires this level of duplication?

MULTICASTING for SESSIONS

There are three reasons why someone would want to use multicasting to mirror session state across a pool of application servers. There may be additional reasons that aren’t as common and if so, feel free to share.

1. The application relies on session state and, when deployed in a load balanced environment, broke because the tight coupling between user and session state was not respected by the load balancer. This is a common problem when moving from dev/qa to production and is generally caused by using a load balancing algorithm without enabling persistence, a.k.a. sticky sessions.
2. The application requires high availability that necessitates architecting a stateful-failover architecture. By mirroring sessions to all application servers, if one fails (or is decommissioned in an elastic environment) another can easily re-establish the coupling between the user and their session. This is not peculiar to application architecture – load balancers and application delivery controllers mirror their own “session” state across redundant pairs to achieve a stateful failover architecture as well.
3. Some applications, particularly those that are collaborative in nature (think white-boarding and online conferences), “spray” data across a number of sessions in order to enable the real-time sharing aspect of the application. There are other architectural choices that can achieve this functionality, but there are tradeoffs to all of them and in this case it is simply one of several options.

THE COST of REPLICATING SESSIONS

With the exception of addressing the needs of collaborative applications (and even then there are better options from an architectural point of view) there are much more efficient ways to handle the tight coupling of user and session state in an elastic or scaled-out environment.
The arguments against multicasting session state are primarily around resource consumption, which is particularly important in a cloud computing environment. Consider that the typical session state is 3-200 KB in size (Session State: Beyond Soft State). Remember that if you’re mirroring every session across an entire cluster (pool) of application servers, each server must use memory to store every one of those sessions. Each mirrored session, then, is going to consume resources on every application server. Every application server has, of course, a limited amount of memory it can utilize. It needs that memory for more than just storing session state – it must also store connection tables, its own configuration data, and of course it needs memory in which to execute application logic. If you consume a lot of the available memory storing the session state from every other application server, you are necessarily reducing the amount of memory available to perform other important tasks. This reduces the capacity of the server in terms of users and connections, it reduces the speed with which it can execute application logic (which translates into reduced response times for users), and it operates on a diminishing-returns principle. The more application servers you need to scale – and you’ll need more, more frequently, using this technique – the less efficient each added application server becomes, because a good portion of its memory is required simply to maintain the session state of all the other servers in the pool. It is exceedingly inefficient and, when leveraging a public cloud computing environment, more expensive. It’s a very good example of the diseconomy of scale associated with traditional architectures – it results in a “throw more ‘hardware’ at the problem, faster” approach to scalability.

BETTER ARCHITECTURAL SOLUTIONS

There are better architectural solutions to maintaining session state for every user.

SHARED DATABASE

Storing session state in a shared database is a much more efficient means of mirroring session state and allows for the same guarantees of consistency when experiencing a failure. If session state is stored in a database then regardless of which application server instance a user is directed to, that application server has access to its session state. The interaction between the user and application becomes:

1. User sends request
2. Clustering/load balancing solution routes to application server
3. Application server receives request, looks up session in database
4. Application server processes request, creates response
5. Application server stores updated session in database
6. Application server returns response

If a single database is problematic (because it is a single point of failure) then multicasting or other replication techniques can be used to implement a dual-database architecture. This is somewhat inefficient, but far less so than doing the same at the application server layer.

PERSISTENCE-BASED LOAD BALANCING

It is often the case that the replication of session state is implemented in response to wonky application behavior occurring only when the application is deployed in a scalable environment, a.k.a. a load balancing solution is introduced into the architecture. This is almost always because the application requires tight coupling between user and session and the load balancing is incorrectly configured to support this requirement.
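Before continuing with the persistence approach, here is a back-of-the-envelope sketch that puts rough numbers on the replication cost described above. The per-session size comes from the 3-200 KB range cited in the post; the session and server counts are assumptions for illustration.

```python
# Back-of-the-envelope cost of full session replication versus storing each
# session once. Uses the 3-200 KB per-session range cited above; the number of
# concurrent sessions and servers are illustrative assumptions.

SESSION_KB = 200            # worst case from the cited range
SESSIONS = 50_000           # concurrent sessions across the application (assumed)
SERVERS = 8                 # application servers in the pool (assumed)

def gb(kb: float) -> float:
    return kb / (1024 * 1024)

single_copy_gb = gb(SESSION_KB * SESSIONS)              # each session lives on one server
replicated_gb = gb(SESSION_KB * SESSIONS * SERVERS)     # every session mirrored to every server

print(f"one copy per session : {single_copy_gb:6.1f} GB total across the pool")
print(f"mirrored to all {SERVERS}    : {replicated_gb:6.1f} GB total, "
      f"{single_copy_gb:4.1f} GB consumed on EVERY server")
```

Under these assumptions each server burns roughly 10 GB of RAM just holding everyone else's sessions, which is memory not available for connections or application logic; that is the diminishing-returns effect in concrete terms.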
Almost every load balancing solution – hardware, software, virtual network appliance, infrastructure service – is capable of supporting persistence, a.k.a. sticky sessions. This solution requires, however, that the load balancing solution of choice be configured to support the persistence. Persistence (also sometimes referred to as “server affinity” when implemented by a clustering solution) can be configured in a number of ways. The most common configuration is to leverage the automated session IDs generated by application servers, e.g. PHPSESSID, ASPSESSIONID. These IDs are contained in the HTTP headers and are, as a matter of fact, how the application server “finds” the appropriate session for any given user’s request. The load balancer intercepts every request (it does anyway) and performs the same type of lookup on its own session table (which is much, much higher capacity than an application server and leverages the same high-performance lookups used to store connection and network session tables) and routes the user to the appropriate application server based on the session ID. The interaction between the user and application becomes:

1. User sends request
2. Clustering/load balancing solution finds, if it exists, the session-to-application-server mapping. If it does not, it chooses the application server based on the load balancing algorithm and configured parameters
3. Application server receives request
4. Application server processes request, creates response
5. Application server returns response
6. Clustering/load balancing solution creates the session-to-application-server mapping if it did not already exist

Persistence can generally be based on any data in the HTTP header or payload, but using the automatically generated session IDs tends to be the most common implementation. (A minimal sketch of this session-to-server mapping appears below.)

YOUR INFRASTRUCTURE, GIVE IT TO ME

Now, it may be the case that the multicasting architecture is the right one. It is impossible to say it’s never the right solution because there are always applications and specific scenarios in which an architecture that may not be a good idea in general is, in fact, the right solution. It is likely the case, however, in most situations that it is not the right solution and has more than likely been implemented as a workaround in response to problems with application behavior when moving through a staged development environment. This is one of the best reasons why the use of a virtual edition of your production load balancing solution should be encouraged in development environments. The earlier a holistic strategy for application design and architecture can be employed, the fewer complications will be experienced when the application moves into the production environment. Leveraging a virtual version of your load balancing solution during the early stages of the development lifecycle can also enable developers to become familiar with production-level infrastructure services such that they can employ a holistic, architectural approach to solving application issues. See, it’s not always because developers don’t have the know-how; it’s because they don’t have access to the tools during development and therefore can’t architect a complete solution. I recall a developer’s plaintive query after a keynote at [the now defunct] SD West conference a few years ago that clearly indicated a reluctance to even ask the network team for access to their load balancing solution to learn how to leverage its services in application development because he knew he would likely be denied.
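The sketch referenced above: a toy version of the session-to-server mapping a load balancer maintains when persisting on a session ID. The cookie name, server names, and round-robin stand-in are illustrative; real devices keep this mapping in optimized session tables, not Python dictionaries.

```python
# Toy illustration of persistence ("sticky sessions"): route each request with a
# known session ID back to the server that owns that session; otherwise pick a
# server with the load balancing algorithm and remember the choice.
import itertools

servers = ["app01", "app02", "app03"]
round_robin = itertools.cycle(servers)        # stand-in for the LB algorithm
persistence_table = {}                        # session ID -> application server

def route(headers: dict) -> str:
    # assume the app server issues a cookie such as PHPSESSID=<id>
    cookie = headers.get("Cookie", "")
    session_id = dict(
        part.strip().split("=", 1) for part in cookie.split(";") if "=" in part
    ).get("PHPSESSID")

    if session_id and session_id in persistence_table:
        return persistence_table[session_id]          # sticky: same server as before
    server = next(round_robin)                        # new session: use the algorithm
    if session_id:
        persistence_table[session_id] = server        # remember the mapping
    return server

print(route({"Cookie": "PHPSESSID=abc123"}))  # app01 (new mapping created)
print(route({"Cookie": "PHPSESSID=abc123"}))  # app01 again (persisted)
print(route({}))                              # app02 (no session yet)
```

Nothing in the application changes; the only state added to the architecture is the small mapping table in the load balancer, which is the whole point of the approach.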
Network and application delivery network pros should encourage the use of, and tinkering with, virtual versions of application delivery controllers/load balancers in the application development environment as much as possible if they want to keep infrastructure and application architecture-related issues from cropping up during production deployment. A greater understanding of application-infrastructure interaction will enable more efficient, higher performing applications in general and reduce the operational expenses associated with deploying applications that use inefficient methods, such as replication of session state, to address application architectural constraints.

Related blogs & articles:
Applying Scalability Patterns to Infrastructure Architecture
Scalability Only One Half the Reliability Equation
Service Virtualization Helps Localize Impact of Elastic Scalability
Web 2.0: Integration, APIs, and Scalability
Automating scalability and high availability services
To Take Advantage of Cloud Computing You Must Unlearn, Luke.
Scalability with multiple networks for Virtual Servers ...
Cloud Lets You Throw More Hardware at the Problem Faster
And That, Young Cloudwalker, Is Why You Fail

Data Center Feng Shui: Normalizing Phased Deployment with Virtualized Network Appliances
Normalizing deployment environments from dev through production can eliminate issues earlier in the application lifecycle, speed time to market, and give devops the means by which their emerging discipline can mature with less risk.

One of the big “trends” in cloud computing is to use a public cloud as an alternative environment for development and test. On the surface, this makes sense and is certainly a cost-effective means of managing the highly variable environment that is development. But unless you can actually duplicate the production environment in a public cloud, the benefits might be offset by the challenges of moving through the rest of the application lifecycle.

NORMALIZATION LEADS to GREATER EFFICIENCIES

One of the reasons developers don’t have an exact duplicate of the production environment is cost. Configuration aside, the cost of the hardware and software duplication across a phased deployment environment is simply too high for most organizations. Thus, developers are essentially creating applications in a vacuum. This means as they move through the application deployment phases they are constantly barraged with new and, shall we say, interesting situations caused or exposed by differences in the network and application delivery network.

Example: One of the most common problems that occurs when moving an application into a scalable production environment revolves around persistence (stickiness). Developers, not having the benefit of testing their creation in a load balanced environment, may not be aware of the impact of a load balancer on maintaining the session state of their application. A load balancer, unless specifically instructed to do so, does not care about session state. This is also true, in case you were thinking of avoiding this by going “public” cloud, in a public cloud. It’s strictly a configuration thing, but it’s a thing that is often overlooked. This causes problems when developers or customers start testing the application and discover it’s acting “wonky”. Depending on the configuration of the load balancer, this wonkiness (yes, that is a technical term, thank you very much) can manifest in myriad ways and it can take precious time to pinpoint the problem and implement the proper solution. The solution should be trivial (persistence/sticky sessions based on a session ID that should be automatically generated and inserted into the HTTP headers by the application server platform) but may not be. In the event of the latter it may take time to find the right unique key upon which to persist sessions, and in some few cases it may require a return to development to modify the application appropriately. This is all lost time and, because of the way in which IT works, lost money. It’s also possibly lost opportunity and mindshare if the application is part of an organization’s competitive advantage.

Now, assume that the developer had a mirror image of the production environment. S/He could be developing in the target environment from the start. These little production deployment “gotchas” that can creep up will be discovered early on as the application is being tested for accuracy of execution, and thus time lost to troubleshooting and testing in production is effectively offset by what is a more agile methodology.

DEVELOPING DEVOPS as a DISCIPLINE

Additionally, developers can begin to experiment with other infrastructure services that may be available but were heretofore unknown (and therefore untrusted).
If a developer can interact with infrastructure services in development, testing and playing with the services to determine which ones are beneficial and which ones may not be, they can develop a more holistic approach to application delivery and control the way in which the network interacts with their application. That’s a boon for the operations and network teams, too, as they are usually unfamiliar with the application and must take time to learn its nuances and quirks and adjust/fine-tune the network and application delivery network to meet the needs of the application. If the developer has already performed these tasks, the only thing left for the ops and network teams is to implement and verify the configuration. If the two networks – production and virtual production – are in synch, this should eliminate the additional time necessary and make the deployment phase of the application lifecycle less painful.

If not developers, ops, or network teams, then devops can certainly benefit from a “dev” environment themselves in which they can hone their skills and develop the emerging discipline that is devops. Devops requires integration and development of automation systems that include infrastructure, which means devops will need the means to develop those systems, scripts, and applications used to integrate infrastructure into operational management in production environments. As with developers, this is an iterative and ongoing process that probably shouldn’t use production as an experimental environment. Thus, devops, too, will increasingly find a phased and normalized (commoditized) deployment approach a benefit to developing their libraries and skills.

This assumes the use of virtual network appliances (VNA) in the development environment. Unfortunately the vast majority of hardware-only solutions are not available as VNAs today, which makes a perfect mirrored copy of production at this time unrealistic. But for those pieces of the infrastructure that are available as a VNA, it should be an option to deploy them as a copy of production as the means to enable developers to better understand the relationship between their application and the infrastructure required to deliver and secure it. Infrastructure services that most directly impact the application – load balancers, caches, application acceleration, and web application firewall – should be mirrored into development for use by developers as often as possible because it is most likely that they will be the cause of some production-level error or behavioral quirk that needs to be addressed.

The bad news is that if there are few VNAs with which to mirror the production environment, there are even fewer that can be/are available in a public cloud environment. That means that the cost savings associated with developing “in the cloud” may be offset by the continuation of a decades-old practice which results in little more than a game of “throw the application over the network wall.”

Related Posts
All Data Center Feng Shui posts on DevCentral
Sessions and Cookies and Persistence, oh my!
Cloud Computing: Is Your Cloud Sticky? It Should Be
Data Center Feng Shui: SSL
Like most architectural decisions, the choice between hardware and a virtual server is not mutually exclusive. The argument goes a little something like this: The increases in raw compute power available in general purpose hardware eliminate the need for purpose-built hardware. After all, if general purpose hardware can sustain the same performance for SSL as purpose-built (specialized) hardware, why pay for the purpose-built hardware? Therefore, ergo, and thusly it doesn’t make sense to purchase a hardware solution when all you really need is the software, so you should just acquire and deploy a virtual network appliance.

The argument, which at first appears to be a sound one, completely ignores the fact that the same increases in raw compute power for general purpose hardware are also applicable to purpose-built hardware and the specialized hardware cards that provide acceleration of specific functions like compression and RSA operations (SSL). But for the purposes of this argument we’ll assume that performance, in terms of RSA operations per second, is about equal between the two options. That still leaves two very good situations in which a virtualized solution is not a good choice.

1 COMPLIANCE with FIPS 140

For many industries – federal government, banking, and financial services among the most common – SSL is a requirement, even internal to the organization. These industries also tend to fall under the requirement that the solution providing SSL be FIPS 140-2 or higher compliant. If you aren’t familiar with FIPS or the different “levels” of security it specifies, then let me sum up: FIPS 140-2 Security Level 2 requires a level of physical security that is not a part of Level 1, beyond the requirement that hardware components be “production grade”, which we assume covers the general purpose hardware deployed by cloud providers.

Security Level 2 improves upon the physical security mechanisms of a Security Level 1 cryptographic module by requiring features that show evidence of tampering, including tamper-evident coatings or seals that must be broken to attain physical access to the plaintext cryptographic keys and critical security parameters (CSPs) within the module, or pick-resistant locks on covers or doors to protect against unauthorized physical access. -- FIPS 140-2, Wikipedia

FIPS 140-2 requires specific physical security mechanisms to ensure the security of the cryptographic keys used in all SSL (RSA) operations. The private and public keys used in SSL, and its related certificates, are essentially the “keys to the kingdom”. The loss of such keys is considered quite the disaster because they can be used to (a) decrypt sensitive conversations/transactions in flight and (b) masquerade as the provider by using the keys and certificates to make more authentic phishing sites. More recently, keys and certificates – PKI (Public Key Infrastructure) – have been an integral component of providing DNSSEC (DNS Security) as a means to prevent DNS cache poisoning and hijacking, which has bitten several well-known organizations in the past two years. Obviously you have no way of ensuring or even knowing if the general purpose compute upon which you are deploying a virtual network appliance has the proper security mechanisms necessary to meet FIPS 140-2 compliance.
Therefore, ergo, and thusly if FIPS Level 2 or higher compliance is a requirement for your organization or application, then you really don’t have the option to “go virtual” because such solutions cannot meet the physical requirements necessary. 2 RESOURCE UTILIZATION A second consideration, assuming performance and sustainable SSL (RSA) operations are equivalent, is the resource utilization required to sustain that level of performance. One of the advantages of purpose built hardware that incorporates cryptographic acceleration cards is that it’s like being able to dedicate CPU and memory resources just for cryptographic functions. You’re essentially getting an extra CPU, it’s just that the extra CPU is automatically dedicated to and used for cryptographic functions. That means that general purpose compute available for TCP connection management, application of other security and performance-related policies, is not required to perform the cryptographic functions. The utilization of general purpose CPU and memory necessary to sustain X rate of encryption and decryption will be lower on purpose-built hardware than on its virtualized counterpart. That means while a virtual network appliance can certainly sustain the same number of cryptographic transactions it may not (likely won’t) be able to do much other than that. The higher the utilization, too, the bigger the impact on performance in terms of latency introduced into the overall response time of the application. You can generally think of cryptographic acceleration as “dedicated compute resources for cryptography.” That’s oversimplifying a bit, but when you distill the internal architecture and how tasks are actually assigned at the operating system level, it’s an accurate if not abstracted description. Because the virtual network appliance must leverage general purpose compute for what are computationally expensive and intense operations, that means there will be less general purpose compute for other tasks, thereby lowering the overall capacity of the virtualized solution. That means in the end the costs to deploy and run the application are going to be higher in OPEX than CAPEX, while the purpose-built solution will be higher in CAPEX than in OPEX – assuming equivalent general purpose compute between the virtual network appliance and the purpose-built hardware. IS THERE EVER A GOOD TIME to GO VIRTUAL WHEN SSL is INVOLVED? Can you achieve the same performance gains by running a virtual network appliance on general purpose compute hardware augmented by a cryptographic acceleration module? Probably, but that assumes that the cryptographic module is one with which the virtual network appliance is familiar and can support via hardware drivers and part of the “fun” of cloud computing and leased compute resources is that the underlying hardware isn’t supposed to be a factor and can vary from cloud to cloud and even from machine to machine within a cloud environment. So while you could achieve many of the same performance gains if the cryptographic module were installed on the general purpose hardware (in fact that’s how we used to do it, back in the day) it would complicate the provisioning and management of the cloud computing environment which would likely raise the costs per transaction, defeating one of the purposes of moving to cloud in the first place. 
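As a rough way to reason about the resource-utilization argument above, here is a hypothetical capacity sketch. The handshake cost, load, and core count are assumptions chosen for illustration, not benchmarks.

```python
# Hypothetical capacity math: how much general purpose CPU is left for everything
# else once software SSL termination takes its share. All numbers are assumed
# for illustration; real figures depend on key size, cipher, and hardware.

CORES = 8
HANDSHAKES_PER_SEC = 1500          # new SSL/TLS sessions per second (assumed load)
HANDSHAKE_CORE_SEC = 0.002         # CPU-seconds per RSA handshake in software (assumed)

def remaining_capacity(offloaded: bool) -> float:
    crypto_cores = 0.0 if offloaded else HANDSHAKES_PER_SEC * HANDSHAKE_CORE_SEC
    return max(CORES - crypto_cores, 0.0)

print(f"software SSL : {remaining_capacity(False):.1f} of {CORES} cores left "
      f"for policies, TCP management, and application logic")
print(f"with offload : {remaining_capacity(True):.1f} of {CORES} cores left")
```

The exact numbers matter far less than the shape of the trade-off: whatever general purpose compute the crypto consumes is compute the virtual appliance cannot spend on everything else it is supposed to be doing.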
If you don’t need FIPS 140-2 or higher compliance, if performance and capacity (and therefore costs) are not a factor, or if you’re simply using the virtualized network appliance as part of test or QA efforts or a proof of concept, certainly – go virtual. If you’re building a hybrid cloud implementation, you’re likely to need a hybrid application delivery architecture. The hardware components provide the fault tolerance and reliability required, while virtual components offer the means by which corporate policies and application-specific behavior can be deployed to external cloud environments with relative ease. Cookie security and concerns over privacy may require encrypted connections regardless of application location, and in the case where physical deployment is not possible or feasible (financially), a virtual equivalent to provide that encryption layer is certainly a good option.

It’s important to remember that this is not necessarily a mutually exclusive choice. A well-rounded cloud computing strategy will likely include both hardware and virtual network appliances in a hybrid architecture. IT needs to formulate a strong strategy and guidance regarding what applications can and cannot be deployed in a public cloud computing environment, and certainly the performance/capacity and compliance requirements of a given application in the context of its complete architecture – network, application delivery network, and security – should be part of that decision-making process. The question of whether to go virtual or physical is not binary. The operator is, after all, OR and not XOR. The key is choosing the right form factor for the right environment based on both business and operational needs.

F5 Friday: The Rules for the Game of Application Performance Tag
It’s an integration thing. One of the advantages of deploying an application delivery controller (ADC) instead of a regular old load balancer is that it is programmable – or at least it is if it’s an F5 BIG-IP. That means you have some measure of control over application data as it’s being delivered to end-users and can manipulate that data in various ways depending on the context of the request and the response. While an ADC has insight into the end-user environment – from network connection type and conditions to platform and location – and can therefore make adjustments to delivery policies dynamically based on that information, it can’t gather the kind of end-to-end application performance metrics both business stakeholders and developers need.

This data, the end-user view of application performance, is increasingly important. Back in April Google noted that “page speed” would be incorporated into its ranking algorithm, with faster-loading pages being ranked higher than slower pages. And we all know that the increasingly digital generation joining the ranks of corporate end-users has grown accustomed to fast, responsive web applications no matter what device they’re using or from what location they’re coming. Application performance is a big deal. The trick, however, is that you have to know how fast (or slow) your pages are loading before you can do something about them. That generally means employing the services of an application performance monitoring solution like Keynote or Gomez (now a part of Compuware). The way in which these solutions collect application performance data is through instrumentation of web applications, i.e. every page for which you want a measurement must be modified to include a small piece of Javascript that enables the reporting of performance back to the central system. Administrators and business stakeholders can then generate reports (and/or alerts) based on that data.

IT’S an INVESTMENT

The time investment required to instrument every page for which you (or the business) desire metrics is significant, especially if the instrumentation is occurring after deployment rather than as a part of the development lifecycle process. Instrumentation at any time incurs the risk of error, too, as it’s generally done manually in a programmatic way. With any code-based solution – whether operational script or part of an application – there’s always the chance of making a mistake. In the case of web applications that mistake can be more costly, as it’s visible to the end-user (customer) and may cause the application to be “unavailable”. It’s no surprise, then, that while most agree on the importance of understanding and measuring application performance, many organizations do not place a high priority on actually implementing a solution.

dynaTrace recently conducted a study on performance management in large and small companies. The quick facts paint a horrible picture. 60 percent of the companies admit that they do not have any performance management processes installed or what they have is ineffective. Half of the companies who answered that they have performance management processes admitted that they are doing it only in a reactive way when problems occur. One third of all companies said that management is not supporting performance management properly. From this data we can obviously conclude that performance management is not a primary interest in most companies. -- Week 22 – Is There a Business Case for Application Performance?
While the data from the dynaTrace study is interesting, it ignores how the reality of implementing APM solutions impact the ability and/or desire to deploy such solutions. One of the reasons many companies have no performance management processes is because they are unable to instrument the application in the first place. For some, it’s because the application is “packaged”; it’s a closed source, third-party application that can’t be instrumented via traditional code-based extension. For others it may be the case that instrumentation is possible, but the application is frequently updated and instrumentation must be done with every update and necessarily extends the application development lifecycle. AUTOMATED INSTRUMENTATION for ALL APPLICATIONS It is at this point that the game of application performance “tag” comes in handy and alleviates the risk associated with manually instrumenting pages and enables the monitoring of packaged and other closed-source applications. Even though we often refer to BIG-IP as a “load balancer” it really is an application delivery controller. It’s a platform, imbued with the ability to programmatically modify application data on-demand. Using iRules as a network-side scripting solution, architects and developers and administrators can control and manage application data in real-time without requiring modification to applications. It is by leveraging this capability that we are able to automatically inject the instrumentation code required by Gomez to monitor application performance. The BIG-IP “tags” application response data with the appropriate end-user monitoring code which enables the ability for APM providers like Gomez to monitor any application. This is particularly interesting (and useful) when you consider the inherent difficulties of measuring performance not only from packaged applications but from an off-premise cloud computing environment. By deploying a virtual BIG-IP (a virtual network appliance) the same network-side script used to inject the appropriate client-side script to enable application performance monitoring can be used to instrument off-premise cloud computing deployed applications on-demand. This means consistent application performance measurements of applications regardless of location. Even if the application “moves” from the local data center to an off-premise cloud, the functionality can remain simply by including a virtual BIG-IP with the appropriate iRules deployed. The GOMEZ SOLUTION F5 has partnered with Gomez to make the process of implementing this instrumentation for their service almost trivial. The nature of iRules, however, makes it possible to duplicate this effort with any application performance monitoring solution which relies upon a client-side script to collect performance metrics from end-users. The programmatic nature of iRules is such that organizations will enjoy complete control over the instrumentation, including the ability to extend the functionality such that the injection of the Javascript happens only in certain situations. For example, it might be done based on the type of client or client network conditions. It might be only injected for a new end-user as determined by the existence of a cookie. It might be injected only for certain pages of an application. It can, essentially, be based on any contextual variable to which the ADC, the BIG-IP, has access. 
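The production implementation is an iRule (referenced below), but the pattern itself is simple enough to sketch generically. Here is a hypothetical illustration of conditionally injecting a monitoring script into an HTML response; the script URL, cookie name, and condition are placeholders, not the actual Gomez snippet or F5 code.

```python
# Generic sketch of the "tag on the way out" pattern: if the response is HTML
# and the client meets some condition, insert a monitoring <script> tag before
# </head>. The script URL and cookie name below are placeholders.

MONITOR_TAG = '<script src="/instrumentation/monitor.js"></script>'

def inject_tag(status: int, headers: dict, body: str, cookies: dict) -> str:
    if status != 200:
        return body
    if "text/html" not in headers.get("Content-Type", ""):
        return body                       # only instrument HTML pages
    if cookies.get("opt_out") == "1":
        return body                       # example of a contextual condition
    # insert just before </head>, leaving the rest of the page untouched
    return body.replace("</head>", MONITOR_TAG + "</head>", 1)

page = "<html><head><title>demo</title></head><body>hi</body></html>"
print(inject_tag(200, {"Content-Type": "text/html"}, page, {}))
```

Because the injection happens in the delivery path rather than in the application, the same logic follows the application wherever it is deployed, which is what makes the approach work for packaged and cloud-hosted applications alike.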
You can read more details (including the code and the specific “rules” for implementation) in the tech tip from Joe Pruitt, “Automated Gomez Performance Monitoring”, and you can download the full implementation in the iRules CodeShare under GomezInjection.

I CAN HAS DEFINISHUN of SoftADC and vADC?
In the networking side of the world, vendors often seek to differentiate their solutions not just on features and functionality but on form-factor as well. Using a descriptor to impart an understanding of the deployment form-factor of a particular solution has always been quite common: appliance, hardware, platform, etc… Sometimes these terms come from analysts; other times they come from vendors themselves. Regardless of where they originate, they quickly propagate and unfortunately often do so without the benefit of a clear definition.

A reader recently asked a question that reminded me we’ve done just that as cloud computing and virtualization creep into our vernacular. Quite simply, the question was, “What’s the definition of a Soft ADC and vADC?” That’s actually an interesting question, as it’s more broadly applicable than just to ADCs. For example, the last several years we’ve been hearing about “Soft WOC (WAN Optimization Controller)” in addition to just plain old WOC, and the definition of Soft WOC is very similar to Soft ADC. The definitions, even if not always well understood or consistently used, are consistent across the entire application delivery realm – from WAN to LAN to cloud. So this post addresses the question in relation to ADCs more broadly, as there’s an emerging “xADC” model that should probably be mentioned as well. Let’s start with the basic definition of an Application Delivery Controller (ADC) and go from there, shall we?

ADC

An application delivery controller is a device that is typically placed in a data center between the firewall and one or more application servers (an area known as the DMZ). First-generation application delivery controllers primarily performed application acceleration and handled load balancing between servers. The latest generation of application delivery controllers handles a much wider variety of functions, including rate shaping and SSL offloading, as well as serving as a Web application firewall.

If you said an application delivery controller was a “load balancer on steroids” (which is how I usually describe them to the uninitiated) you wouldn’t be far from the truth. The core competency of an ADC is load balancing, and from that core functionality has been derived, over time, the means by which optimization, acceleration, security, remote access, and a wealth of other functions directly related to application delivery in scalable architectures can be applied in a unified fashion. Hence the use of the term “Unified Application Delivery.” If you prefer a gaming metaphor, an application delivery controller is like a multi-classed D&D character, probably a 3e character, because many of the “extra” functions available in an ADC are more like skills or feats than class abilities.

SOFT ADC

A “Soft ADC”, then, is simply an ADC in software form, deployed on commodity hardware. That hardware may or may not have additional hardware processing (like PCI-based SSL acceleration) to assist in offloading compute-intense processes, and the integration of the software with that hardware varies from vendor to vendor. Soft ADCs are sometimes offered as “softpliances” (many people hate this term): an appliance comprised of commodity hardware pre-loaded and configured with the ADC software. This option allows the vendor to harden and optimize the operating system on which the Soft ADC runs, which can be advantageous to the organization as it will not need to worry about upgrades and/or patches to the solution impacting the functionality of the Soft ADC.
This option can also result in higher capacity and better performance for the ADC and the applications it manages, as the operating system’s network stack is often “tweaked” and “tuned” to support the application delivery functions of the Soft ADC.

VIRTUAL ADC (vADC)

A “vADC” is a virtualized version of an ADC. The ADC may or may not have first been a “Soft ADC”, as in the case of BIG-IP, which is not available as a “Soft ADC” but is available as a traditional hardware ADC or a virtual ADC. vADCs are ADCs deployed in a virtual network appliance (VNA) form factor, as an image compatible with modern virtualization platforms (VMware, Xen, Hyper-V).

ADC as a SERVICE

There is an additional “type” of ADC emerging, mainly because of proprietary virtual image formats in clouds like Amazon’s: the “ADC as a service”, which is offered as a provisionable service within a specific cloud computing environment and is not portable (or usable) outside that environment. In all other respects the “ADC as a service” is indistinguishable from the vADC, as it, too, is deployed on commodity hardware and lacks integration with the underlying hardware platform or available acceleration chipsets.

A PLACE for EVERYTHING and EVERYTHING in its PLACE

In the general category of application delivery (and most networking solutions as well) we can make the following abstractions regarding these definitions:

“Solution” – A traditional hardware-based “solution”.

Soft “Solution” – A traditional hardware-based solution in a software form-factor that can be deployed on an “appliance” or commodity hardware.

v“Solution” – A traditional hardware-based solution in a virtualized form-factor that can be deployed as a virtual network appliance (VNA) on a variety of virtualization platforms.

“Solution” as a Service* – A traditional hardware-based solution in a proprietary form-factor (software or virtual) that is not usable or portable outside the environment in which it is offered.

So if we were to tackle “Soft WOC” as well, we’d find that the general definition – a traditional hardware-based solution in a software form-factor – also fits that category of solution.

It may seem to follow logically that any version of an ADC (or network solution) is “as good” as the next, given that the core functionality is almost always the same regardless of form factor. There are, however, pros and cons to each form-factor that should be taken into consideration when designing an architecture that may take advantage of an ADC. In some cases a Soft ADC or vADC will provide the best value, in others a traditional hardware ADC, and in many cases a highly scalable and flexible architecture will take advantage of both in the appropriate places.

*Some solutions offered “as a service” are more akin to SaaS in that they are truly web services, regardless of underlying implementation, that are “portable” because they can be accessed from anywhere, though they cannot be “moved” or integrated internally as private solutions.