The Applications of Our Lives
The Internet of Things will soon become The Internet of Nouns There are a few 'The ______ of Our Lives' out there: Days. Time. Moments. Love. They define who we are, where we've been and where we are going. And today, many of those days, times, moments and loves interact with applications. Both the apps we tap and the back end applications used to chronicle these events have become as much a part of our lives as the happenings themselves. The app, Life. As reported on umpteen outlets yesterday, Twitter went down for about an hour. As news broke, there were also some fun headlines like, Twitter goes down, chaos and productivity ensue, Twitter is down. NFL free agency should be postponed, Twitter is down, let the freak-out commence and Twitter goes down, helps man take note it’s his wife’s birthday. It is amazing how much society has come to rely on social media to communicate. Another article, Why Twitter Can’t Keep Crashing, goes right into the fact that it is globally distributed, real-time information delivery system and how the world has come to depend on it, not just to share links and silly jokes but how it affects lives in real ways. Whenever Facebook crashes for any amount of time people also go crazy. Headlines for that usually read something like, 'Facebook down, birthdays/anniversaries/parties cease to exist!' Apparently since people can't tell, post, like, share or otherwise bullhorn their important events, it doesn't actually occur. 'OMG! How am I gonna invite people to my bash in two weeks without social media?!? My life is over!' Um, paper, envelopes, stamps anyone? We have connected wrist bracelets keeping track of our body movements, connected glasses recording every move, connected thermostats measuring home environments and pretty much any other 'thing' that you want to monitor, keep track of or measure. From banking to buying, to educating to learning, to connecting to sharing and everything in between, our lives now rely on applications so much so, that when an application is unavailable, our lives get jolted. Or, we pause our lives for the moment until we can access that application. As if we couldn't go on without it. My, how application availability has become critical to our daily lives. I think The Internet of Things will soon become The Internet of Nouns since every person, place or thing will be connected. I like that. I call 'The Internet of Nouns' as our next frontier! Sorry adverbs, love ya but you're not connected. ps Related Does Social Media Reflect Society? The Icebox Cometh The Top 10, Top 10 Predictions for 2014 The Internet of Things and DNS Delivering the Internet of Things Technorati Tags: apps,applications,social media,life,availability,twitter,facebook,society,humans,people,silva,f5,iot Connect with Peter: Connect with F5:359Views0likes0CommentsWhy Layer 7 Load Balancing Doesn’t Suck
Alternative title: Didn’t We Resolve This One 10 Years Ago? There’s always been a bit of a disconnect between traditional network-focused ops and more modern, application-focused ops when it comes to infrastructure architecture. The arguments for an against layer 7 (application) load balancing first appeared nearly a decade ago when the concept was introduced. These arguments were being tossed around at the same time we were all arguing for or against various QoS (Quality of Service) technologies, namely the now infamous “rate shaping” versus “queuing” debate. If you missed that one, well, imagine an argument similar to that today of “public” versus “private” clouds. Same goals, different focus on how to get there. Seeing this one crop up again is not really any more surprising than seeing some of the other old debates bubbling to the surface. cloud computing and virtualization have brought network infrastructure and its capabilities, advantages and disadvantages – as well as its role in architecting a dynamic data center – to the fore again. But seriously? You’d think the arguments would have evolved in the past ten years. While load balancing hardware marketing execs get very excited about the fact that their product can magically scale your application by using amazing Layer 7 technology in the Load balancer such as cookie inserts and tracking/re-writing. What they fail to mention is that any application that requires the load balancer to keep track of session related information within the communications stream can never ever be scalable or reliable. -- Why Layer 7 load balancing sucks… I myself have never shied away from mentioning the session management capabilities of a full-proxy application delivery controller (ADC) and, in fact, I have spent much time advocating on behalf of its benefits to architectural flexibility, scalability, and security. Second, argument by selective observation is a poor basis upon which to make any argument, particularly this one. While persistence-based load balancing is indeed one of the scenarios in which an advanced application delivery controller is often used (and in some environments, such as scaling out VDI deployments, is a requirement) this is a fairly rudimentary task assigned to such devices. The use of layer 7 routing, aka page routing in some circles (such as Facebook), is a highly desirable capability to have at your disposable. It enables more flexible, scalable architectures by creating scalability domains based on application-specific functions. There are a plethora of architectural patterns that leverage the use of an intelligent, application-aware intermediary (i.e. layer 7 load balancing capable ADC). Here’s a few I’ve written myself, but rest assured there are many more examples of infrastructure scalability patterns out there on the Internets: Infrastructure Architecture: Avoiding a Technical Ambush Infrastructure Scalability Pattern: Partition by Function or Type Applying Scalability Patterns to Infrastructure Architecture Infrastructure Scalability Pattern: Sharding Streams Interestingly, ten years ago this argument may have held some water. This was before full-proxy application delivery controllers were natively able to extract the necessary HTTP headers (where cookies are exchanged). That means in the past, such devices had to laboriously find and extract the cookie and its value from the textual representation of the headers, which obviously took time and processing cycles (i.e. added latency). But that’s not been the case since oh, 2004-2005 when such capabilities were recognized as being a requirement for most organizations and were moved into native processing, which reduced the impact of such extraction and evaluation to a negligible, i.e. trivial, amount of delay and effectively removed as an objection this argument. THE SECURITY FACTOR Both this technique and trivial tasks like tracking and re-writing do, as pointed out, require session tracking – TCP session tracking, to be exact. Interestingly, it turns out that this is a Very Good Idea TM from a security perspective as it provides a layer of protection against DDoS attacks at lower levels of the stack and enables an application-aware ADC to more effectively thwart application-layer DDoS attacks, such as SlowLoris and HTTP GET floods. A simple, layer 4 load balancing solution (one that ignores session tracking, as is implied by the author) can neither recognize nor defend against such attacks because they maintain no sense of state. Ultimately this means the application and/or web server is at high risk of being overwhelmed by even modest attacks, because these servers are incapable of scaling sessions at the magnitude required to sustain availability under such conditions. This is true in general of high volumes of traffic or under overwhelming loads due to processor-intense workloads. A full-proxy device mitigates many of the issues associated with concurrency and throughput simply by virtue of its dual-stacked nature. SCALABILITY As far as scalability goes, really? This is such an old and inaccurate argument (and put forth with no data, to boot) that I’m not sure it’s worth presenting a counter argument. Many thousands of customers use application delivery controllers to perform layer 7 load balancing specifically for availability assurance. Terminating sessions does require session management (see above), but to claim that this fact is a detriment to availability and business continuity shows a decided lack of understanding of failover mechanisms that provide near stateful-failover (true stateful-failover is impossible at any layer). The claim that such mechanisms require network bandwidth indicates either a purposeful ignorance with respect to modern failover mechanisms or simply failure to investigate. We’ve come a long way, baby, since then. In fact, it is nigh unto impossible for a simple layer 4 load balancing mechanism to provide the level of availability and consistency claimed, because such an architecture is incapable of monitoring the entire application (which consists of all instances of the application residing on app/web servers, regardless of form-factor). See axiom #3 (Context-Aware) in “The Three Axioms of Application Delivery”. The Case (For & Against) Network-Driven Scalability in Cloud Computing Environments The Case (For & Against) Management-Driven Scalability in Cloud Computing Environments The Case (For & Against) Application-Driven Scalability in Cloud Computing Environments Resolution to the Case (For & Against) X-Driven Scalability in Cloud Computing Environments The author’s claim seems to rest almost entirely on the argument “Google does layer 4 not layer 7” to which I say, “So?” If you’re going to tell part of the story, tell the entire story. Google also does not use a traditional three-tiered approach to application architecture, nor do its most demanding services (search) require state, nor does it lack the capacity necessary to thwart attacks (which cannot be said for most organizations on the planet). There is a big difference between modern, stateless applications (and many benefits) and traditional three-tiered application architectures. Now it may in fact be the case that regardless of architecture an application and its supporting infrastructure do not require layer 7 load balancing services. In that case, absolutely – go with layer 4. But to claim layer 7 load balancing is not scalable, or resilient, or high-performance enough when there is plenty of industry-wide evidence to prove otherwise is simply a case of not wanting to recognize it.669Views0likes0CommentsThe Real News is Not that Facebook Serves Up 1 Trillion Pages a Month…
It’s how much load that really generates and how it scales to meet the challenge. There’s some amount of debate whether Facebook really crossed over the one trillion page view per month threshold. While one report says it did, another respected firm says it did not; that its monthly page views are a mere 467 billion per month. In the big scheme of things, the discrepancy is somewhat irrelevant, as neither show the true load on Facebook’s infrastructure – which is far more impressive a set of numbers than its externally measured “page view” metric. Mashable reported in “Facebook Surpasses 1 Trillion Pageviews per Month” that the social networking giant saw “approximately 870 million unique visitors in June and 860 million in July” and followed up with some per visitor statistics, indicating “each visitor averaged approximately 1,160 page views in July and 40 per visit — enormous by any standard. Time spent on the site was around 25 minutes per user.” From an architectural standpoint it’s not just about the page views. It’s about requests and responses, many of which occur under the radar from metrics and measurements typically gathered by external services like Google. Much of Facebook’s interactive features are powered by AJAX, which is hidden “in” the page and thus obscured from external view and a “page view” doesn’t necessarily include a count of all the external objects (scripts, images, etc…) that comprises a “page”. So while 1 trillion (or 467 billion, whichever you prefer) is impressive, consider that this is likely only a fraction of the actual requests and responses handled by Facebook’s massive infrastructure on any given day. Let’s examine what the actual requests and responses might mean in terms of load on Facebook’s infrastructure, shall we? SOME QUICK MATH Loading up Facebook yields 125 requests to load various scripts, images, and content. That’s a “page view”. Sitting on the page for a few minutes and watching Firebug’s console, you’ll note a request to update content occurs approximately every minute you are on a page. If we do the math – based on approximate page views per visitor, each of which incurs 125 GET requests – we can math that up to an approximation of 19,468 RPS (Requests per Second). That’s only an approximation, mind you, and doesn’t take into consideration the time factor, which also incurs AJAX-based requests to update content occurring on a fairly regular basis. These also add to the overall load on Facebook’s massive infrastructure. And that’s before we start considering the impact from “unseen” integrated traffic via Facebook’s API which, according to the most recently available data (2009) was adding 5 billion requests a day to that load. If you’re wondering, that’s an additional 57,870 requests per second, which gives us a more complete number of 77,338 requests per second. SOURCE: 2009 Interop F5 Keynote Let’s take a moment to digest that, because that’s a lot of load on a site – and I’m sure it still isn’t taking into consideration everything. We also have to remember that the load at any given time could be higher – or lower – based on usage patterns. Averaging totals over a month and distilling down to a per second average is just that – a mathematical average. It doesn’t take into consideration that peaks and valleys occur in usage throughout the day and that Facebook may be averaging only a fraction of that load with spikes two and three times as high throughout the day. That realization should be a bit sobering, as we’ve seen recent DDoS attacks that have crippled and even toppled sites with less traffic than Facebook handles in any given minute of the day. The question is, how do they do it? How do they manage to keep the service up and available despite the overwhelming load and certainty of traffic spikes? IT’S the ARCHITECTURE Facebook itself does a great job of discussing exactly how it manages to sustain such load over time while simultaneously managing growth, and its secret generally revolves around architectural choices. Not just the “Facebook” application architecture, but its use of infrastructure architecture as well. That may not always be apparent from Facebook’s engineering blog, which generally focuses on application and software architecture topics, but it is inherent in those architectural decisions. Take, for example, an engineer’s discussion on Facebook’s secrets to scaling to over 500 million users and beyond. The very first point made is to “scale horizontally”. This isn't at all novel but it's really important. If something is increasing exponentially, the only sensible way to deal with it is to get it spread across arbitrarily many machines. Remember, there are only three numbers in computer science: 0, 1, and n. (Scaling Facebook to 500 Million Users and Beyond (Facebook Engineering Blog)) Horizontal scalability is, of course, enabled via load balancing which generally (but not always) implies infrastructure components that are critical to an overall growth and scalability strategy. The abstraction afforded by the use of load balancing services also has the added benefit of enabling agile operations as it becomes cost and time effective to add and remove (provision and decommission) compute resources as a means to meet scaling challenges on-demand, which is a key component of cloud computing models. In other words, in addition to Facebook’s attention to application architecture as a means to enable scalability, it also takes advantage of infrastructure components providing load balancing services to ensure that its massive load is distributed not just geographically but efficiently across its various clusters of application functionality. It’s a collaborative architecture that spans infrastructure and application tiers, taking advantage of the speed and scalability benefits afforded by both approaches simultaneously. Yet Facebook is not shy about revealing its use of infrastructure as a means to scale and implement its architecture; you just have to dig around to find it. Consider as an example of a collaborative architecture the solution to some of the challenges Facebook has faced trying to scale out its database, particularly in the area of synchronization across data centers. This is a typical enterprise challenge made even more difficult by Facebook’s decision to separate “write” databases from “read” to enhance the scalability of its application architecture. The solution is found in something Facebook engineers call “Page Routing” but most of us in the industry call “Layer 7 Switching” or “Application Switching”: The problem thus boiled down to, when a user makes a request for a page, how do we decide if it is "safe" to send to Virginia or if it must be routed to California? This question turned out to have a relatively straightforward answer. One of the first servers a user request to Facebook hits is called a Load balancer; this machine's primary responsibility is picking a web server to handle the request but it also serves a number of other purposes: protecting against denial of service attacks and multiplexing user connections to name a few. This load balancer has the capability to run in Layer 7 mode where it can examine the URI a user is requesting and make routing decisions based on that information. This feature meant it was easy to tell the load balancer about our "safe" pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location. (Scaling Out (Facebook Engineering Blog)) That’s the hallmark of the modern, agile data center and the core of cloud computing models: collaborative, dynamic infrastructure and applications leveraging technology to enable a cost-efficient, scalable architectures able to maintain growth along with the business. SCALABILITY TODAY REQUIRES a COMPREHENSIVE ARCHITECTURAL STRATEGY Today’s architectures – both application and infrastructure – are growing necessarily complex to meet the explosive growth of a variety of media and consumers. Applications alone cannot scale themselves out – there simply aren’t physical machines large enough to support the massive number of users and load on applications created by the nearly insatiable demand consumers have for online games, shopping, interaction, and news. Modern applications must be deployed and delivered collaboratively with infrastructure if they are to scale and support growth in an operationally and financially efficient manner. Facebook’s ability to grow and scale along with demand is enabled by its holistic, architectural approach that leverages both modern application scalability patterns as well as infrastructure scalability patterns. Together, infrastructure and applications are enabling the social networking giant to continue to grow steadily with very few hiccups along the way. Its approach is one that is well-suited for any organization wishing to scale efficiently over time with the least amount of disruption and with the speed of deployment required of today’s demanding business environments. Facebook Hits One Trillion Page Views? Nope. Facebook Surpasses 1 Trillion Pageviews per Month Scaling Out (Facebook Engineering Blog) Scaling Facebook to 500 Million Users and Beyond (Facebook Engineering Blog) WILS: Content (Application) Switching is like VLANs for HTTP Layer 7 Switching + Load Balancing = Layer 7 Load Balancing Infrastructure Scalability Pattern: Partition by Function or Type Infrastructure Scalability Pattern: Sharding Sessions Architecturally, Is There Such A Thing As Too Scalable? Forget Hyper-Scale. Think Hyper-Local Scale.210Views0likes0CommentsThe Level of Uptime – Increasing Pressure Syndrome
When I was earning my bachelors, I joined the Association for Computing Machinery (ACM) and through them, several special interest groups. One of those groups was SIGRISK, which focused on high-risk software engineering. At the time the focus was on complex systems whose loss was irretrievable – like satellite guidance systems or deep sea locomotion systems – and those whose failure could result in death or injury to individuals – like power plant operations systems, medical equipment, and traffic light systems. The approach to engineering these controls was rigorous, more rigorous than most IT staff would consider reasonable. And the reason was simple. As we’ve seen since – more than once - a space ship that has one line of bad code can end up veering off course and never returning, the data it collects completely different than that which it was designed to collect. Traffic lights that are mis-programmed offer a best-case of traffic snarls and people late for whatever they were doing, and a worst case of fatal accidents. These systems were categorized by the ACM as high-risk, and special processes were put in place to handle high-risk software development (interestingly, processes that don’t appear to have been followed by space agencies – who did a lot of the writing and presenting for SIGRISK when I was a member). Interestingly, SIGRISK no longer shows up on the list of SIGs at the ACM website. It is the only SIG I belonged to that seems to have gone away or been merged into something else. What interests me in all of this is a simple truth that I’ve noticed of late. We are now making the same expectations of everyday software that we made of these over-engineered systems designed to function even in the face of complete failure. And they’re not designed for this level of protection. Think about it a bit, critical medical systems can be locked down so that the only interface is the operator’s interface, and upgrades are only allowed with a specific hardware key, things being launched into space don’t require serious protection from hackers, they’re a way out of reach, traffic lights have been hacked, but they’re not easy, and the public nature of the interfaces makes it difficult to pull off in busy times… But Facebook and Microsoft? They have massive interfaces, global connectedness, and by definition IT staff tweaking them constantly. Configuration, new features, uncensored third party development… The mind spins. Ariane 5 courtesy of SpaceFlightNow.com Makes me wonder if Apple (and to a lesser extent RIM) wasn’t smart to lock down development. RIM has long had a “you have a key for all of your apps, if you want to touch protected APIs, your app will have your key, and if you are a bad kid and crash our phones, we’ll shut off your key”. Okay, that last bit might be assumed, it’s been a while since I read the agreement (I’ve got a RIM dev license), but that was the impression I was left with three years ago when I read through the documentation. Apple took a lot of grief for their policies, but seriously, they want their phone to work. Note that Microsoft often gets blamed for problems caused by “rogue” applications. But it doesn’t address the issue of software stability in a highly exposed, highly dynamic environment. We’re putting pressures on IT folks – who are already under time pressure – that used to be reserved for scientists in laboratories. And we’re expecting them to get it right with an impatience more indicative of a two year old than a pool of adults. Every time a big vendor has a crash or a security breech, we act like they’re idiots. Truth is that they have highly complex systems that are exposed to both inexperienced users and experienced hackers, and we don’t give them the years of development time that critical systems get. So what’s my point? When you’re making demands of your staff, yes, business needs and market timing are important, but give them time to do their job right, or don’t complain about the results. And in an increasingly connected enterprise, don’t assume that some back-office corner piece of software/hardware is less critical than user-facing systems. After all, the bug that bit Microsoft not too long ago was a misconfiguration in lab systems. I’ve worked in a test lab before, and they’re highly volatile. When big tests are going on, the rest of the architecture can change frequently while things are pulled in and returned from the big test, complete wipe and reconfigure is common – from switches to servers – and security was considered less important than delivering test results. And the media attention lavished on the Facebook outage in September is enough that you’d think people had died from the failure… Which was caused by a software configuration change. www.TechCrunch.com graphic of Facebook downtime Nice and easy. Don’t demand more than can be delivered, or you’ll get sloppy work, both in App Dev and in Systems Management. Use process to double-check everything, making sure that it is right. Better to take an extra day or even ten than to find your application down and people unable to do anything. Because while Microsoft and Facebook can apologize and move on, internal IT rarely gets off that easily. Automation tools like those presented by the Infrastructure 2.0 crowd (Lori is one of them) can help a lot, but in the end, people are making changes, even if they’re making them through a push button on a browser… Make sure you’ve got a plan to make it go right, and an understanding of how you’ll react if it doesn’t. And the newly coined “DevOps” hype-word might be helpful too – where Dev meets Operations is a good place to start building in those checks.158Views0likes0CommentsWARNING: Security Device Enclosed
If you aren’t using all the security tools at your disposal you’re doing it wrong. How many times have you seen an employee wave on by a customer when the “security device enclosed” in some item – be it DVD, CD, or clothing – sets off the alarm at the doors? Just a few weeks ago I heard one young lady explain the alarm away with “it must have be the CD I bought at the last place I was at…” This apparently satisfied the young man at the doors who nodded and turned back to whatever he’d been doing. All the data the security guy needed to make a determination was there; he had all the context necessary in which to analyze the situation and make a determination based upon that information. But he ignored it all. He failed to leverage all the tools at his disposal and potentially allowed dollars to walk out the door. In doing so he also set a precedent and unintentionally sent a message to anyone who really wanted to commit a theft: I ignore warning signs, go ahead.1.7KViews0likes2Comments