Intro to Load Balancing for Developers – The Algorithms
If you’re new to this series, you can find the complete list of articles in the series on my personal page here If you are writing applications to sit behind a Load Balancer, it behooves you to at least have a clue what the algorithm your load balancer uses is about. We’re taking this week’s installment to just chat about the most common algorithms and give a plain- programmer description of how they work. While historically the algorithm chosen is both beyond the developers’ control, you’re the one that has to deal with performance problems, so you should know what is happening in the application’s ecosystem, not just in the application. Anything that can slow your application down or introduce errors is something worth having reviewed. For algorithms supported by the BIG-IP, the text here is paraphrased/modified versions of the help text associated with the Pool Member tab of the BIG-IP UI. If they wrote a good description and all I needed to do was programmer-ize it, then I used it. For algorithms not supported by the BIG-IP I wrote from scratch. Note that there are many, many more algorithms out there, but as you read through here you’ll see why these (or minor variants of them) are the ones you’ll see the most. Plain Programmer Description: Is not intended to say anything about the way any particular dev team at F5 or any other company writes these algorithms, they’re just an attempt to put the process into terms that are easier for someone with a programming background to understand. Hopefully a successful attempt. Interestingly enough, I’ve pared down what BIG-IP supports to a subset. That means that F5 employees and aficionados will be going “But you didn’t mention…!” and non-F5 employees will likely say “But there’s the Chi-Squared Algorithm…!” (no, chi-squared is theoretical distribution method I know of because it was presented as a proof for testing the randomness of a 20 sided die, ages ago in Dragon Magazine). The point being that I tried to stick to a group that builds on each other in some connected fashion. So send me hate mail… I’m good. Unless you can say more than 2-5% of the world’s load balancers are running the algorithm, I won’t consider that I missed something important. The point is to give developers and software architects a familiarity with core algorithms, not to build the worlds most complete lexicon of algorithms. Random: This load balancing method randomly distributes load across the servers available, picking one via random number generation and sending the current connection to it. While it is available on many load balancing products, its usefulness is questionable except where uptime is concerned – and then only if you detect down machines. Plain Programmer Description: The system builds an array of Servers being load balanced, and uses the random number generator to determine who gets the next connection… Far from an elegant solution, and most often found in large software packages that have thrown load balancing in as a feature. Round Robin: Round Robin passes each new connection request to the next server in line, eventually distributing connections evenly across the array of machines being load balanced. Round Robin works well in most configurations, but could be better if the equipment that you are load balancing is not roughly equal in processing speed, connection speed, and/or memory. Plain Programmer Description: The system builds a standard circular queue and walks through it, sending one request to each machine before getting to the start of the queue and doing it again. While I’ve never seen the code (or actual load balancer code for any of these for that matter), we’ve all written this queue with the modulus function before. In school if nowhere else. Weighted Round Robin (called Ratio on the BIG-IP): With this method, the number of connections that each machine receives over time is proportionate to a ratio weight you define for each machine. This is an improvement over Round Robin because you can say “Machine 3 can handle 2x the load of machines 1 and 2”, and the load balancer will send two requests to machine #3 for each request to the others. Plain Programmer Description: The simplest way to explain for this one is that the system makes multiple entries in the Round Robin circular queue for servers with larger ratios. So if you set ratios at 3:2:1:1 for your four servers, that’s what the queue would look like – 3 entries for the first server, two for the second, one each for the third and fourth. In this version, the weights are set when the load balancing is configured for your application and never change, so the system will just keep looping through that circular queue. Different vendors use different weighting systems – whole numbers, decimals that must total 1.0 (100%), etc. but this is an implementation detail, they all end up in a circular queue style layout with more entries for larger ratings. Dynamic Round Robin (Called Dynamic Ratio on the BIG-IP): is similar to Weighted Round Robin, however, weights are based on continuous monitoring of the servers and are therefore continually changing. This is a dynamic load balancing method, distributing connections based on various aspects of real-time server performance analysis, such as the current number of connections per node or the fastest node response time. This Application Delivery Controller method is rarely available in a simple load balancer. Plain Programmer Description: If you think of Weighted Round Robin where the circular queue is rebuilt with new (dynamic) weights whenever it has been fully traversed, you’ll be dead-on. Fastest: The Fastest method passes a new connection based on the fastest response time of all servers. This method may be particularly useful in environments where servers are distributed across different logical networks. On the BIG-IP, only servers that are active will be selected. Plain Programmer Description: The load balancer looks at the response time of each attached server and chooses the one with the best response time. This is pretty straight-forward, but can lead to congestion because response time right now won’t necessarily be response time in 1 second or two seconds. Since connections are generally going through the load balancer, this algorithm is a lot easier to implement than you might think, as long as the numbers are kept up to date whenever a response comes through. These next three I use the BIG-IP name for. They are variants of a generalized algorithm sometimes called Long Term Resource Monitoring. Least Connections: With this method, the system passes a new connection to the server that has the least number of current connections. Least Connections methods work best in environments where the servers or other equipment you are load balancing have similar capabilities. This is a dynamic load balancing method, distributing connections based on various aspects of real-time server performance analysis, such as the current number of connections per node or the fastest node response time. This Application Delivery Controller method is rarely available in a simple load balancer. Plain Programmer Description: This algorithm just keeps track of the number of connections attached to each server, and selects the one with the smallest number to receive the connection. Like fastest, this can cause congestion when the connections are all of different durations – like if one is loading a plain HTML page and another is running a JSP with a ton of database lookups. Connection counting just doesn’t account for that scenario very well. Observed: The Observed method uses a combination of the logic used in the Least Connections and Fastest algorithms to load balance connections to servers being load-balanced. With this method, servers are ranked based on a combination of the number of current connections and the response time. Servers that have a better balance of fewest connections and fastest response time receive a greater proportion of the connections. This Application Delivery Controller method is rarely available in a simple load balancer. Plain Programmer Description: This algorithm tries to merge Fastest and Least Connections, which does make it more appealing than either one of the above than alone. In this case, an array is built with the information indicated (how weighting is done will vary, and I don’t know even for F5, let alone our competitors), and the element with the highest value is chosen to receive the connection. This somewhat counters the weaknesses of both of the original algorithms, but does not account for when a server is about to be overloaded – like when three requests to that query-heavy JSP have just been submitted, but not yet hit the heavy work. Predictive: The Predictive method uses the ranking method used by the Observed method, however, with the Predictive method, the system analyzes the trend of the ranking over time, determining whether a servers performance is currently improving or declining. The servers in the specified pool with better performance rankings that are currently improving, rather than declining, receive a higher proportion of the connections. The Predictive methods work well in any environment. This Application Delivery Controller method is rarely available in a simple load balancer. Plain Programmer Description: This method attempts to fix the one problem with Observed by watching what is happening with the server. If its response time has started going down, it is less likely to receive the packet. Again, no idea what the weightings are, but an array is built and the most desirable is chosen. You can see with some of these algorithms that persistent connections would cause problems. Like Round Robin, if the connections persist to a server for as long as the user session is working, some servers will build a backlog of persistent connections that slow their response time. The Long Term Resource Monitoring algorithms are the best choice if you have a significant number of persistent connections. Fastest works okay in this scenario also if you don’t have access to any of the dynamic solutions. That’s it for this week, next week we’ll start talking specifically about Application Delivery Controllers and what they offer – which is a whole lot – that can help your application in a variety of ways. Until then! Don.21KViews1like9CommentsDoes Cloud Solve or Increase the 'Four Pillars' Problem?
It has long been said – often by this author – that there are four pillars to application performance: Memory CPU Network Storage As soon as you resolve one in response to application response times, another becomes the bottleneck, even if you are not hitting that bottleneck yet. For a bit more detail, they are “memory consumption” – because this impacts swapping in modern Operating Systems. “CPU utilization” – because regardless of OS, there is a magic line after which performance degrades radically. “Network throughput” – because applications have to communicate over the network, and blocking or not (almost all coding for networks today is), the information requested over the network is necessary and will eventually block code from continuing to execute. “Storage” – because IOPS matter when writing/reading to/from disk (or the OS swaps memory out/back in). These four have long been relatively easy to track. The relationship is pretty easy to spot, when you resolve one problem, one of the others becomes the “most dangerous” to application performance. But historically, you’ve always had access to the hardware. Even in highly virtualized environments, these items could be considered both at the Host and Guest level – because both individual VMs and the entire system matter. When moving to the cloud, the four pillars become much less manageable. The amount “much less” implies depends a lot upon your cloud provider, and how you define “cloud”. Put in simple terms, if you are suddenly struck blind, that does not change what’s in front of you, only your ability to perceive it. In the PaaS world, you have only the tools the provider offers to measure these things, and are urged not to think of the impact that host machines may have on your app. But they do have an impact. In an IaaS world you have somewhat more insight, but as others have pointed out, less control than in your datacenter. Picture Courtesy of Stanley Rabinowitz, Math Pro Press. In the SaaS world, assuming you include that in “cloud”, you have zero control and very little insight. If you app is not performing, you’ll have to talk to the vendors’ staff to (hopefully) get them to resolve issues. But is the problem any worse in the cloud than in the datacenter? I would have to argue no. Your ability to touch and feel the bits is reduced, but the actual problems are not. In a pureplay public cloud deployment, the performance of an application is heavily dependent upon your vendor, but the top-tier vendors (Amazon springs to mind) can spin up copies as needed to reduce workload. This is not a far cry from one common performance trick used in highly virtualized environments – bring up another VM on another server and add them to load balancing. If the app is poorly designed, the net result is not that you’re buying servers to host instances, it is instead that you’re buying instances directly. This has implications for IT. The reduced up-front cost of using an inefficient app – no matter which of the four pillars it is inefficient in – means that IT shops are more likely to tolerate inefficiency, even though in the long run the cost of paying monthly may be far more than the cost of purchasing a new server was, simply because the budget pain is reduced. There are a lot of companies out there offering information about cloud deployments that can help you to see if you feel blind. Fair disclosure, F5 is one of them, I work for F5. That’s all you’re going to hear on that topic in this blog. While knowing does not always directly correlate to taking action, and there is some information that only the cloud provider could offer you, knowing where performance bottlenecks are does at least give some level of decision-making back to IT staff. If an application is performing poorly, looking into what appears to be happening (you can tell network bandwidth, VM CPU usage, VM IOPS, etc, but not what’s happening on the physical hardware) can inform decision-making about how to contain the OpEx costs of cloud. Internal cloud is a much easier play, you still have access to all the information you had before cloud came along, and generally the investigation is similar to that used in a highly virtualized environment. From a troubleshooting performance problems perspective, it’s much the same. The key with both virtualization and internal (private) clouds is that you’re aiming for maximum utilization of resources, so you will have to watch for the bottlenecks more closely – you’re “closer to the edge” of performance problems, because you designed it that way. A comprehensive logging and monitoring environment can go a long way in all cloud and virtualization environments to keeping on top of issues that crop up – particularly in a large datacenter with many apps running. And developer education on how not to be a resource hog is helpful for internally developed apps. For externally developed apps the best you can do is ask for sizing information and then test their assumptions before buying. Sometimes, cloud simply is the right choice. If network bandwidth is the prime limiting factor, and your organization can accept the perceived security/compliance risks, for example, the cloud is an easy solution – bandwidth in the cloud is either not limited, or limited by your willingness to write a monthly check to cover usage. Either way, it’s not an Internet connection upgrade, which can be dastardly expensive not just at install, but month after month. Keep rocking it. Get the visibility you need, don’t worry about what you don’t need. Related Articles and Blogs: Don MacVittie - Load Balancing For Developers Advanced Load Balancing For Developers. The Network Dev Tool Load Balancers for Developers – ADCs Wan Optimization ... Intro to Load Balancing for Developers – How they work Intro to Load Balancing for Developers – The Gotchas Intro to Load Balancing for Developers – The Algorithms Load Balancing For Developers: Security and TCP Optimizations Advanced Load Balancers for Developers: ADCs - The Code Advanced Load Balancing For Developers: Virtual Benefits Don MacVittie - ADCs for Developers Devops Proverb: Process Practice Makes Perfect Devops is Not All About Automation 1024 Words: Why Devops is Hard Will DevOps Fork? DevOps. It's in the Culture, Not Tech. Lori MacVittie - Development and General Devops: Controlling Application Release Cycles to Avoid the ... An Aristotlean Approach to Devops and Infrastructure Integration How to Build a Silo Faster: Not Enough Ops in your Devops233Views0likes0CommentsNew Communications = Multiplexification
I wrote a good while back about the need to translate all the various storage protocols into one that could take root and simplify the lives of IT. None of the ones currently being hawked seem to be making huge inroads in the datacenter, all have some uses, none is unifying. Those peddling the latest, greatest thing of course want to sell you on their protocol because they hope to be The One, but it’s not about selling, it’s about useful. At the time FCoE was the new thing. I don’t get much chance to follow storage like I used to, but I haven’t heard of anything new since the furor over FCoE started to calm down, so presume the market is still sitting there, with NAS split between two, and block storage split between many. There is a similar fragmentation trend going on in networking at the moment too. There have always been a zillion transport standards, and as long as the upper layers can be uniform, working out how to fit your cool new satellite link into Ethernet is a simple problem from the IT perspective. Either the vendor solves the issue or they fail due to lack of usefulness. But higher layers are starting to see fragmentation. In the form of SPDY, Speed + mobility, etc. In both of these cases, HTTP is being supplanted by something that requires configuration differences and is not universally supported by clients. And yet the benefits are such that IT is paying attention. IPv6 is causing similar issues at the lower layers, and it is worth mentioning here for a reason. The key, as Lori and I have both written, is that IT cannot afford to rework everything at once to support these new standards, but feels an imperative (for IP address space from IPv6, for web app performance for the http layer changes) to implement them whenever possible. The best solution to these problems – where upgrading has its costs and failing to upgrade has other costs – is to implement a gateway. F5s IPv6 Gateway is one solution (other vendors have them too - I’ll talk about the one I know here, but assume it applies to the others and verify that with your vendor) that’s easy to talk about because it is being utilized in IT shops to do just that. With the gateway implemented, sitting in front of your DC, it translates from IPv6 to IPv4, meaning that the datacenter can be converted at a sane pace, and support for IPv4 is not a separate stack that must be maintained while client adoption catches up. If a connection comes in to the gateway, if it is IPv4 and the server speaks IPv4, the connection is passed through. The same occurs if both client and server support IPv6. If the client and server have a mismatch, the gateway translates between them. That means you get support the day a gateway is deployed, and over time can transfer your systems while maintaining support for all clients. This type of solution works handily for protocols like SPDY too – offering the ability to say a server supports SPDY when in fact it doesn’t, the gateway does and translates between SPDY and HTTP. Deploying a SPDY gateway gives instant SPDY support to web (and application) servers behind the gateway, buying IT time to reconfigure those web servers to actually support SPDY. SPDY accelerates everything on the client side, and http is only used on the faster server side where the network is dedicated. Faster has an asterisk by it though. What if the app or web server is at a remote site? You’re going right back out onto the Internet and using HTTP unoptimized. In those cases – and other cases where network response time is slow - something is needed on the backend to keep those performance gains without finding the next bottleneck as soon as the SPDY gateway is deployed. F5 uses several technologies to improve backend communications performance, and other vendors have similar solutions (though ours are better – biased though I may be). For F5’s part, secure tunnels, WAN optimization, and a very relevant feature of BIG-IP LTM called OneConnect all work together to minimize backend traffic. OneConnect is a cool little feature that minimizes the connections from the BIG-IP to the backend server by pooling and reusing them. This process does several things, but importantly, it takes setup and teardown time for connections out of the picture. So if a (non-SPDY) client makes four connections to get its data, the BIG-IP merges them with other requests to the same server and essentially multiplexes them. Funny thing is, this is one of the features of SPDY on the other side, with the primary difference that SPDY is client focused (merges connections from the client), and OneConnect is server focused (merges connections to the server). The client side is “all connections from this client”, while the server side is “all connections to this server (regardless of client)”, but otherwise they are very similar. This enters interesting territory, because now we’re essentially multi-multi-plexing. But we’re not. Here’s a simple diagram utilizing only a couple of clients and generic server/application farm to try and show the sequence of events: 1. SPDY comes into a gateway as a single stream from the client 2. The gateway translates into HTTP’s multiple streams 3. BIG-IP identifies the server the request is for 4. If a connection exists to the server, BIG-IP passes the request through the existing connection 5. When responses are sent, this process is handled in reverse. Responses come in over OneConnect and go out SPDY encoded. There is only a brief period of time where native HTTP is being communicated, and presumably the SPDY gateway and the BIG-IP are in very close proximity. The result is application communications that are optimized end-to-end, but the only changes to your application architecture are configuring the SPDY Gateway and OneConnect. Not too bad for a problem that normally requires modification of each web and application servers that will support SPDY. As alluded to above, if the application servers are remote from the SPDY Gateway, the benefits are even more pronounced, just due to latency on the back end. All the benefits of both SPDY and OneConnect, and you will be done before lunch. Far better than loading modules into every webserver or upgrading every app server. Alternatively, you could continue to support only HTTP, but watching the list of clients that transparently support SPDY, the net result of doing so is very likely to be that customers gravitate to your competitors whose websites seem to be faster. The Four V’s of Big Data The “All of the Above” Approach to Improving Application Performance Google SPDY Accelerates Mobile Web192Views0likes0CommentsAdvanced Load Balancing For Developers. The Network Dev Tool
It has been a while since I wrote an installment of Load Balancing for Developers, and now I think it has been too long, but never fear, this is the grad-daddy of Load Balancing for Developers blogs, covering a useful bit of information about Application Delivery Controllers that you might want to take advantage of. For those who have joined us since my last installment, feel free to check out the entire list of blog entries (along with related blog entries) here, though I assure you that this installment, like most of the others, does not require you to have read those that went before. ZapNGo! Is still a growing enterprise, now with several dozen complex applications and a high availability architecture that spans datacenters and the cloud. While the organization relies upon its web properties to generate revenue, those properties have been going along fine with your Application Delivery Controller (ADC) architecture. Now though, you’re seeing a need to centralize administration of a whole lot of functions. What worked fine separately for one or two applications is no longer working so well now that you have several development teams and several dozen applications, and you need to find a way to bring the growing inter-relationships under control before maintenance and hidden dependencies swamp you in a cascading mess of disruption. With maintenance taking a growing portion of your application development manhours, and a reasonably well positioned test environment configured with a virtual ADC to mimic your production environment, all you need now is a way to cut those maintenance manhours and reduce the amount of repetitive work required to create or update an application. Particularly update an application, because that is a constant problem, where creating is less frequent. With many of the threats that your ZapNGo application will be known as ZapNGone eliminated, now it is efficiencies you are after. And believe it or not, these too are available in an ADC. Not all ADC’s are created equal, but this discussion will stay on topics that most ADCs can handle, and I’ll mention it when I stray from generic into specific – which I will do in one case because only one vendor supports one of the tools you can use, but all of the others should be supported by whatever ADC vendor you have, though as always, check with your vendor directly first, since I’m not an expert in the inner workings of every one. There is a lot that many organizations do for themselves, and the array of possibilities is long – from implementing load balancing in source code to security checks in the application, the boundaries of what is expected of developers are shaped by an organization, its history, and its chosen future direction. At ZapNGo, the team has implemented a virtual test environment that as close as possible mirrors production, so that code can be implemented and tested in the way it will be used. They use an ADC for load balancing, so that they don’t have to rewrite the same code over and over, and they have a policy of utilizing a familiar subset of ADC functionality on all applications that face the public. The company is successful and growing, but as always happens in companies in that situation, the pressures upon them are changing just by virtue of their growth. There are more new people who don’t yet have intimate knowledge of the code base, network topology, security policies, whatever their area of expertise is. There are more lines of code to maintain, while new projects are being brought up at a more rapid pace and with higher priorities (I’ve twice lived through the “Everything is high priority? Well this is highest priority!” syndrome while working in IT. Thankfully, most companies grow out of that fast when it’s pointed out that if everything is priority #1, nothing is). Timelines to complete projects – be they new development, bug fixes, or enhancements are stretching longer and longer as the percentage of gurus in the company is down and the complexity of the code and the architecture it runs on is up. So what is a development manager to do to increase productivity? Teaming newer developers with people who’ve been around since the beginning is helping, but those seasoned developers are a smaller and smaller percentage of the workforce, while the volume of work has slowly removed them from some of the many products now under management. Adopting coding standards and standardized libraries helps increase experience portability between projects, but doesn’t do enough. Enter offloading to the ADC. Some things just don’t have to be done in code, and if they don’t have to be, at this stage in the company’s growth, IT management at ZapNGo (that’s you!) decides they won’t be. There just isn’t time for non-essential development anymore. Utilizing a policy management tool and/or an Application Firewall on the ADC can improve security without increasing the code base, for example. And that shaves hours off of maintenance projects, while standardizing on one or a few implementations that are simply selected on the ADC. Implementing Web Application Acceleration protocols on the ADC means that less in-code optimization has to occur. Performance is no longer purely the role of developers (but of course it is still a concern. No Web Application Acceleration tool can make a loop that runs for five minutes run faster), they can allow the Web Application Acceleration tool to shrink the amount of data being sent to the users’ browser for you. Utilizing a WAN Optimization ADC tool to improve the performance of bulk copies or backups to a remote datacenter or cloud storage… The list goes on and on. The key is that the ADC enables a lot of opportunities for App Dev to be more responsive to the needs of the organization by moving repetitive tasks to the ADC and standardizing them. And a heaping bonus is that it also does that for operations with a different subset of functionality, meaning one toolset gives both App Dev and Operations a bit more time out of their day for servicing important organizational needs. Some would say this is all part of DevOps, some would say it is not. I leave those discussions to others, all I care is that it can make your apps more secure, fast, and available, while cutting down on workload. And if your ADC supports an SSL VPN, your developers can work from home when necessary. Or more likely, if your code is your IP, a subset of your developers can. Making ZapNGo more responsive, easier to maintain, and more adaptable to the changes coming next week/month/year. That’s what ADCs do. And they’re pretty darned good at it. That brings us to the one bit that I have to caveat with F5 only, and that is iApps. An iApp is a constructed configuration tool that asks a few questions and then deploys all the bits necessary to set up an ADC for a particular application. Why do I mention it here? Well if you have dozens of applications with similar characteristics, you can create an iApp Template and use it to rapidly bring new applications or new instances of applications online. And since it is abstracted, these iApp templates can be designed such that AppDev, or even the business owner, is able to operate them Meaning less time worrying about what network resources will be available, how they’re configured, and waiting for operations to have time to implement them (in an advanced ADC that is being utilized to its maximum in a complex application environment, this can be hundreds of networking objects to configure – all encapsulated into a form). Less time on the project timeline, more time for the next project. Or for the post deployment party. One of the two. That’s it for the F5 only bit. And knowing that all of these items are standardized means less things to get mis-configured, more surety that it will all work right the first time. As with all of these articles, that offers you the most important benefit… A good night’s sleep.234Views0likes0CommentsDevOps. It’s in the Culture, Not Tech.
#F5 DevOps – Managers need to make use of existing technology and adopt culture change. It is entertaining to read all that is currently being written about DevOps. Having been a developer, a development manager, an operations manager, and even a CTO, I can attest to the fact that the “throw it over the wall” syndrome is real, and causes real problems for everyone involved. That is about where my agreement with the current round of pundits ends. The thing is that they talk like there is some fundamental technological reason why DevOps isn’t happening. That’s just not true. For those a little behind in your jargon, DevOps is making operations prevalent in the decisions of your development organization. We’ll take the discussion a little bit at a time. We have virtualization. We have astoundingly good virtualization from VMWare, Microsoft, RedHat, et al. So many of the concerns about development and ops go away immediately. “Developers test in the environment their tools run in!'” is oft-heard in the DevOps conversations. But that’s just not an issue. They can run three different OS’s to test with, and as many browsers in each OS as you want to support – all with a single set of hardware. So make testing in your operational environment mandatory. In fact, for most tools, they’re developing on whatever OS is being targeted anyway, because there are subtle differences – or no support at all – in other OS’s. Virtualization taken hand-in-hand with the capabilities of an Application Delivery Controller (ADC) like F5’s BIG-IP can also remove some of the “throw it over the wall” symptoms easily – particularly in Agile or other high-rev environments. Take a copy of the VM running the app today, modify that copy, place it on the network, then, utilizing one of the many algorithms available for load-balancing, switch a select load to the new server. Make it 1/3rd as likely as any other server to receive a given connection, and utilize persistence so users aren’t bouncing back and forth between revisions of the app. See how it performs under real-life usage. Have the Devs there for the first half hour, then make a “deploy/don’t deploy” decision, and if not deploying, take down the copy with the new code on it so the devs can continue to work on it. If deploying, then bring up copies of the new server and bleed connections off of the old ones, then bring those virtuals down and archive or destroy them, as your organization sees fit. Most of the issues with DevOps are gone at that point. Testing is the only other issue that comes up a lot, and let’s talk frankly about testing. There just aren’t enough really good test tools out there. A test tool needs to automate as close to 100% of the testing process as possible in order for thorough testing to be feasible. The complexity of today’s applications leave most test tools in the dust. When Microsoft had MS-Test available, one of the shops I worked at used it, and one of my friends was an expert at generating test scripts, but we were a software shop. Our product was the software, making it much more critical that there be as close to zero defects as possible. That’s not the case if your product is not software. In those cases, software is a supporting tool to help sell product, and as such the occasional bug, while regrettable and to be avoided, doesn’t reflect upon your entire product line. So there are some tools out there to do testing. They’ll help determine things like buffer overflows and such, and Microsoft is selling Microsoft Test Center, which looks to be a more comprehensive solution, but no program knows your business – and thus the testing context – in the manner that your employees do, and most companies are not willing to shell out the money required to start a dedicated testing group. If you are one of the lucky ones, well you can replicate your entire production environment with VMs and an ADC… But you can’t get the volumes you would see in a live public-facing Internet application with actual personal interaction and all the dumb mistakes we users make. So you can test, after a fashion, and with real people involved even set up intelligent testing, but it will take some effort. The best bet for most shops is to make unit testing part of the developers’ jobs, and system testing part of the project managers’ job, with help from both dev and ops. See, Devops! It is a testament to the quality of tools and developers out there that we don’t see a ton of issues all the time. Think if the Web had the percentage of problems on a daily basis that Windows Apps did early on… We’d never get anything done. So don’t take DevOps so seriously from a technological standpoint, instead, let’s talk about culture. Developers need to feel like part of the larger team if you want them to worry about DevOps. Here’s the catch, most of them don’t want to feel like part of the larger team. Network issues and storage shortages annoy them as much as your business users. They want to write cool apps and shove them out the door (or the Internet connection, more precisely) without worrying about deployment issues too much. It is up to management to take concrete steps to move dev closer to ops. To do this, first mandate that all testing occur on the OS that deployment will occur on. This shouldn’t be a big deal for most shops, but in a few it will be. Second, mandate that the dev team put resources on the deployment, and utilize the Virtualization/ADC scenario discussed above. Third, remind developers that their app isn’t great unless users think it is, and it is deployable. They’ll come around, but it will take some work on your part. The days of developing in a vacuum are long gone (at least in the enterprise), and the disconnect between dev and ops is one of the few remaining artifacts of that time. You just have to remove the artifacts and hook up the strengths of your ops team with the leet app skills on your dev team, and the whole concept of DevOps is significantly reduced. One last bit, troubleshooting when things do go wrong. Developers can build a lot into their apps to give them hints about status, but ops can use tools like iApps (built into TMOS v 11 on F5 BIG-IP LTM) to show that it is indeed the application not responding in a timely manner – or indeed to show exactly what IS the bottleneck in a deployment. The reporting functionality on iApps makes them a worthwhile endeavor without the “ease of infrastructure management” they offer. iApp reporting can tell you exactly which piece of the application environment is slow, dragging the entire system response time down. I think that’s huge. If you have a BIG-IP, check it out, I’m pretty certain you will too. So call a meeting. Make it around lunch time and announce that pizza will be provided. Both teams will show up, and you can start changing culture. The benefits will be long-term, with applications better suiting users needs and requiring less operations man-hours. And devs will get a better feel for what works and doesn’t in your environment.214Views0likes0CommentsEven the best written code has a weakness.
Developers are a great lot of folks, people who spend their day trying to do the impossible with bits for a customer base that is, by and large, impossible to satisfy. When the bits all line up correctly, the last line of code has been checked in, and the nightly compile accepted for deployment, then they get to sit back, relax for five minutes, and start over again. If this makes you think it’s not a great life, then you should live it. Developing gives instant feedback. No matter how unhappy users can be, fixing that nagging bug you’ve been chasing for hours is a rush, and starting with a blank source code file is like looking across a wide-open plain. You can see what might be, and you get to go figure out how to do it. But yeah, it’s high-stress. Deadlines are constant, and it’s not like writing where you have to get your content finished, once the code is done, ten million people want to have input into what you should have done. Various techniques have been developed to mitigate the depressing fact that people tell you what they want after they see what you’ve built, but the fact is, for most ordinary users, be they business users or end users, they don’t know what they want until they see something working on their monitor and can play with it. Because they need a point of comparison. Some few can tell you sight-unseen, early in the process, what they’d like, but most will have increasing demands as the application’s capabilities increase. And these days, there’s one more major gotcha. You have to care about the network. I’ve been saying that for years, but we’ve passed the point where you could ignore me. Some will say “cloud changes all that!” but the truth is, cloud changes the problem domain, not the fact that you have to care. Let’s say you have a web application (as there are precious few other types being developed these days), and you have tweaked it to uber-performance so that it is scalable. You’ve put it behind a load balancer or application delivery controller so that even if your tweaks aren’t enough, you can share the load amongst several copies. You’ve done it all right. And your primary Internet connection goes down. So your network staff switches to the backup connection – which is invariably smaller than the primary. The problem in this scenario is that your application can be load balanced and highly optimized, but now it is fighting for bandwidth on a reduced connection. This is hardly the only scenario in which your application can suffer from outside interference. Ever been on the receiving end of a router configuration error? Your application appears down to everyone in the multi-verse, but in reality, it is responding just peachy but the network is routing your users to Timbuktu. I could tell you about all the great solutions that F5 offers for this problem or that problem (there are many of them, and they’re pretty darned good), but from your perspective, the issue is (or should be) much bigger than that. You need to be able to understand when the problem at hand is a network problem, and you need to be able to diagnose that fact quickly, so the right people are on the job. And that means you need to know networking. Just as importantly, you need to at least viscerally understand your specific network environment. They’re all a bit different, and the likely pain points are different, even though some problems are universal. A DDOS attack, for example, is aimed at clogging your Internet connection, no matter your architecture… But some networking gear reduces the ability of DDOS to actually take the site down, so your network might only see degraded performance. So ask the network team to teach you. Ask them what devices are between your applications and your customers. Ask them how these devices (or their malfunction) impact your applications. Know the environment you’re in, because for most applications today, a problem on the network makes for a poorly performing application. And that is indeed your responsibility. In the cloud you can’t know all of these things for real, but you can understand the concepts. Is there a virtual ADC? What is being used for firewall services? What perf tools are available to determine the bottlenecks of applications deployed in the cloud? All things you’ll want to know, so you can know how best to start troubleshooting when the inevitable problems occur. Learning things like this after your application is the source of user pain seems to still be the norm, but it’s certainly not the best solution. Either it increases the amount of time your application is getting bad PR, or you are fixing things hastily, and haste does indeed make waste in most critical application situations. This knowledge will also give you a new set of tools to solve problems with. If you know that a Web Application Acceleration tool like F5’s WebAccelerator is in place between your application and the user, then you might be able to say “rather than rewrite this chunk of code, let’s tweak the Web Application Acceleration engine to handle it” and save both time and potential coding defect issues. It’s still a great time to be a developer, the fun is still all there, it’s just a more complex world. Master your network architecture, and be a better developer for it. Why Flash can't win the Web application war Virtualization Changes Application Deployment But Not Development Amazon Makes the Cloud Sticky Return of the Web Application Platform Wars Wanted: Application Delivery Network Experts The Stealthy Ascendancy of JSON Now it's Time for Efficiency Gains in the Network. "Application Delivery" Role Missing "Delivery" Focus Finding Your Balance203Views0likes0CommentsApplications. Islands No More.
There was a time when application developers worried only about the hardware they were developing the application for. Those days passed a good long while ago, and then AppDev’s big concern was the OS the application was being developed for. But the burgeoning growth of the World Wide Web combined with the growth of Java and Linux to drive development out of the OS and into the JVM. Then, the developer focused on the JVM in question, or in many cases on the interpreted language interfaces – but not the OS or hardware. For our purposes I have lumped .NET in with JVMs as a simple head-not to where it came from. If you find that offensive, pretend I called them “AVMs” for “Application Virtual Machines”. In the end, interpreted bytecodes, no matter how you slice it. An interesting and increasing trend in this process was the growing dependence of both application and developer on the network. When I was a university lecturer, I made no bones about the fact that application developers would be required to know about the network going forward, and if the students didn’t like that news, they should find an alternative career. I stopped teaching eight years ago to dedicate more time to some of my other passions (like embedded dev, gaming, and my family), so that was a long time ago, and here we are. Developers know more about the network than they ever dreamed they would. Even DBAs have to know a decent amount about general networking to do their jobs. But they don’t know enough, because the network is one of the primary sources of application performance woes. And that makes the network a developers’ problem. Not maintaining it or managing it, but knowing how your applications are reliant upon it, both directly and indirectly, to deliver content to customers. And by knowing what your application needs, influencing changes in the architecture. In the end, your application has to have the appearance of performance to end users. In that sense it is much like the movies. It can look totally ridiculous while they are filming a moving – with people hanging in air and no backdrop or props around them – as long as the final product looks realistic, it impacts the viewers not one bit. But if computer animation or green screening causes the work of actors to look bad, people will be turned off. So it is with applications. If your network causes your application performance to suffer, the vast majority of the population is going to blame the application. The network is a nebulous thing, the application is the thing they were accessing. Technology is catching up to this concept – F5 has its new application based ADC functionality from iApps to application level failover – and that will be a huge boon to developers, but you are still going to have to care – about security, about network performance, about ADCs, about load balancing algorithms. The more you know about the environment your application runs in, the better a developer you will be. It is far better to be able to say “that change would increase traffic from the database to our application and from the application to the end user – meaning out our Internet connection, and unless I’m wrong, we don’t have the bandwidth for that…” than it is to make the change and then discover this simple truth. If you’re lucky enough to have an F5 product in your network, ask about iAPP, it’s well worth looking into, as it brings an application-oriented view to network infrastructure management. And when the decision is made to move to the cloud, you can take your iAPP templates with you. So they aren’t simply a single-architecture tool, running in BIG-IPLTM-VE, they will apply in the cloud also. While you’re learning, you’ll discover that some network appliances can help you do your job. Our Application Protocol Manager and Web Application Manager spring to mind, one helping with common security tasks, the other helping with delivery performance through caching, compression, and TCP optimization of application-specific data. You’ll also get ideas for optimizing the way your application communicates on the network, because you’ll understand the underlying technology better. It’s more to learn, but you have been freed from the nitty-gritty of the hardware and the OS, and in many cases from the difficult parts of algorithmic development. Which means adding high-level networking knowledge is a wash at worst. And it is one more way to separate yourself from the pack of developers, a way that will make you better at what you do.167Views0likes0CommentsLoad Balancing For Developers: Security and TCP Optimizations
It has been a while since I wrote a Load Balancing for Developers installment, and since they’re pretty popular and there’s still a lot about Application Delivery Controllers (ADCs) that are taken for granted in the Networking industry but relatively unknown in the development world, I thought I’d throw one out about making your security more resilient with ADCs. For those who are just joining this series, here’s the full list of posts I’ve tagged as Load Balancing for Developers, though only the ones whose title starts with “Load Balancing for Developers” or “Advance Load Balancing for Developers” were actually written from this perspective, utilizing our fictional web application Zap’N’Go! as an example. This post, like most of them, doesn’t require that you read the other entries in the “Load Balancers for Developers” series, but if you’re interested in the topic, they are all written from the developer’s perspective, and only bring in the networking/ops portions where it makes sense. So your organization has a truly successful web application called Zap’N’Go! That has taken the Internet by storm. Your hits are in the thousands an hour, and orders are rolling in. All was going well until your server couldn’t keep up and you went to a load balanced scenario so that multiple servers could share the load. The problem is that with the money you’ve generated off of Zap’N’Go, you’ve bought a competitor and started several new web applications, set up a forum or portal for your customers to communicate with you and each other directly, and are using the old datacenter from the company you purchased as a redundant datacenter in case the worst should happen. And all of that means that you are suffering server (and VM) sprawl. The CPU cycles being eaten up by your applications are truly astounding, and you’re looking into ways to drive them down. Virtualization helped you to be more agile in responding to the requests of the business, but also brings a lot of management overhead in making certain servers aren’t overloaded with too high a virtual density. One of the cool bits about an ADC is that they do a lot more than load balance, and much of that can be utilized to improve application performance without re-architecting the entire system. While there are a lot of ways that an ADC can improve application performance, we’ll look at a couple of easy ones here, and leave some of the more difficult or involved ones for another time. That keeps me in writing topics, and makes certain that I can give each one the attention it deserves in the space available. The biggest and most obvious improvement in an ADC is of course load balancing. This blog assumes you already have an ADC in place, and load balancing was your primary reason for purchasing it. While I don’t have market numbers in front of me, it is my experience that this is true of the vast majority of ADC customers. If you have overburdened web applications and have not looked into load balancing, before you go rewriting your entire system, take a look at the rest of this series. There really are options out there to help. After that win, I think the biggest place – in a virtualized environment – that developers can reap benefits from an ADC is one that developers wouldn’t normally think of. That’s the reason for this series, so I suppose that would be a good thing. Nearly every application out there hits a point where SSL is enabled. That point may be simply the act of accessing it, or it may be when they go to the “shopping cart” section of the web site, but they all use SSL to protect sensitive user data being passed over the Internet. As a developer, you don’t have to care too much about this fact. Pay attention to the protocol if you’re writing at that level and to the ports if you have reason to, but beyond that you don’t have to care. Networking takes care of all of that for you. But what if you could put a request in to your networking group that would greatly improve performance without changing a thing in your code and from a security perspective wouldn’t change much – most companies would see it as not changing anything, while a few will want to talk about it first? What if you could make this change over lunch and users wouldn’t know the difference? Here’s the background. SSL Encryption is expensive in terms of CPU cycles. No doubt you know that, most developers have to face this issue head-on at some point. It takes a lot of power to do encryption, and while commodity hardware is now fast enough that it isn’t a problem on a stand-alone server, in a VM environment, the number of applications requesting SSL encryption on the same physical hardware is many times what it once was. That creates a burden that, at this time at least, often drags on the hardware. It’s not the fault of any one application or a rogue programmer, it is the summation of the burdens placed by each application requiring SSL translation. One solution to this problem is to try and manage VM deployment such that encryption is only required on a couple of applications per physical server, but this is not a very appealing long-term solution as loads shift and priorities change. From a developers’ point of view, do you trust the systems/network teams to guarantee your application is not sharing hardware with a zillion applications that all require SSL encryption? Over time, this is not going to be their number one priority, and when performance troubles crop up, the first place that everyone looks in an in-house developed app is at the development team. We could argue whether that’s the right starting point or not, but it certainly is where we start. Another, more generic solution is to take advantage of a non-development feature of your ADC. This feature is SSL termination. Since the ADC sits between your application and the Internet, you can tell your ADC to handle encryption for your application, and then not worry about it again. If your network team sets this up for all of your applications, then you have no worries that SSL is burning up your CPU cycles behind your back. Is there a negative? A minor one that most organizations (as noted above) just won’t see as an issue. That is that from the ADC to your application, communications will happen in the clear. If your application is internal, this really isn’t a big deal at all. If you suspect a bad-guy on your internal network, you have much more to worry about than whether communications between two boxes are in the clear. If you application is in the cloud, this concern is more realistic, but in that case, SSL termination is limited in usefulness anyway because you can’t know if the other apps on the same hardware are utilizing it. So you simply flick a switch on your ADC to turn on SSL termination, and then turn it off on your applications, and you have what the ADC industry calls “SSL offload”. If your ADC is purpose-built hardware (like our BIG-IP), then there is encryption hardware in the box and you don’t have to worry about the impact to the ADC of overloading it with SSL requests, it’s built to handle the load. If your ADC is software or a VM (like our BIG-IP LTM VE), then you’ll have to do a bit of testing to see what the tolerance level for SSL load is on the hardware you deployed it on – but you can ask the network staff to worry about all of that, once you’ve started the conversation. Is this the only security-based performance boost you can get? No, but it is the easy one. Everything on the Internet remains encrypted, but your application is not burdening the server’s CPU with encryption requests each time communications in or out occur. The other easy one is TCP optimizations. This one requires less talk because it is completely out of the realm of the developer. Simply put, TCP is a well designed protocol that sometimes gets bogged down communicating and has a lot of overhead in those situations. Turning on TCP optimizations in your ADC can reduce the overhead – more or less, depending upon what is on the other end of the communications network – and improve perceived performance, which honestly is one of the most important measures of web application availability. By making it seem to load faster, you’ve improved your customer experience, and nothing about your development has to change. TCP optimizations are not new, and thus the ones that are turned on when you activate the option on most ADCs are stable and won’t disrupt most applications. Of course you should run a short test cycle with them enabled, just to be certain, but I would be surprised if you saw any issues. They’re not unheard of, but they are very rare. That’s enough for now, I think. I don’t want these to get so long that you wander off to develop some more. Keep doing what you do. And strive to keep your users from doing this. Slow apps anger users228Views0likes0CommentsForce Multipliers and Strategic Points of Control Revisited
On occasion I have talked about military force multipliers. These are things like terrain and minefields that can make your force able to do their job much more effectively if utilized correctly. In fact, a study of military history is every bit as much a study of battlefields as it is a study of armies. He who chooses the best terrain generally wins, and he who utilizes tools like minefields effectively often does too. Rommel in the desert often used Wadis to hide his dreaded 88mm guns – that at the time could rip through any tank the British fielded. For the last couple of years, we’ve all been inundated with the story of The 300 Spartans that held off an entire army. Of course it was more than just the 300 Spartans in that pass, but they were still massively outnumbered. Over and over again throughout history, it is the terrain and the technology that give a force the edge. Perhaps the first person to notice this trend and certainly the first to write a detailed work on the topic was von Clausewitz. His writing is some of the oldest military theory, and much of it is still relevant today, if you are interested in that type of writing. For those of us in IT, it is much the same. He who chooses the best architecture and makes the most of available technology wins. In this case, as in a war, winning is temporary and must constantly be revisited, but that is indeed what our job is – keeping the systems at their tip-top shape with the resources available. Do you put in the tool that is the absolute best at what it does but requires a zillion man-hours to maintain, or do you put in the tool that covers everything you need and takes almost no time to maintain? The answer to that question is not always as simple as it sounds like it should be. By way of example, which solution would you like your bank to put between your account and hackers? Probably a different one than the one you would you like your bank to put in for employee timekeeping. An 88 in the desert, compliments of WW2inColor Unlike warfare though, a lot of companies are in the business of making tools for our architecture needs, so we get plenty of options and most spaces have a happy medium. Instead of inserting all the bells and whistles they inserted the bells and made them relatively easy to configure, or they merged products to make your life easier. When the terrain suits a commanders’ needs in wartime, the need for such force multipliers as barbed wire and minefields are eliminated because an attacker can be channeled into the desired defenses by terrain features like cliffs and swamps. The same could be said of your network. There are a few places on the network that are Strategic Points of Control, where so much information (incidentally including attackers, though this is not, strictly speaking, a security blog) is funneled through that you can increase your visibility, level of control, and even implement new functionality. We here at F5 like to talk about three of them… Between your users and the apps they access, between your systems and the WAN, and between consumers of file services and the providers of those services. These are places where you can gather an enormous amount of information and act upon that information without a lot of staff effort – force multipliers, so to speak. When a user connects to your systems, the strategic point of control at the edge of your network can perform pre-application-access security checks, route them to a VPN, determine the best of a pool of servers to service their requests, encrypt the stream (on front, back, or both sides), redirect them to a completely different datacenter or an instance of the application they are requesting that actually resides in the cloud… The possibilities are endless. When a user accesses a file, the strategic point of control between them and the physical storage allows you to direct them to the file no matter where it might be stored, allows you to optimize the file for the pattern of access that is normally present, allows you to apply security checks before the physical file system is ever touched, again, the list goes on and on. When an application like replication or remote email is accessed over the WAN, the strategic point of control between the app and the actual Internet allows you to encrypt, compress, dedupe, and otherwise optimize the data before putting it out of your bandwidth-limited, publicly exposed WAN connection. The first strategic point of control listed above gives you control over incoming traffic and early detection of attack attempts. It also gives you force multiplication with load balancing, so your systems are unlikely to get overloaded unless something else is going on. Finally, you get the security of SSL termination or full-stream encryption. The second point of control gives you the ability to balance your storage needs by scripting movement of files between NAS devices or tiers without the user having to see a single change. This means you can do more with less storage, and support for cloud storage providers and cloud storage gateways extends your storage to nearly unlimited space – depending upon your appetite for monthly payments to cloud storage vendors. The third force-multiplies the dollars you are spending on your WAN connection by reducing the traffic going over it, while offloading a ton of work from your servers because encryption happens on the way out the door, not on each VM. Taking advantage of these strategic points of control, architectural force multipliers offers you the opportunity to do more with less daily maintenance. For instance, the point between users and applications can be hooked up to your ADS or LDAP server and be used to authenticate that a user attempting to access internal resources from… Say… and iPad… is indeed an employee before they ever get to the application in question. That limits the attack vectors on software that may be highly attractive to attackers. There are plenty more examples of multiplying your impact without increasing staff size or even growing your architectural footprint beyond the initial investment in tools at the strategic point of control. For F5, we have LTM at the Application Delivery Network Strategic Point of Control. Once that investment is made, a whole raft of options can be tacked on – APM, WOM, WAM, ASM, the list goes on again (tired of that phrase for this blog yet?). Since each resides on LTM, there is only one “bump in the wire”, but a ton of functionality that can be brought to bear, including integration with some of the biggest names in applications – Microsoft, Oracle, IBM, etc. Adding business value like remote access for devices, while multiplying your IT force. I recommend that you check it out if you haven’t, there is definitely a lot to be gained, and it costs you nothing but a little bit of your precious time to look into it. No matter what you do, looking closely at these strategic points of control and making certain you are using them effectively to meet the needs of your organization is easy and important. The network is not just a way to hook users to machines anymore, so make certain that’s not all you’re using it for. Make the most of the terrain. And yes, if you also read Lori’s blog, we were indeed watching the same shows, and talking about this concept, so no surprise our blogs are on similar wavelengths. Related Blogs: What is a Strategic Point of Control Anyway? Is Your Application Infrastructure Architecture Based on the ... F5 Tech Field Day – Intro To F5 As A Strategic Point Of Control What CIOs Can Learn from the Spartans What We Learned from Anonymous: DDoS is now 3DoS What is Network-based Application Virtualization and Why Do You ... They're Called Black Boxes Not Invisible Boxes Service Virtualization Helps Localize Impact of Elastic Scalability F5 Friday: It is now safe to enable File Upload262Views0likes0CommentsGotta Catch Em All. Multiple bottlenecks are a part of the IT lifestyle
My older children, like most kids in their age group, all played with or collected Pokemon cards. Just like I and all of my friends had GI Joes and discussed the strengths and weaknesses of Kung-fu grip versus hard hands, they and all of their friends sat around talking about how much cooler their current favorite Pokemon card was compared to all of the others. We let them play and kept an eye on how cards were being passed about the group (they’re small and tend to walk off, so we patrolled a bit, but otherwise stayed out of the way). And the interesting thing about Pokemon – or any other Collectible Card Game – is that as soon as you’ve settled your discussion about which card is “best”, someone picks a new favorite so you can rehash all the same issues with this new card in the mix. People – mostly but not exclusively children - honestly spend hours at this pass-time, and every time they resolve the differences, it starts all over again. The point of Pokemon is to catch and train little creatures (build a deck of cards) that will, on your command, battle other little creatures (the other players’ card deck) for supremacy. But that’s often lost in the discussions of which individual card or small combinations of cards is “best”. Everyone has their favorites and a focused direction, so these conversations can grow quite heated. It is no mistake that I’m discussing Pokemon in an IT blog. Our role is to support the business with applications that will allow them to do their job, or do their job better, or do things the competition can’t do. That’s why we’re here. But everyone in IT has a focus and direction – Developer, Architect, Network Admin, Systems Admin, Storage Admin, Business Analyst… The list goes on – and sometimes our conversations about how to best serve the business get quite heated. More importantly, sometimes the point of IT – to support the business – gets lost in examining the minutiae, just like comparing two Pokemon cards when there are hundreds of cards to build decks from. There are a few – like Charizard pictured here – that are special until they’re superseded by even cooler cards. But a lot of what we do is written in stone, and is easily lost in the shuffle. Just as no one champions the basic “energy” cards in Pokemon – because they don’t DO anything by themselves – we often don’t discuss some of the basic issues IT always has and always will struggle with, because they’re known, set in stone, and should be self-evident. Or at least we think they should. So I’ll remind you of one of the basics, and perhaps that will spur you to keep the simple stuff in mind whilst arguing over the coolest new toy in the datacenter. Image courtesy of Pokebeach.com The item I’ve chosen? There is never one bottleneck. It is a truth. If you find and eliminate the performance bottleneck of your application, you have not resolved all problems, you have simply removed a roadblock on the way to the next bottleneck. A system that ran fine last week may not be running fine this week because a new bottleneck threshold has been hit. And the bottlenecks are always – always inter-related. (Warning – of course I reference F5 products in this list, if you have other vendors, insert their names) Consider this, your web app is having performance problems, and you track it down to your network card utilization. So you upgrade the server or throw it behind your BIG-IP (or other ADC or a load balancer), and the problem is resolved. So now your CPU utilization is fine, but the application’s performance degrades again relatively quickly. You go researching and discover that your new bottleneck is storage. Too many high-access files on a single NAS device is slowing down simple file reads and writes. So you move your web servers to use a different NAS device (downright simple if you have ARX in-house, not too terribly difficult if you don’t), and a couple of weeks later users are complaining again. You dig and research, and all seems well to you, but there are enough complaints that you are pretty certain there’s a problem. So you call up a coworker in a remote office and have them check. They say performance stinks. So you go home that night and try it from home, and sure enough, outside the building performance stinks. Inside, it’s fine. Now your problem is your Internet connection. So you check the statistics, and back-end services like replication are burying your Internet connection. So you do some research and decide that your problems are best addressed by reducing the bandwidth required for those back-end processes and setting guaranteed bandwidth numbers for HTTP traffic. Enter WAN Optimization. If you’re an F5 customer, you just add WOM to your BIG-IP and configure it. Other vendors have a few more steps, but not terribly more than if you were not an F5 customer and bought BIG-IP with WOM to solve this problem. And once all of that clears up, guess what? We’re back to Pikachu. Your two servers, now completely cleared of other bottlenecks, are servicing so many requests that their CPU utilization is spiking. Time for a third server. Now this whole story sounds simple, but it isn’t. Network, Storage, Systems, all fall under the bailiwick of different groups within IT. It is never so easy as the above paragraph makes it sound… I’ve glossed over the long nights, the endless status meetings, the frustration of not finding the bottleneck right away – mine are obvious only because I list them, I skip the part where you check fifty other things first. And inevitable, there is the discussion of what’s the right solution to a given problem that starts to sound like people who discuss the “best” Pokemon card. Someone wants to cut back on the amount of bandwidth back-office applications use by turning off services, someone wants to buy a bigger pipe, someone suggests WAN optimization, and we go a few rounds until we settle on a plan that’s best for the organization in question. But in the end, keeping the business going and customers happy is the key to IT. Sure, clearing up one bottleneck will create another and spawn another round of “right solution” discussions, but that’s the point. It’s why you’re there. You have the skills and the expertise the company needs to keep moving forward, and this is how they’re applied. And along the way you’ll get to find the new hot toy in the datacenter and propose it as the right solution to everything, because it is your Charizard – until the next round of discussion anyway. And admit it, this stuff is fun, just like the game. Choosing the right solution, getting it implemented, that’s what drives all good IT people. Figuring out problems that are complex enough to be called rocket science under pressure that is sometimes oppressive. But the rush is there when the solution is in and is right. And it’s often a team effort by all of the different groups in IT. I personally think IT should throw itself more parties, but I guess we’ll just have to settle for more dinner-at-the-desk moments for the time being.222Views0likes0Comments