disaster recovery
33 TopicsMultiple Certs, One VIP: TLS Server Name Indication via iRules
An age old question that we’ve seen time and time again in the iRules forums here on DevCentral is “How can I use iRules to manage multiple SSL certs on one VIP"?”. The answer has always historically been “I’m sorry, you can’t.”. The reasoning is sound. One VIP, one cert, that’s how it’s always been. You can’t do anything with the connection until the handshake is established and decryption is done on the LTM. We’d like to help, but we just really can’t. That is…until now. The TLS protocol has somewhat recently provided the ability to pass a “desired servername” as a value in the originating SSL handshake. Finally we have what we’ve been looking for, a way to add contextual server info during the handshake, thereby allowing us to say “cert x is for domain x” and “cert y is for domain y”. Known to us mortals as "Server Name Indication" or SNI (hence the title), this functionality is paramount for a device like the LTM that can regularly benefit from hosting multiple certs on a single IP. We should be able to pull out this information and choose an appropriate SSL profile now, with a cert that corresponds to the servername value that was sent. Now all we need is some logic to make this happen. Lucky for us, one of the many bright minds in the DevCentral community has whipped up an iRule to show how you can finally tackle this challenge head on. Because Joel Moses, the shrewd mind and DevCentral MVP behind this example has already done a solid write up I’ll quote liberally from his fine work and add some additional context where fitting. Now on to the geekery: First things first, you’ll need to create a mapping of which servernames correlate to which certs (client SSL profiles in LTM’s case). This could be done in any manner, really, but the most efficient both from a resource and management perspective is to use a class. Classes, also known as DataGroups, are name->value pairs that will allow you to easily retrieve the data later in the iRule. Quoting Joel: Create a string-type datagroup to be called "tls_servername". Each hostname that needs to be supported on the VIP must be input along with its matching clientssl profile. For example, for the site "testsite.site.com" with a ClientSSL profile named "clientssl_testsite", you should add the following values to the datagroup. String: testsite.site.com Value: clientssl_testsite Once you’ve finished inputting the different server->profile pairs, you’re ready to move on to pools. It’s very likely that since you’re now managing multiple domains on this VIP you'll also want to be able to handle multiple pools to match those domains. To do that you'll need a second mapping that ties each servername to the desired pool. This could again be done in any format you like, but since it's the most efficient option and we're already using it, classes make the most sense here. Quoting from Joel: If you wish to switch pool context at the time the servername is detected in TLS, then you need to create a string-type datagroup called "tls_servername_pool". You will input each hostname to be supported by the VIP and the pool to direct the traffic towards. For the site "testsite.site.com" to be directed to the pool "testsite_pool_80", add the following to the datagroup: String: testsite.site.com Value: testsite_pool_80 If you don't, that's fine, but realize all traffic from each of these hosts will be routed to the default pool, which is very likely not what you want. Now then, we have two classes set up to manage the mappings of servername->SSLprofile and servername->pool, all we need is some app logic in line to do the management and provide each inbound request with the appropriate profile & cert. This is done, of course, via iRules. Joel has written up one heck of an iRule which is available in the codeshare (here) in it's entirety along with his solid write-up, but I'll also include it here in-line, as is my habit. Effectively what's happening is the iRule is parsing through the data sent throughout the SSL handshake process and searching for the specific TLS servername extension, which are the bits that will allow us to do the profile switching magic. He's written it up to fall back to the default client SSL profile and pool, so it's very important that both of these things exist on your VIP, or you may likely find yourself with unhappy users. One last caveat before the code: Not all browsers support Server Name Indication, so be careful not to implement this unless you are very confident that most, if not all, users connecting to this VIP will support SNI. For more info on testing for SNI compatibility and a list of browsers that do and don't support it, click through to Joel's awesome CodeShare entry, I've already plagiarized enough. So finally, the code. Again, my hat is off to Joel Moses for this outstanding example of the power of iRules. Keep at it Joel, and thanks for sharing! 1: when CLIENT_ACCEPTED { 2: if { [PROFILE::exists clientssl] } { 3: 4: # We have a clientssl profile attached to this VIP but we need 5: # to find an SNI record in the client handshake. To do so, we'll 6: # disable SSL processing and collect the initial TCP payload. 7: 8: set default_tls_pool [LB::server pool] 9: set detect_handshake 1 10: SSL::disable 11: TCP::collect 12: 13: } else { 14: 15: # No clientssl profile means we're not going to work. 16: 17: log local0. "This iRule is applied to a VS that has no clientssl profile." 18: set detect_handshake 0 19: 20: } 21: 22: } 23: 24: when CLIENT_DATA { 25: 26: if { ($detect_handshake) } { 27: 28: # If we're in a handshake detection, look for an SSL/TLS header. 29: 30: binary scan [TCP::payload] cSS tls_xacttype tls_version tls_recordlen 31: 32: # TLS is the only thing we want to process because it's the only 33: # version that allows the servername extension to be present. When we 34: # find a supported TLS version, we'll check to make sure we're getting 35: # only a Client Hello transaction -- those are the only ones we can pull 36: # the servername from prior to connection establishment. 37: 38: switch $tls_version { 39: "769" - 40: "770" - 41: "771" { 42: if { ($tls_xacttype == 22) } { 43: binary scan [TCP::payload] @5c tls_action 44: if { not (($tls_action == 1) && ([TCP::payload length] > $tls_recordlen)) } { 45: set detect_handshake 0 46: } 47: } 48: } 49: default { 50: set detect_handshake 0 51: } 52: } 53: 54: if { ($detect_handshake) } { 55: 56: # If we made it this far, we're still processing a TLS client hello. 57: # 58: # Skip the TLS header (43 bytes in) and process the record body. For TLS/1.0 we 59: # expect this to contain only the session ID, cipher list, and compression 60: # list. All but the cipher list will be null since we're handling a new transaction 61: # (client hello) here. We have to determine how far out to parse the initial record 62: # so we can find the TLS extensions if they exist. 63: 64: set record_offset 43 65: binary scan [TCP::payload] @${record_offset}c tls_sessidlen 66: set record_offset [expr {$record_offset + 1 + $tls_sessidlen}] 67: binary scan [TCP::payload] @${record_offset}S tls_ciphlen 68: set record_offset [expr {$record_offset + 2 + $tls_ciphlen}] 69: binary scan [TCP::payload] @${record_offset}c tls_complen 70: set record_offset [expr {$record_offset + 1 + $tls_complen}] 71: 72: # If we're in TLS and we've not parsed all the payload in the record 73: # at this point, then we have TLS extensions to process. We will detect 74: # the TLS extension package and parse each record individually. 75: 76: if { ([TCP::payload length] >= $record_offset) } { 77: binary scan [TCP::payload] @${record_offset}S tls_extenlen 78: set record_offset [expr {$record_offset + 2}] 79: binary scan [TCP::payload] @${record_offset}a* tls_extensions 80: 81: # Loop through the TLS extension data looking for a type 00 extension 82: # record. This is the IANA code for server_name in the TLS transaction. 83: 84: for { set x 0 } { $x < $tls_extenlen } { incr x 4 } { 85: set start [expr {$x}] 86: binary scan $tls_extensions @${start}SS etype elen 87: if { ($etype == "00") } { 88: 89: # A servername record is present. Pull this value out of the packet data 90: # and save it for later use. We start 9 bytes into the record to bypass 91: # type, length, and SNI encoding header (which is itself 5 bytes long), and 92: # capture the servername text (minus the header). 93: 94: set grabstart [expr {$start + 9}] 95: set grabend [expr {$elen - 5}] 96: binary scan $tls_extensions @${grabstart}A${grabend} tls_servername 97: set start [expr {$start + $elen}] 98: } else { 99: 100: # Bypass all other TLS extensions. 101: 102: set start [expr {$start + $elen}] 103: } 104: set x $start 105: } 106: 107: # Check to see whether we got a servername indication from TLS. If so, 108: # make the appropriate changes. 109: 110: if { ([info exists tls_servername] ) } { 111: 112: # Look for a matching servername in the Data Group and pool. 113: 114: set ssl_profile [class match -value [string tolower $tls_servername] equals tls_servername] 115: set tls_pool [class match -value [string tolower $tls_servername] equals tls_servername_pool] 116: 117: if { $ssl_profile == "" } { 118: 119: # No match, so we allow this to fall through to the "default" 120: # clientssl profile. 121: 122: SSL::enable 123: } else { 124: 125: # A match was found in the Data Group, so we will change the SSL 126: # profile to the one we found. Hide this activity from the iRules 127: # parser. 128: 129: set ssl_profile_enable "SSL::profile $ssl_profile" 130: catch { eval $ssl_profile_enable } 131: if { not ($tls_pool == "") } { 132: pool $tls_pool 133: } else { 134: pool $default_tls_pool 135: } 136: SSL::enable 137: } 138: } else { 139: 140: # No match because no SNI field was present. Fall through to the 141: # "default" SSL profile. 142: 143: SSL::enable 144: } 145: 146: } else { 147: 148: # We're not in a handshake. Keep on using the currently set SSL profile 149: # for this transaction. 150: 151: SSL::enable 152: } 153: 154: # Hold down any further processing and release the TCP session further 155: # down the event loop. 156: 157: set detect_handshake 0 158: TCP::release 159: } else { 160: 161: # We've not been able to match an SNI field to an SSL profile. We will 162: # fall back to the "default" SSL profile selected (this might lead to 163: # certificate validation errors on non SNI-capable browsers. 164: 165: set detect_handshake 0 166: SSL::enable 167: TCP::release 168: 169: } 170: } 171: }3.8KViews0likes18CommentsDeploying BIG-IP VE in VMware vCloud Director
Beginning with BIG-IP version 11.2, you may have noticed a new package in the Virtual Edition downloads folder for vCloud Director 1.5. VMware’s vCloud Director is a software solution enabling enterprises to build multi-tenant private clouds. Each virtual datacenter has its own resource set of cpu, memory, and disk that the vDC owner can allocate as necessary. F5 DevCentral is now running in these virtual datacenter configurations (as announced June 13th, 2012), with full BIG-IP VE infrastructure in place. This article will describe the deployment process to get BIG-IP VE installed and running in the vCloud Director environment. Uploading the vCloud Image The upload process is fairly simple, but it does take a while. First, after logging in to the vCloud interface, click catalogs, then select your private catalog. Once in the private catalog, click the upload button highlighted below. This will launch a pop up. Make sure the vCloud zip file has been extracted. When the .ovf is selected in this screen, it will grab that as well as the disk file after clicking upload. Now get a cup of coffee. Or a lot of them, this takes a while. Deploying the BIG-IP VE OVF Template Now that the image is in place, click on my cloud at the top navigation, select vApps, then select the plus sign, which will create a new vApp. (Or, the BIG-IP can be deployed into an existing vApp as well.) Select the BIG-IP VE template (bigip11_2 in the screenshot below) and click next. Give the vApp a name and click next. Accept the F5 EULA and click next. At this point, give the VM a full name and a computer name and click finish. I checked the network adapter box to show the network adapter type. It is not configurable at this point, and the flexible NIC is not the right one. After clicking finish, the system will create the vApp and build the VM, so maybe it’s time for another cup of coffee. Once the build is complete, click into the vapp_test vApp. Right-click on the testbigip-11-2 VM and select properties. Do NOT power on the VM yet! CPU and memory should not be altered. More CPU won’t help TMM, there is no CMP yet in the virtual edition and one extra CPU for system stuff is sufficient. TMM can’t schedule more than 4G of RAM either. Click the “Show network adapter type” and again you’ll notice the NICs are not correct. Delete all the network interfaces, then re-add one at a time as many (up to 10 in vCloud Director) NICs as is necessary for your infrastructure. To add a NIC, just click the add button and then select the network dropdown and select Add Network. At this point, you’ll need to already have a plan for your networking infrastructure. Organizational networks are usable in and between all vApps, whereas vApp networks are isolated to just that instance. I’ll show organizational network configuration in this article. Click Organization network and then click next. Select the appropriate network and click next. I’ve selected the Management network. For the management NIC I’ll leave the adapter type as E1000. The IP Mode is useful for systems where guest customization is enabled, but is still a required setting. I set it to Static-Manual and enter the self IP addresses assigned to those interfaces. This step is still required within the F5, it will not auto-configure the vlans and self IPs for you. For the remaining NICs that you add, make sure to set the adapter type to VMXNET 3. Then click OK to apply the new NIC configurations. *Note that adding more than 5 NICs in VE might cause the interfaces to re-order internally. If this happens, you’ll need to map the mac address in vCloud to the mac addresses reported in tmsh and adjust your vlans accordingly. Powering Up! After the configuration is updated, right-click on the testbigip-11-2 VM and select power on. After the VM powers on, BIG-IP VE will boot. Login with root/default credentials and type config at the prompt to set the management ip and netmask. Select No on auto-configuration Set the IP address. Then set the netmask. I selected no on the default route, but it might be necessary depending on the infrastructure you have in place. Finally, accept the settings. At this point, the system should be available on the management network. I have a linux box on that network as well so I can ssh into the BIG-IP VE to perform the licensing steps as the vCloud Director console does not support copy/paste.303Views0likes2CommentsThat Other Single Point of Failure
When you’re a kid at the beach, you spend a lot of time and effort building a sand castle. It’s cool, a lot of fun, and doomed to destruction. When high tide, or random kids, or hot sun come along, the castle is going to fall apart. It doesn’t matter, kids build them every year by the thousands, probably by the millions across the globe. Each is special and unique, each took time and effort, and each will fall apart. The thing is, they’re all over the globe, and seasons are different all over the globe, so it is conceivable that there is a sand castle built or being built every minute of every day. Not easily provable, but doesn’t need to be for this discussion. when it is night and middle of winter in the northern reaches of North America, it is summer and daytime in Australia. The opportunity for continuation of sand castles is amazing. Unless you’re in publishing or high-tech, it is likely that our entire organization is a single point of failure. Distributed applications make sense so that you can minimize risk and maximize uptime, right? The cloud is often billed as more resistant to downtime precisely because it is distributed. And your organization? Is it distributed? Really, spread out so that it can’t be impacted by something like Sandy? There are a good number of organizations that are nearly 100% off-line right now because there is no power in the Northeast. That was not a possibility, it was an inevitability. Power outages happen, and they sometimes happen on a grand scale (remember the cascading midwest/northeast/Canada outage a couple years back – that was not natural disaster, it was design and operator error). And yet, even companies with a presence in the cloud clustered their employees in one geographic area. There is a tendency amongst some to want face-to-face meetings, assuming those are more productive, which leads to desiring everyone be on-site. With increasing globalization, and meetings held around the world - long before I became a remote worker, I held meetings with staff in Africa, Russia, and California, all on the same (very long) day, and all from my home in Green Bay – one would think this tendency would be minimized, but it does not seem to be. The result is predictable. I once worked as a Strategic Architect for a life insurance company. They had a complete replica of the datacenter in a different geographic region, on the grounds that a disaster so horrible as to take out the datacenter would be exactly the scenario in which that backup would be needed. But guess where the staff was? Yeah, at the primary. The systems would have been running fine, but the IT knowledge, business knowledge, and claims adjustment would all have been in the middle of a disaster. Don’t make that mistake. Today, most organizations with multiple datacenters have DR plans that cover shifting all the load away from one of them should there be a problem, but those organizations with a single datacenter don’t have that leisure, and neither of them necessarily have a plan for continuation of actual work. Consider your options, consider how you will get actual business up to speed as quickly as possible. Losing their jobs because the business was not viable for weeks is not a great plan for helping people recover from disaster. Even with the cloud, there is critical corporate knowledge out there that makes your organization tick. It needs to be geographically distributed. It matter not what systems are in the cloud if all of the personnel to make them work are in the middle of a blackout zone. In short, think sand castles. If you have multiple datacenters, make certain your IT and business knowledge is split between them well enough to continue operations in a bad scenario. If you don’t have multiple locations, consider remote workers. Some people are just not cut out for telecommuting (I hate that phrase, since telecomm has little to do with the daily work, but it’s what we have), others do fine at it. Find some fine ones that have, or can be trained to have, the knowledge required to keep the organizations’ doors open. It could save the company a lot of money, and people a lot of angst. And your customers will be pleased too. The key is putting the right people and the right skills out there. Spread them across datacenters or geographies, so you’re distributed as well as your apps. And while you’re at it, broadening the pool of available talent means you can get some hires you might never have gotten if relocation was required. And all of that is a good thing. Like sand castles. Meanwhile, keep America’s northern east coast in your thoughts, that’s a lot of people in a little space without the amenities they’re accustomed to. Related Articles and Blogs: When Planning Disaster Recovery, Don't Forget the Small Stuff When The Walls Come Tumbling Down. First American Improves Performance and Mobility with F5 and ... Quick! The Data Center Just Burned Down, What Do You Do? First American: an F5 customer Fast DNS - DevCentral Wiki Let me tell you Where To Go. Migrating DevCentral to the Cloud Virtualize Absolutely Everything! Part II Deploying Viprion 2400 with ... DevCentral Top5 05/29/2012177Views0likes0CommentsIn the Cloud, It's the Little Things That Get You. Here are nine of them.
#F5 Eight things you need to consider very carefully when moving apps to the cloud. Moving to a model that utilizes the cloud is a huge proposition. You can throw some applications out there without looking back – if they have no ties to the corporate datacenter and light security requirements, for example – but most applications require quite a bit of work to make them both mobile and stable. Just connections to the database raise all sorts of questions, and most enterprise level applications require connections to DC databases. But these are all problems people are talking about. There are ways to resolve them, ugly though some may be. The problems that will get you are the ones no one is talking about. So of course, I’m happy to dive into the conversation with some things that would be keeping me awake were I still running a datacenter with a lot of interconnections and getting beat up with demands for cloudy applications. The last year has proven that cloud services WILL go down, you can’t plan like it won’t, regardless of the hype. When they do, your databases must be 100% in synch, or business will be lost. 100%. Your DNS infrastructure will need attention, possibly for the first time since you installed it. Serving up addresses from both local and cloud providers isn’t so simple. Particularly during downtimes. Security – both network and app - will have to be centralized. You can implement separate security procedures for each deployment environment, but you are only as strong as your weakest link, and your staff will have to remember which policies apply where if you go that route. Failure plans will have to be flexible. What if part of your app goes down? What if the database is down, but the web pages are fine – except for that “failed to connect to database” error? No matter what the hype says, the more places you deploy, the more likelihood that you’ll have an outage. The IT Managers’ role is to minimize that increase. After a failure, recovery plans will also need to be flexible. What if part of your app comes up before the rest? What if the database spins up, but is now out of synch with your backup or alternate database? When (not if) a security breech occurs on a cloud hosted server, how much responsibility does the cloud provider have to help you clean up? Sometimes it takes more than spinning down your server to clean up a mess, after all. If you move mission-critical data to the cloud, how are you protecting it? Contrary to the wild claims of the clouderati, your data is in a location you do not have 100% visibility into, you’re going to have to take extra steps to protect it. If you’re opening connections back to the datacenter from the cloud, how are you protecting those connections? They’re trusted server to trusted server, but “trusted” is now relative. Of course there are solutions brewing for most of these problems. Here are the ones I am aware of, I guarantee that, since I do not “read all of the Internets” each day (Lori does), I’m missing some, but it can get you started. Just include cloud in your DR plans, what will you do if service X disappears? Is the information on X available somewhere else? Can you move the app elsewhere and update DNS quickly enough? Global Server Load Balancing (GSLB) will help with this problem and others on the list – it will eliminate the DNS propagation lag at least. But beware, for many cloud vendors it is harder to do DR. Check what capabilities your provider supports. There are tools available that just don’t get their fair share of thunder, IMO – like Oracle GoldenGate – that replicate each SQL command to a remote database. These systems create a backup that exactly mirrors the original. As long as you don’t get a database modifying attack that looks valid to your security systems, these architectures and products are amazing. People generally don’t care where you host apps, as long as when they type in the URL or click on the URL, it takes them to the correct location. Global DNS and GSLB will take care of this problem for you. Get policy-based security that can be deployed anywhere, including the cloud, or less attractively (and sometimes impractically), code security into the app so the security moves with it. Application availability will have to go through another round like it did when we went distributed and then SOA. Apps will have to be developed with an eye to “is critical service X up?” where service X might well be in a completely different location from the app. If not, remedial steps will have to occur before the App can claim to be up. Or local Load Balancing can buffer you by making service X several different servers/virtuals. What goes down (hopefully) must come back up. But the same safety steps implemented in #5 will cover #6 nicely, for the most part. Database consistency checks are the big exception, do those on recovery. Negotiate this point if you can. Lots of cloud providers don’t feel the need to negotiate anything, but asking the questions will give you more information. Perhaps take your business to someone who will guarantee full cooperation in fixing your problems. If you actually move critical databases to the cloud, encrypt them. Yeah, I do know it’s expensive in processing power, but they’re outside the area you can 100% protect. So take the necessary step. Secure tunnels are your friend. Really. Don’t just open a hole in your firewall and let “trusted” servers in, because it is possible to masquerade as a trusted server. Create secure tunnels, and protect the keys. That’s it for now. The cloud has a lot of promise, but like everything else in mid hype cycle, you need to approach the soaring commentary with realistic expectations. Protect your data as if it is your personal charge, because it is. The cloud provider is not the one (or not the only one) who will be held accountable when things go awry. So use it to keep doing what you do – making your organization hum with daily business – and avoid the pitfalls where ever possible. In my next installment I’ll be trying out the new footer Lori is using, looking forward to your feedback. And yes, I did put nine in the title to test the “put an odd number list in, people love that” theory. I think y’all read my stuff because I’m hitting relatively close to the mark, but we’ll see now, won’t we?206Views0likes0CommentsLet me tell you Where To Go.
One thing in life, whether you are using a Garmin to go to a friend’s party or planning your career, you need to know where you’re going. Failure to have a destination in mind makes it very difficult to get directions. Even when you know where you’re going, you will have a terrible time getting there if your directions are bad. Take, for example, using a GPS to navigate between when they do major road construction and when you next update your GPS device’s maps. On a road by my house, I can actually drive down the road and be told that I’m on the highway 100 feet (30 meters) distant. Because I haven’t updated my device since they built this new road, it maps to the nearest one it can find going in the same direction. It is misinformed. And, much like the accuracy of a GPS, we take DNS for granted until it goes horribly wrong. Unfortunately, with both you can be completely lost in the wild before you figure out that something is wrong. The number of ways that DNS can go wrong is limited – it is a pretty simple system – but when it does, there is no way to get where you need to go. Just like when construction dead-ends a road. Like a road not too far from my house. Notice in the attached screenshot taken from Google Maps, how the satellite data doesn’t match the road data. The roads pictured by the satellite actually intersect. The ones pictured in roadway data do not. That is because they did intersect until about eight months ago. Now the roadway data is accurate, and one road has a roundabout, while the other passes over it. As you can plainly see, a GPS is going to tell you “go up here and turn right on road X”, when in reality it is not possible to do that any more. You don’t want your DNS doing the same thing. Really don’t. There are a couple of issues that could make your DNS either fail to respond or misdirect people. I’ll probably talk about them off-n-on over the next few months, because that’s where my head is at the moment, but we’ll discuss the two obvious ones today, just to keep this blog to blog length. First is failure to respond – either because it is overloaded, or down, or whatever. This one is easy to resolve with redundancy and load balancing. Add in Global Load Balancing, and you can distribute traffic between datacenters, internal clouds, external clouds, whatever, assuming you have the right gear for the job. But if you’re a single datacenter shop, simple redundancy is pretty straight-forward, and the only problem that might compel you to greater measures is a DDoS attack. While a risk, as a single datacenter shop, you’re not likely to attract the attention of crowds that want to participate in DDoS unless you’re in a very controversial market space. So make sure you have redundancy in DNS servers, and test them. Amazing the amount of backup/disaster recovery infrastructure that doesn’t have a regular, formalized testing plan. It does you no good to have it in place if it doesn’t work when you need it. The other is misdirection. The whole point of DNS cache poisoning is to allow someone to masquerade as you. wget can copy the look-n-feel of your website, cache poisoning (or some other as-yet-unutilized DNS vector) can redirect your users to the attacker. They typed in your name, they got a page that looks like your page, but any information they enter goes to someone else. Including passwords and credit card numbers. Scary stuff. So DNS SEC is pretty much required. It protects DNS against known attacks, and against a ton of unexplored vectors, by utilizing authorization and encryption. Yeah, that’s a horrible overstatement, but it works for a blog aimed at IT staff as opposed to DNS uber-specialists. So implement DNS SEC, but understand that it takes CPU cycles on DNS servers – security is never free – so if your DNS system is anywhere near capacity, it’s time to upgrade that 80286 to something with a little more zing. It is a tribute to DNS that many BIND servers are running on ancient hardware, because they can, but it doesn’t hurt any to refresh the hardware and get some more cycles out of DNS. In the real world, you would not use a GPS system that might send you to the wrong place (I shut mine down when in downtown Cincinnati because it is inaccurate, for example), and you wouldn’t use one that a crook could intercept the signal from and send you to a location of his choosing for a mugging rather than to your chosen destination… So don’t use a DNS that both of these things are possible for. Reports indicate that there are still many, many out of date DNS systems running out there. upgrade, implement DNS SEC, and implement redundancy (if you haven’t already, most DNS servers seem to be set up in pairs pretty well) or DNS load balancing. Let your customers know that you’re doing it more reliable and secure – for them. And worry about one less thing while you’re grilling out over the weekend. After all, while all of our systems rely on DNS, you have to admit it gets very little of our attention… Unless it breaks. Make yours more resilient, so you can continue to give it very little attention.214Views0likes0CommentsCloud vs Cloud
The Battle of the Clouds Aloha! Welcome ladies and gentleman to the face off of the decade. The Battle of the Clouds. In this corner, the up and comer, the phenom that has changed the way IT works, wearing the light shorts - The Cloud! And in this corner, your reigning champ, born and bred of Mother Nature with unstoppable power, wearing the dark trunks - Storm Clouds! You’ve either read about or lived through the massive storm that hit the Mid-Atlantic coast last week. And, by the way, if you are going through a loss, damage or worse, I do hope you can recover quickly and wish you the best. The weather took out power for millions including a Virginia ‘cloud’ datacenter which hosts a number of entertainment and social media sites. Many folks looking to get thru the candle-lit evenings were without their fix. While there has been confusion and growing pains over the years as to just what ‘cloud computing’ is, this instance highlights the fact that even The Cloud is still housed in a data center, with four walls, with power pulls, air conditioning, generators and many of the features we’ve become familiar with ever since the early days of the dot com boom (and bubble). They are physical structures, like our homes, that are susceptible to natural disasters among other things. Data centers have outages all the time but a single traditional data center outage might not get attention since it may only involve a couple companies – when a ‘cloud’ data center crashes, it could impact many companies and like last week, it grabbed headlines. Business continuity and disaster recovery are one of the main concerns for organizations since they rely on their system’s information to run their operations. Many companies use multiple data centers for DR and most cloud providers offer multiple cloud ‘locations’ as a service to protect against the occasional failure. But it is still a data center and most IT professionals have come to accept that a data center will have an outage – it’s just a question of how long and what impact or risk is introduced. In addition, you need the technology in place to be able to swing users to other resources when a outage occurs. A good number of companies don’t have a disaster recovery plan however, especially when backing up their virtual infrastructure in multiple locations. This can be understandable for a smaller start ups if backing up data means doubling their infrastructure (storage) costs but can be double disastrous for a large multi-national corporation. While most of the data center services have been restored and the various organizations are sifting through the ‘what went wrong’ documents, it is an important lesson in redundancy….or the risk of lack of. It might be an acceptable risk and a conscious decision since redundancy comes with a cost – dollars and complexity. A good read about this situation is Ben Coe’s My Friday Night With AWS. The Cloud has been promoting (and proven to some extent) it’s resilience, DR capabilities and it’s ability to technologically recover quickly yet Storm Clouds have proven time and again, that it’s power is unmatched…especially when you need power to turn on a data center. ps Resources Virginia Storm Knocks Out Popular Websites Millions without power as heat wave hammers eastern US Amazon Power Outage Exposes Risks Of Cloud Computing My Friday Night With AWS Modern life halted as Netflix, Pinterest, Instagram go down Storm Blamed for Instagram, Netflix, and Foursquare Outages (Real) Storm Crushes Amazon Cloud, Knocks out Netflix, Pinterest, Instagram237Views0likes0CommentsWILS: Virtualization, Clustering, and Disaster Recovery
#virtualization Clustering is local. Disaster recovery is global. There are two levels of reliability for an application. There’s local and there’s global. We might want to consider it more simply as “inside” and “outside” reliability. Virtualization enables local reliability – the inside kind of reliability. Whether you’re relying upon clustering or load balancing (each has advantages and disadvantages, but for purposes of reliability and this discussion we’ll assume equal capabilities) to provide the abstraction isn’t as important as recognizing that in terms of reliability you’re acting at the local, i.e. inside, level. A cluster or pool, in load balancing parlance, is able to maintain local reliability by distributing load across multiple instances of the application. We can transparently add or remove instances to achieve the elasticity necessary to meet demand, thus ensuring reliability. In the event of a local disaster, such as the failure of a virtual machine, we can take the failed instance out of the rotation and even provision another to replace it. What clustering (load balancing) can’t do is address global reliability, i.e. outside reliability. Global reliability must be addressed using a different technology, normally referred to as Global Server Load Balancing (GLSB). The terminology grew out of the days when global reliability was achieved by load balancing individual servers across the globe to ensure a failure in the network or at a specific location could not interrupt the service. As demand grew, GSLB performed the same functions, but did so at a site level, essentially load balancing sites instead of individual servers. The name remains, however confusing that may be to the uninitiated. To achieve global reliability you need GSLB. To avoid the detrimental effects of a disaster in the network or at the site level, you must be able to direct users to an active location. This is realized in most implementations through simple DNS load balancing techniques; i.e. when a user makes a request the GSLB service responds with the IP address of an appropriate, active site. GLSB is capable of much more complex decision making, however, and decisions can be based on a variety of business and operational parameters, at the discretion of the organization. The GSLB service monitors each of the local sites, and is able to detect an outage within seconds and begin directing users elsewhere. At the local level, clustering and load balancing also monitor the “health” of individual instances and can react similarly in the event of a failure, but do so only at the local level. If the site fails, as might be the case in the event of a disaster, the local service is unable to do anything about it. It can’t redirect globally, it can’t notify other components. It’s just gone. For disaster recovery purposes, this is important stuff. When cloud first drifted onto the scene is was postulated that the cheaper compute would make implementing secondary data centers specifically for disaster recovery purposes more financially feasible for a wider variety of organizations. While that’s true in the sense that it’s way cheaper than building a secondary data center, many of the technological foundations remain the same: GSLB and a replicated environment. Some folks balk at the replication and point to transparent migration as a solution. After all, why pay even pennies on the hour instances that may never be put into commission? The problem is that transparent migration of virtual machines is only useful while the VMs are live and running. If they aren’t, such as might be the case in the event of a disaster, the site can’t be replicated and global reliability fails. A cluster-to-cluster failover via a bridged network to the cloud might sound like a good idea, but it isn’t practical when applied to a disaster recovery scenario. Too much depends on the availability of the site, of the network, and of the clustering/load balancing mechanism itself. If any one of the components has failed, global reliability is unrealizable. To achieve true global reliability regardless of the involvement of cloud computing , you’re going to need to implement a good old-fashioned GSLB architecture, complete with the network components and replicated application infrastructure. Local reliability (inside) may be achievable with virtual clustering solutions, but global reliability requires a very different architecture and set of technologies. Disaster recovery strategies cannot rely on local reliability, they must be based on global reliability. WILS: Write It Like Seth. Seth Godin always gets his point across with brevity and wit. WILS is an ATTEMPT TO BE concise about application delivery TOPICS AND just get straight to the point. NO DILLY DALLYING AROUND. Back to Basics: Load balancing Virtualized Applications The Cost of Ignoring ‘Non-Human’ Visitors Cloud Bursting: Gateway Drug for Hybrid Cloud The HTTP 2.0 War has Just Begun Why Layer 7 Load Balancing Doesn’t Suck Network versus Application Layer Prioritization WILS: The Many Faces of TCP WILS: WPO versus FEO226Views0likes0CommentsF5 Friday: Cookie Cutter vApps Realized
An architectural solution to the challenge of IP-address dependency. A rarely mentioned obstacle when attempting to duplicate or migrate enterprise-class applications is IP-dependency. Not just topological dependencies that are easily addressed with dynamic routing and switching protocols in conjunction with a boot script, but internal dependencies – the ones so deeply embedded in the application’s “identity” that to change the IP address is to break the installation and render it useless. These are the applications that, upon asking for an exported image for testing purposes, virtualization experts will tell you is far more efficient to start from scratch, because the IP dependency issue will cause more trouble in the long term than simply starting over. Moving such an application to a public cloud is nearly impossible due to this restriction, and any bursting or data center extension model is out of the question. This is also a problem locally, when attempting to build out a private cloud and IT services. particularly in production environments in which a multi-tenant model is employed by launching multiple instances of the same application with each designated for use by a specific logical group, i.e. a department, project, or business unit. Ultimately what we want is the ability to create cookie cutter applications as a foundational element for IT as a Service. This requires network, security, and application policies – as well as the application – be encapsulated as templates, associated with the application, and applied on a per instance. This ultimately enables application instance sizing and chargeback per logical group, and lays the foundation for push-button IT services in which a department can be one click away from an automated deployment of an application. What’s standing in the way in many cases is the IP address dependency. Applications can’t be packaged up neatly into a holistic service along with its requisite network, security, and delivery policies because all of these services are tightly bound to the IP address of the application – and vice-versa. When an application is deployed if it is reassigned a new IP address, every policy will also need to be updated, making the process not only lengthy but fraught with potential for misconfiguration due to stalls or human error. The dependency on IP addresses within these applications is not going away. To achieve the goal of a more mobile and service-focused data center then, we need is a way to work around the problem. Many see VMware vApps as the solution. But while vApps were designed with mobility and portability in mind, it does not address the IP address dependency obstacle. A solution to this seemingly unsolvable problem can be found in a collaborative architecture incorporating both global and local application delivery services. A COLLABORATIVE F5–VMWARE ARCHITECTURAL SOLUTION To avoid complexity in multi-DC topologies (and ultimately inter-cloud deployments), it is necessary to reduce the need for coordination between different teams by abstracting network addressing, rules and service names. Bridging networks is not enough – an application and protocol specific approach is needed. VLAN stretching approaches do not differentiate traffic ingress and egress for each datacenter. This means that application traffic can enter one datacenter, traverse the bridged network to the application in the other datacenter, and then return following the same path. As the distance between data centers increases (as is desired for disaster recovery purposes), this “trombone routing” incurs heavy performance penalties due to latency. What we want is not single valid addresses for applications across datacenters, but rather, portable addresses which can then be selected by a global abstraction based on best-path and best-performance for a given client in the context of their locality and the available resources in each datacenter. Such an architecture is made possible by a rarely mentioned but very powerful feature of BIG-IP systems: route domains. Route domains give you the ability to segment (isolate) network traffic for different applications on the network. The BIG-IP system can process traffic for each application within its own route domain. Because route domains segment network traffic they can also be used to assign the same IP address or subnet to more than one node on a network. Two nodes on the network can have the same IP address as long as each instance of the IP address resides in a separate routing domain. The ability to essentially duplicate IP address space in the same environment opens up the ability to create cookie cutter vApps complete with the appropriate network, security, and delivery policies required – an isolated operationally consistent deployment. The problem then becomes ensuring that the right users are routed to the right application instance at the right time. Using a phased implementation, IT organizations can resolve the issues that prevent the repeatable deployment of enterprise applications locally and globally. PHASE 1 The focus of phase 1 is the elimination of re-addressing applications at the IP layer in multi-site deployments. This phase relies on BIG-IP Local Traffic Manager (LTM) and in particular route domains to allow the co-existence of architectures utilizing the same IP address space, and BIG-IP Global Traffic Manager (GTM) to determine which site is currently in use as the primary data center. In an active-standby deployment, this provides site-resilience by ensuring a secondary site is available to assume responsibility for delivering applications in the event of an outage at the primary site. In an active-active deployment, BIG-GTM leverages context shared by the local application delivery controller, BIG-IP LTM, to ensure better performance and availability without sacrificing fault tolerance. This deployment pattern is based on existing, proven global architectures providing site-resilience and location-based global load balancing. PHASE 2 This phase also relies on BIG-IP Local Traffic Manager (LTM) and route domains to allow the co-existence of architectures utilizing the same IP address space. Context-awareness is leveraged as a means to properly route users to their designated application deployment. The context can be extracted from the URI or from other variables associated with the user, such as credentials or cookies. Multiple instances of the application architecture can be launched and co-exist within the data center, each serving a particular logical group. Each group can size applications based on usage needs, and chargeback per department becomes a less complex accounting process as it is based on the instance and its supporting architectural components. Application architectures can be successfully repeated at the logical group level, enabling a smoother transition to IT as a Service and preserving the IP-address dependencies on which many applications rely. ARCHITECTURE is KEY As is increasingly the case, the solution to many of the challenges arising from multi-site, cloud computing , and highly virtualized data centers is architectural. Because the challenges often span data center domains – security, networking, storage, compute, and applications – the solution requires cross-domain collaboration, not just of teams but of infrastructure. Cloud computing really is an exercise in infrastructure integration. By leveraging the strengths and capabilities of various data center components across various domains, solutions can be architected to address even the seemingly unsolvable problems that will continue to frustrate IT as it moves toward a more distributed and highly dynamic data center.240Views0likes0CommentsWhen The Walls Come Tumbling Down.
When horrid disasters strike and both people and corporations are put on notice that they suddenly have a lot more important things to do, will you be ready? It is a testament to man’s optimism that with very few exceptions we really don’t, not at the personal level, not at the corporate level. I’ve worked a lot of places, and none of them had a complete, ready to rock DR plan. The insurance company I worked at was the closest – they had an entire duplicate datacenter sitting dark in a location very remote from HQ, awaiting need. Every few years they would refresh it to make certain that the standby DC had the correct equipment to take over, but they counted on relocating staff from what would be a ravaged area in the event of a catastrophe, and were going to restore thousands of systems from backups before the remote DC could start running. At the time it was a good plan. Today it sounds quaint. And it wasn’t that long ago. There are also a lot of you who have yet to launch a cloud initiative of any kind. This is not from lack of interest, but more because you have important things to do that are taking up your time. Most organizations are dragging their feet replacing people, and few – according to a recent survey, very few – are looking to add headcount (proud plug that F5 is – check out our careers page if you’re looking). It’s tough to run off and try new things when you can barely keep up with the day-to-day workloads. Some organizations are lucky enough to have R&D time set aside. I’ve worked at a couple of those too, and honestly, they’re better about making use of technology than those who do not have such policies. Though we could debate if they’re better because they take the time, or take the time because they’re better. And the combination of these two items brings us to a possible pilot project. You want to be able to keep your organization online or be able to bring it back online quickly in the event of an emergency. Technology is making it easier and easier to complete this arrangement without investing in an entire datacenter and constantly refreshing the hardware to have quick recovery times. Global DNS in various forms is available to redirect users from the disabled datacenter to a datacenter that is still capable of handling the load, if you don’t have multiple datacenters, then it can redirect elsewhere – like to virtual servers running in the cloud. ADCs are starting to be able to work similarly whether they are cloud deployed or DC deployed, that leaves keeping a copy of your necessary data and applications in the cloud, and cloud storage with a cloud storage gateway such as the Cloud Extender functionality in our ARX product allow for this to be done with a minimum of muss and fuss. These technologies, used together, yield a DR architecture that looks something like this: Notice that the cloud extender isn’t listed here, because it is useful for getting the data copied, but would most likely reside in your damaged datacenter. Assuming that the cloud provider was one like our partner Rackspace, who does both cloud VMs and cloud storage, this architecture is completely viable. You’ll still have to work some things out, like guaranteeing that security in the cloud is acceptable, but we’re talking about an emergency DR architecture here, not a long-running solution, so app-level security and functionality to block malicious attacks at the ADC layer will cover most of what you need. AND it’s a cloud project. The cost is far, far lower than a full blown DR project, and you’ll be prepared in case you need it. This buys you time to ingest the fact that your datacenter has been wiped out. I’ve lived through it, there is so much that must be done immediately – finding a new location, dealing with insurance, digging up purchase documentation, recovering what can be recovered… Having a plan like this one in place is worth your while. Seriously. It’s a strangely emotional time, and having a plan is a huge help in keeping people focused. Simply put, disasters come, often without warning – mine was a flood caused by a broken pipe. We found out when our monitoring equipment fried from being soaked and sent out a raft of bogus messages. The monitoring equipment was six feet above the floor at the time. You can’t plan for everything, but to steal and twist a famous phrase, “he who plans for nothing protects nothing.”199Views0likes0CommentsMission Impossible: Stateful Cloud Failover
The quest for truly stateful failover continues… Lightning was the latest cause of an outage at Amazon, this time in its European zones. Lightning, like tornadoes, volcanoes, and hurricanes are often categorized as “Acts of God” and therefore beyond the sphere of control of, well, anyone other than God. Outages or damages caused by such are rarely reimbursable and it’s very hard to blame an organization for not having a “plan” to react to the loss of both primary and secondary power supplies due to intense lightning strikes. The odds of a lightning strike are pretty high in the first place – 576000 to 1 – and though the results can be disastrous, such risk is often categorized as low enough to not warrant a specific “plan” to redress. What’s interesting about the analysis of the outage is the focus on what is, essentially, stateful failover capabilities. The Holy Grail of disaster recovery is to design a set of physically disparate systems in which the secondary system, in the event the primary fails, immediately takes over with no interruption in service or loss of data. Yes, you read that right: it’s a zero-tolerance policy with respect to service and data loss. And we’ve never, ever achieved it. Not that we haven’t tried, mind you, as Charles Babcock points out in his analysis of the outage: Some companies now employ a form of disaster recovery that stores a duplicate set of virtual machines at a separate site; they're started up in the event of failure at the primary site. But Kodukula said such a process takes several minutes to get systems started at an alternative site. It also results in loss of several minutes worth of data. Another alternative is to set up a data replication system to feed real-time data into the second site. If systems are kept running continuously, they can pick up the work of the failed systems with a minimum of data loss, he said. But companies need to employ their coordination expertise to make such a system work, and some data may still be lost. -- Amazon Cloud Outage: What Can Be Learned? (Charles Babcock, InformationWeek, August 2011) Disaster recovery plans are designed and implemented with the intention of minimizing loss. Practitioners are well aware that a zero-tolerance policy toward data loss for disaster recovery architectures is unrealistic. That is in part due to the “weakest link” theory that says a system’s is only as good as its weakest component’s . No application or network component can perform absolutely zero-tolerance failover, a.k.a. stateful failover, on its own. There is always the possibility that some few connections, transactions or sessions will be lost when a system fails over to a secondary system. Consider it an immutable axiom of computer science that distributed systems can never be data-level consistent. Period. If you think about it, you’ll see why we can deduce, then, that we’ll likely never see stateful failover of network devices. Hint: it’s because ultimately all state, even in network components, is stored in some form of “database” whether structured, unstructured, or table-based and distributed systems can never be data-level consistent. And if even a single connection|transaction|session is lost from a component’s table, it’s not stateful, because stateful implies zero-tolerance for loss. WHY ZERO-TOLERANCE LOSS is IMPOSSIBLE Now consider what we’re trying to do in a failover situation. Generally speaking we’re talking about component-level failure which, in theory and practice, is much easier than a full-scale architectural failover scenario. One device fails, the secondary takes over. As long as data has been synchronized between the two, we should theoretically be able to achieve stateful failover, right? Except we can’t. One of the realities with respect to high availability architectures is that synchronization is not continuous and neither is heartbeat monitoring (the mechanism by which redundant pairs periodically check to ensure the primary is still active). These processes occur on a period interval as defined by operational requirements, but are generally in the 3-5 second range. Assuming a connection from a client is made at point A, and the primary component fails at point A+1 second, it is unlikely that its session data will be replicated to the secondary before point A+3 seconds, at which time the secondary determines the primary has failed and takes over operation. This “miss” results in data loss. A minute amount, most likely, but it’s still data loss. Basically the axiom that zero-tolerance loss is impossible is a manifestation in the network infrastructure of Brewer’s CAP theorem at work which says you cannot simultaneously have Consistency, Availability and Partition Tolerance. This is evident before we even consider the slight delay that occurs on the network despite the use of gratuitous ARP to ensure a smooth(er) transition between units in the event of a failover, during which time the service may be (rightfully) perceived as unavailable. But we don’t need to complicate things any more than they already are, methinks. What is additionally frustrating, perhaps, is that the data loss could potentially be caused by a component other than the one that fails. That is, because of the interconnected and therefore interdependent nature of the network, a sort of cascading effect can occur in the event of a failover based on the topological design of the systems. It’s the architecture, silly, that determines whether data loss will be localized to the failing components or cascade throughout the entire architecture. High availability architectures based on a parallel data path design are subject to higher data loss throughout the network than are those based on cross-connected data path designs. Certainly the latter is more complicated and harder to manage, but it’s less prone to data loss cascade throughout the infrastructure. Now, add in the possibility that cloud-based disaster recovery systems necessarily leverage a network connection instead of a point-to-point serial connection. Network latency can lengthen the process and, if the failure is in the network connection itself, is obviously going to negatively impact the amount of data lost because synchronization cannot occur at all for that period of time when the failed primary is still “active” and the secondary realizes there’s a problem, Houston, and takes over responsibility. Now take these potential hiccups and multiply them by every redundant component in the system you are trying to assure availability for, and remember to take into consideration that to fail over to a second “site” requires not only data (as in database) replication but also state data replication across the entire infrastructure. Clearly, this is an unpossible task. A truly stateful cloud-based failover might be able to occur if the stars were aligned just right and the chickens sacrificed and the no-fail dance performed to do so. And even then, I’d bet against it happening. The replication of state residing in infrastructure components, regardless of how necessary they may be, is almost never, ever attempted. The reality is that we have to count on some data loss and the best strategy we can have is to minimize the loss in part by minimizing the data that must be replicated and components that must be failed over. Or is it? F5 Friday: Elastic Applications are Enabled by Dynamic Infrastructure The Database Tier is Not Elastic Greedy (IT) Algorithms The Impossibility of CAP and Cloud Brewer’s CAP Theorem Joe Weinman – Cloud Computing is NP-Complete Proof Cloud Computing Goes Back to College227Views0likes0Comments