disaster recovery
Multiple Certs, One VIP: TLS Server Name Indication via iRules
An age old question that we’ve seen time and time again in the iRules forums here on DevCentral is “How can I use iRules to manage multiple SSL certs on one VIP?”. The answer has always historically been “I’m sorry, you can’t.” The reasoning is sound. One VIP, one cert, that’s how it’s always been. You can’t do anything with the connection until the handshake is established and decryption is done on the LTM. We’d like to help, but we just really can’t. That is… until now.

The TLS protocol has somewhat recently provided the ability to pass a “desired servername” as a value in the originating SSL handshake. Finally we have what we’ve been looking for: a way to add contextual server info during the handshake, thereby allowing us to say “cert x is for domain x” and “cert y is for domain y”. Known to us mortals as "Server Name Indication" or SNI (hence the title), this functionality is paramount for a device like the LTM that can regularly benefit from hosting multiple certs on a single IP. We should be able to pull out this information and choose an appropriate SSL profile now, with a cert that corresponds to the servername value that was sent. Now all we need is some logic to make this happen.

Lucky for us, one of the many bright minds in the DevCentral community has whipped up an iRule to show how you can finally tackle this challenge head on. Because Joel Moses, the shrewd mind and DevCentral MVP behind this example, has already done a solid write-up, I’ll quote liberally from his fine work and add some additional context where fitting. Now on to the geekery:

First things first, you’ll need to create a mapping of which servernames correlate to which certs (client SSL profiles in LTM’s case). This could be done in any manner, really, but the most efficient both from a resource and management perspective is to use a class. Classes, also known as DataGroups, are name->value pairs that will allow you to easily retrieve the data later in the iRule. Quoting Joel:

Create a string-type datagroup to be called "tls_servername". Each hostname that needs to be supported on the VIP must be input along with its matching clientssl profile. For example, for the site "testsite.site.com" with a ClientSSL profile named "clientssl_testsite", you should add the following values to the datagroup.

String: testsite.site.com
Value: clientssl_testsite

Once you’ve finished inputting the different server->profile pairs, you’re ready to move on to pools. It’s very likely that since you’re now managing multiple domains on this VIP you'll also want to be able to handle multiple pools to match those domains. To do that you'll need a second mapping that ties each servername to the desired pool. This could again be done in any format you like, but since it's the most efficient option and we're already using it, classes make the most sense here. Quoting from Joel:

If you wish to switch pool context at the time the servername is detected in TLS, then you need to create a string-type datagroup called "tls_servername_pool". You will input each hostname to be supported by the VIP and the pool to direct the traffic towards. For the site "testsite.site.com" to be directed to the pool "testsite_pool_80", add the following to the datagroup:

String: testsite.site.com
Value: testsite_pool_80

If you don't, that's fine, but realize all traffic from each of these hosts will be routed to the default pool, which is very likely not what you want.
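Before diving into the full rule, here is a minimal sketch (mine, not Joel's) of how those two datagroups get consumed from an iRule, just to show the lookup pattern in isolation. The servername is hard-coded here purely for illustration; the real rule below parses it out of the TLS Client Hello:

    when CLIENT_ACCEPTED {
        # Illustration only: resolve a servername against the two datagroups
        # defined above. In the full iRule the name comes from the handshake.
        set sni_name [string tolower "testsite.site.com"]
        set ssl_profile [class match -value $sni_name equals tls_servername]
        set tls_pool [class match -value $sni_name equals tls_servername_pool]
        log local0. "servername $sni_name -> profile: $ssl_profile, pool: $tls_pool"
    }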
Now then, we have two classes set up to manage the mappings of servername->SSLprofile and servername->pool; all we need is some app logic in line to do the management and provide each inbound request with the appropriate profile & cert. This is done, of course, via iRules. Joel has written up one heck of an iRule, which is available in the codeshare (here) in its entirety along with his solid write-up, but I'll also include it here in-line, as is my habit. Effectively what's happening is the iRule is parsing through the data sent throughout the SSL handshake process and searching for the specific TLS servername extension, which holds the bits that will allow us to do the profile switching magic. He's written it up to fall back to the default client SSL profile and pool, so it's very important that both of these things exist on your VIP, or you may likely find yourself with unhappy users.

One last caveat before the code: Not all browsers support Server Name Indication, so be careful not to implement this unless you are very confident that most, if not all, users connecting to this VIP will support SNI. For more info on testing for SNI compatibility and a list of browsers that do and don't support it, click through to Joel's awesome CodeShare entry; I've already plagiarized enough.

So finally, the code. Again, my hat is off to Joel Moses for this outstanding example of the power of iRules. Keep at it Joel, and thanks for sharing!

when CLIENT_ACCEPTED {
    if { [PROFILE::exists clientssl] } {

        # We have a clientssl profile attached to this VIP but we need
        # to find an SNI record in the client handshake. To do so, we'll
        # disable SSL processing and collect the initial TCP payload.

        set default_tls_pool [LB::server pool]
        set detect_handshake 1
        SSL::disable
        TCP::collect

    } else {

        # No clientssl profile means we're not going to work.

        log local0. "This iRule is applied to a VS that has no clientssl profile."
        set detect_handshake 0

    }
}

when CLIENT_DATA {

    if { ($detect_handshake) } {

        # If we're in a handshake detection, look for an SSL/TLS header.

        binary scan [TCP::payload] cSS tls_xacttype tls_version tls_recordlen

        # TLS is the only thing we want to process because it's the only
        # version that allows the servername extension to be present. When we
        # find a supported TLS version, we'll check to make sure we're getting
        # only a Client Hello transaction -- those are the only ones we can pull
        # the servername from prior to connection establishment.

        switch $tls_version {
            "769" -
            "770" -
            "771" {
                if { ($tls_xacttype == 22) } {
                    binary scan [TCP::payload] @5c tls_action
                    if { not (($tls_action == 1) && ([TCP::payload length] > $tls_recordlen)) } {
                        set detect_handshake 0
                    }
                }
            }
            default {
                set detect_handshake 0
            }
        }

        if { ($detect_handshake) } {

            # If we made it this far, we're still processing a TLS client hello.
            #
            # Skip the TLS header (43 bytes in) and process the record body. For TLS/1.0 we
            # expect this to contain only the session ID, cipher list, and compression
            # list. All but the cipher list will be null since we're handling a new transaction
            # (client hello) here. We have to determine how far out to parse the initial record
            # so we can find the TLS extensions if they exist.
            set record_offset 43
            binary scan [TCP::payload] @${record_offset}c tls_sessidlen
            set record_offset [expr {$record_offset + 1 + $tls_sessidlen}]
            binary scan [TCP::payload] @${record_offset}S tls_ciphlen
            set record_offset [expr {$record_offset + 2 + $tls_ciphlen}]
            binary scan [TCP::payload] @${record_offset}c tls_complen
            set record_offset [expr {$record_offset + 1 + $tls_complen}]

            # If we're in TLS and we've not parsed all the payload in the record
            # at this point, then we have TLS extensions to process. We will detect
            # the TLS extension package and parse each record individually.

            if { ([TCP::payload length] >= $record_offset) } {
                binary scan [TCP::payload] @${record_offset}S tls_extenlen
                set record_offset [expr {$record_offset + 2}]
                binary scan [TCP::payload] @${record_offset}a* tls_extensions

                # Loop through the TLS extension data looking for a type 00 extension
                # record. This is the IANA code for server_name in the TLS transaction.

                for { set x 0 } { $x < $tls_extenlen } { incr x 4 } {
                    set start [expr {$x}]
                    binary scan $tls_extensions @${start}SS etype elen
                    if { ($etype == "00") } {

                        # A servername record is present. Pull this value out of the packet data
                        # and save it for later use. We start 9 bytes into the record to bypass
                        # type, length, and SNI encoding header (which is itself 5 bytes long), and
                        # capture the servername text (minus the header).

                        set grabstart [expr {$start + 9}]
                        set grabend [expr {$elen - 5}]
                        binary scan $tls_extensions @${grabstart}A${grabend} tls_servername
                        set start [expr {$start + $elen}]
                    } else {

                        # Bypass all other TLS extensions.

                        set start [expr {$start + $elen}]
                    }
                    set x $start
                }

                # Check to see whether we got a servername indication from TLS. If so,
                # make the appropriate changes.

                if { ([info exists tls_servername] ) } {

                    # Look for a matching servername in the Data Group and pool.

                    set ssl_profile [class match -value [string tolower $tls_servername] equals tls_servername]
                    set tls_pool [class match -value [string tolower $tls_servername] equals tls_servername_pool]

                    if { $ssl_profile == "" } {

                        # No match, so we allow this to fall through to the "default"
                        # clientssl profile.

                        SSL::enable
                    } else {

                        # A match was found in the Data Group, so we will change the SSL
                        # profile to the one we found. Hide this activity from the iRules
                        # parser.

                        set ssl_profile_enable "SSL::profile $ssl_profile"
                        catch { eval $ssl_profile_enable }
                        if { not ($tls_pool == "") } {
                            pool $tls_pool
                        } else {
                            pool $default_tls_pool
                        }
                        SSL::enable
                    }
                } else {

                    # No match because no SNI field was present. Fall through to the
                    # "default" SSL profile.

                    SSL::enable
                }

            } else {

                # We're not in a handshake. Keep on using the currently set SSL profile
                # for this transaction.

                SSL::enable
            }

            # Hold down any further processing and release the TCP session further
            # down the event loop.

            set detect_handshake 0
            TCP::release
        } else {

            # We've not been able to match an SNI field to an SSL profile. We will
            # fall back to the "default" SSL profile selected (this might lead to
            # certificate validation errors on non SNI-capable browsers).
            set detect_handshake 0
            SSL::enable
            TCP::release

        }
    }
}
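One optional, hedged addition while you are testing: a temporary log statement in the branch where a match is found (just before that SSL::enable) makes it easy to confirm in /var/log/ltm which profile and pool the rule actually selected. It only uses variables the iRule above already sets, and should be removed once you are happy with the behavior:

    # Temporary troubleshooting aid -- remove for production.
    log local0. "SNI match: [string tolower $tls_servername] -> profile: $ssl_profile, pool: $tls_pool"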
Quick! The Data Center Just Burned Down, What Do You Do?

You get the call at 2am. The data center is on fire, and while the server room itself was protected with your high-tech fire-fighting gear, the rest of the building billowed out smoke and noxious gasses that have contaminated your servers. Unless you have a sealed server room, this is a very real possibility. Another possibility is that the fire department had to spew a ton of liquid on your building to keep the fire from spreading. No sealed room means your servers might have taken a bath. And sealed rooms are a real rarity in datacenter design for a whole host of reasons, starting with cost.

So you turn to your DR plan, and step one is to make certain the load was shifted to an alternate location. That will buy you time to assess the damage. Little do you know that while a good start, that’s probably not enough of a plan to get you back to normal quickly.

It still makes me wonder, when you talk to people about disaster recovery, how different IT shops have different views of what’s necessary to recover from a disaster. The reason it makes me wonder is because few of them actually have a Disaster Recovery Plan. They have a “Pain Alleviation Plan”. This may be sufficient, depending upon the nature of your organization, but it may not be. You are going to need buildings, servers, infrastructure, and the knowledge to put everything back together – even that system that ran for ten years after the team that implemented it moved on to a new job. Because it wouldn’t still be running on Netware/Windows NT/OS2 if it wasn’t critical and expensive to replace. If you’re like most of us, you moved that system to a VM if at all possible years ago, but you’ll still have to get it plugged into a network it can work on, and your wires? They’re all suspect. The plan to restore your ADS can be painful in and of itself, let alone applying the different security settings to things like NAS and SAN devices, since they have different settings for different LUNs or even folders and files.

The massive amount of planning required to truly restore normal function of your systems is daunting to most organizations, and there are some question marks that just can’t be answered today for a disaster that might happen in a year or even ten – hopefully never, but we do disaster planning so that we’re prepared if it does, so “never” isn’t a good outlook while planning for the worst.

While still at Network Computing, I looked at some great DR plans ranging from “send us VMs and we’ll ship you servers ready to rock the same day your disaster happens” to “We’ll drive a truck full of servers to your location and you can load them up with whatever you need and use our satellite connection to connect to the world”. Problem is that both of these require money from you every month while providing benefit only if you actually have a disaster. Insurance is a good thing, but increasing IT overhead is risky business. When budget time comes, the temptation to stop paying each month for something not immediately forwarding business needs is palpable.

And both of those solutions miss the ever-growing infrastructure part. Could you replace your BIG-IPs (or other ADC gear) tomorrow? You could get new ones from F5 pretty quickly, but do you have their configurations backed up so you can restore? How about the dozens of other network devices, NAS and SAN boxes, network architecture? Yeah, it’s going to be a lot of work. But it is manageable.
There is going to be a huge time investment, but it’s disaster recovery, the time investment is in response to an emergency. Even so, adequate planning can cut down the time you have to invest to return to business-as-usual. Sometimes by huge amounts. Not having a plan is akin to setting the price for a product before you know what it costs to produce – you’ll regret it.

What do you need? Well if you’re lucky, you have more than one datacenter, and all you need to do is slightly oversize them to make sure you can pick up the slack if one goes down. If you’re not one of the lucky organizations, you’ll need a plan for getting a building with sufficient power, internet capability, and space, replace everything from power connections to racks to SAN and NAS boxes, restorable backups (seriously, test your backups or replication targets. There are horror stories…), and time for your staff to turn all of these raw elements into a functional datacenter. It’s a tall order, you need backups of the configs of all appliances and information from all of your vendors about replacement timelines. But should you ever need this plan, it is far better to have done some research than to wake up in the middle of the night and then, while you are down, spend time figuring it all out. The toughest bit is keeping it up to date, because a project to implement a DR plan is a discrete project, but updating costs for space and lists of vendors and gear on a regular basis is more drudgery and outside of project timelines. But it’s worth the effort as insurance. And if your timeline is critical, look into one of those semi trailers – or the new thing (since 2005 or 2007 at least), containerized data centers - because when you need them, you need them. If you can’t afford to be down for more than a day or two, they’re a good stopgap while you rebuild.

SecurityProcedure.com has an aggregated list of free DR plans online. I’ve looked at a couple of the plans they list, they’re not horrible, but make certain you customize them to your organization’s needs. No generic plan is complete for your needs, so make certain you cover all of your bases if you use one of these. The key is to have a plan that dissects all the needs post-disaster. I’ve been through a disaster (The Great NWC Lab Flood), and there are always surprises, but having a plan to minimize them is a first step to maintaining your sanity and restoring your datacenter to full function.

In the future – the not-too-distant future – you will likely have the cloud as a backup, assuming that you have a product like our GTM to enable cloud-bursting, and that Global Load Balancer isn’t taken out by the fire. But even if it is, replacing one device to get your entire datacenter emulated in the cloud would not be anywhere near as painful as the rush to reassemble physical equipment.

[Image: Marketing image of an IBM/APC container]

Lori and I? No, we have backups and insurance and that’s about it. But though our network is complex, we don’t have any businesses hosted on it, so this is perfectly acceptable for our needs. No containerized data centers for us. Let’s hope we, and you, never need any of this.
Configuring a multi-server Testing Environment with VMWare Teams and BIG-IP LTM VE

Having LTM-VE is nice, but setting things up so that you can actually use it for something is a different matter. There is a lot of information floating around out there about configuration ranging from Joe's Tech Tip to Lori's Architecture Reference to deployment guides. This article focuses on setting up a VMWare team that includes your BIG-IP and some test servers. Since I wanted this to be something you can re-create, all of the servers utilized in this Tech Tip are clones of a straight-forward CENT-OS install. I toyed with a large number of OSS operating systems, and decided of all the OSes I installed, this was the one most suited to enterprise use, so I cloned it a couple of times and changed the IP configuration. All of this will be documented below; the goal is to set you up so you can easily copy this environment. If for some reason you cannot, drop me an email and we can arrange for some way to get the files to you.

One thing to remember throughout an install of LTM-VE is that in the end it is just a BIG-IP. Sure, it's in an odd environment that offers some challenges of its own, but from an administration perspective it is just another BIG-IP. In terms of throughput and configuration it is different because of the environment, but in terms of day-to-day administration, it has nodes, pools, virtuals, iRules, etc. that are all accessed and managed in the same manner as you would if this was a physical BIG-IP.

I placed my BIG-IP and the three server instances into a single VMWare team, meaning I have single-button turn-on/turn-off for the entire test environment. For my purposes, my desktop (the VMWare host) is sufficient for representing clients, but I'll point out how and where you would add clients below. And last caveat, this entire Tech Tip is done in VMWare Workstation. If you are running one of the server versions, some dialogs/etc will be slightly different.

First off, download and open the BIG-IP vmx file. A link to the vmx can be found on the DevCentral VE page. Once you have LTM-VE opened, you're ready to start your team-building exercise. Download CENT-OS from the link above, or some other OS that you are familiar with. Windows works just fine, I only excluded it from this solution because I can't be offering to give you copies of an OS that charges per-copy, and I did say if you had problems to drop me a line. You'll need to install CENT-OS into a new VM, but VMWare makes that so easy that even if you've never done it you should have no problems. Just choose New|VM and tell it where to find the ISO images you downloaded.

Once the BIG-IP is loaded - and you can start it and play with it if you like - the default management address will be in the output of an ifconfig on the BIG-IP command line as eth0-mgmt. From your host OS's web browser you can navigate to that address and start licensing and configuration as explained in the release notes.

Next you'll want to create the team though. From the VMWare menu, select File|New|Team like this... Give it a name and check the directory it will save to... Then add VMs to the team. Choosing "existing VM" will move a VM into the team, choosing "Clone of a VM" will copy a VM into the team. I basically configured one copy of CENT-OS and then added clones of that one to the team. I used the same process with BIG-IP LTM VE. Finally, add network segments to the team if you are going to model an existing network with your team.
In the case that we're exploring here, I did not add any network segments to the team, but rather relied upon VMWare's DHCP capability to assign me networks that I then utilized when configuring the servers to use static IPs (as they would in most environments). So the first thing you need to do after you’ve built your team is set your servers to use static IPs. They will default to DHCP, but that’s not the greatest solution for a load-balancing environment, since any change to the IP addresses of these servers will have an impact on the BIG-IP’s ability to load-balance. Note, if you are terribly lazy you probably don’t have to give them static IPs, because you control what goes in and out of the network; if you don’t add any servers to a segment, you should get the same IPs each time you boot. But boy, I just cannot recommend assuming a given server will always get the same address, that’s asking for trouble.

One last thing to check: go into the Connections tab of the Team Settings dialog. We started with each of our three servers set to NAT, and eventually had to switch them to Host Only during config. If you can’t ping across the VLANs (there should be an implied route between them), this is something worth checking.

And that covers creating the team. In short, the benefits of a team in our case are these:

- Single button power on/off
- Contained environment with everything you need to test in one VM
- The ability to add network segments to emulate your production environment

As I mentioned above, I put nothing but Virtuals directly on the external network because I planned on testing with just my host OS, but you may well want to add clients to the external interface.

Next, you’ll have to configure the BIG-IP. You need to create an internal VLAN with the internal network (the one the servers are on) in it, and an external network that faces the outside world. The BIG-IP Management interface on first boot will be DHCP’d on your local network; you can change that setting through the browser on your client OS (we made it static on the local network). All of this is detailed in the release notes available on the VE page of DevCentral. After your VLANs are set, then you need to go into the client OSes and set the route to point at the BIG-IP address on the internal VLAN. Bing! At that point you should be able to ping the world. If you can, then you can create the nodes, pools, virtuals in the BIG-IP and start routing traffic!

I won’t go into a ton of detail here, since this article is running long already, but here are some screenshots of BIG-IP LTM VE with everything configured. Here’s the list of nodes as-we-added them… And the shots of the pool, showing it using these nodes… And the VS, showing it using the pool. All is set, the VS is listening on 192.168.42.110.

There’s more I’d like to show you, but I’m not only out of room, this is longer than I’d wanted. So go forth and play, and I’ll circle back with some fun stuff like setting up iRules and such.
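If you want a small taste of that before the follow-up article, here is a minimal, hedged iRule sketch you could attach to the test virtual server above (the one listening on 192.168.42.110) simply to confirm traffic is flowing through LTM VE. It is not required for anything described in this article:

    when CLIENT_ACCEPTED {
        # Log each client connection to the lab VS; entries land in /var/log/ltm.
        log local0. "Client [IP::client_addr]:[TCP::client_port] connected to [IP::local_addr]:[TCP::local_port]"
    }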
Deploying BIG-IP VE in VMware vCloud Director

Beginning with BIG-IP version 11.2, you may have noticed a new package in the Virtual Edition downloads folder for vCloud Director 1.5. VMware’s vCloud Director is a software solution enabling enterprises to build multi-tenant private clouds. Each virtual datacenter has its own resource set of cpu, memory, and disk that the vDC owner can allocate as necessary. F5 DevCentral is now running in these virtual datacenter configurations (as announced June 13th, 2012), with full BIG-IP VE infrastructure in place. This article will describe the deployment process to get BIG-IP VE installed and running in the vCloud Director environment.

Uploading the vCloud Image

The upload process is fairly simple, but it does take a while. First, after logging in to the vCloud interface, click catalogs, then select your private catalog. Once in the private catalog, click the upload button highlighted below. This will launch a pop up. Make sure the vCloud zip file has been extracted. When the .ovf is selected in this screen, it will grab that as well as the disk file after clicking upload. Now get a cup of coffee. Or a lot of them, this takes a while.

Deploying the BIG-IP VE OVF Template

Now that the image is in place, click on my cloud at the top navigation, select vApps, then select the plus sign, which will create a new vApp. (Or, the BIG-IP can be deployed into an existing vApp as well.) Select the BIG-IP VE template (bigip11_2 in the screenshot below) and click next. Give the vApp a name and click next. Accept the F5 EULA and click next. At this point, give the VM a full name and a computer name and click finish. I checked the network adapter box to show the network adapter type. It is not configurable at this point, and the flexible NIC is not the right one. After clicking finish, the system will create the vApp and build the VM, so maybe it’s time for another cup of coffee.

Once the build is complete, click into the vapp_test vApp. Right-click on the testbigip-11-2 VM and select properties. Do NOT power on the VM yet! CPU and memory should not be altered. More CPU won’t help TMM, there is no CMP yet in the virtual edition and one extra CPU for system stuff is sufficient. TMM can’t schedule more than 4G of RAM either. Click the “Show network adapter type” and again you’ll notice the NICs are not correct. Delete all the network interfaces, then re-add one at a time as many (up to 10 in vCloud Director) NICs as is necessary for your infrastructure. To add a NIC, just click the add button and then select the network dropdown and select Add Network.

At this point, you’ll need to already have a plan for your networking infrastructure. Organizational networks are usable in and between all vApps, whereas vApp networks are isolated to just that instance. I’ll show organizational network configuration in this article. Click Organization network and then click next. Select the appropriate network and click next. I’ve selected the Management network. For the management NIC I’ll leave the adapter type as E1000. The IP Mode is useful for systems where guest customization is enabled, but is still a required setting. I set it to Static-Manual and enter the self IP addresses assigned to those interfaces. This step is still required within the F5, it will not auto-configure the vlans and self IPs for you. For the remaining NICs that you add, make sure to set the adapter type to VMXNET 3. Then click OK to apply the new NIC configurations.
*Note that adding more than 5 NICs in VE might cause the interfaces to re-order internally. If this happens, you’ll need to map the mac address in vCloud to the mac addresses reported in tmsh and adjust your vlans accordingly.

Powering Up!

After the configuration is updated, right-click on the testbigip-11-2 VM and select power on. After the VM powers on, BIG-IP VE will boot. Login with root/default credentials and type config at the prompt to set the management ip and netmask. Select No on auto-configuration. Set the IP address. Then set the netmask. I selected no on the default route, but it might be necessary depending on the infrastructure you have in place. Finally, accept the settings. At this point, the system should be available on the management network. I have a linux box on that network as well so I can ssh into the BIG-IP VE to perform the licensing steps, as the vCloud Director console does not support copy/paste.
More Complexity, New Problems, Sounds Like IT!

It is a very cool world we live in, where technology is concerned. We’re looking at a near future where your excess workload, be it applications or storage, can be shunted off to a cloud. Your users have more power in their hands than ever before, and are chomping at the bit to use it on your corporate systems. IBM recently announced a memory/storage breakthrough that will make Flash disks look like 5.25 inch floppies. While we can’t know what tomorrow will bring, we can certainly know that the technology will enable us to be more adaptable, responsive, and (yes, I’ll say it) secure. Whether we actually are or not is up to us, but the tools will be available. Of course, as has been the case for the last thirty years, those changes will present new difficulties. Enabling technology creates issues… which create opportunity for emerging technology. But we have to live through the change, and deal with making things sane.

In the near future, you will be able to send backup and replication data to the cloud, reducing your on-site storage and storage administration needs by a huge volume. You can today, in fact, with products like F5’s ARX Cloud Extender. You will also be able to grant access to your applications from an increasing array of endpoint devices; again, you can do it today, with products like F5’s APM for VPN access and ASM for application security, but recent surveys and events in the security space should be spurring you to look more closely into these areas. SaaS is cool again in many areas that it had been ruled out – like email – to move the expense of relatively standardized high volume applications out of the datacenter and into the hands of trusted vendors. You can get email “in the cloud” or via traditional SaaS vendors.

That’s just some of the changes coming along, and guess who is going to implement these important changes, be responsible for making them secure, fast, and available? That would be IT. To frame the conversation, I’m going to pillage some of Lori’s excellent graphics and we’ll talk about what you’ll need to cover as your environment changes. I won’t use the one showing little F5 balls on all of the strategic points of control, but we do have one.

First, the points of business value and cost containment possible on the extended datacenter network. Notice that this slide is couched in terms of “how can you help the business”. Its genius is that Lori drew an architecture and then inserted business-relevant bits into it, so you can equate what you do every day to helping the business. Next up is the actual Strategic Points of Control slide, where we can see the technological equivalency of these points.

So these few points are where you can hook in to the existing infrastructure to gain enhanced control of your network – storage, global, WAN, LAN, Internet clients – by putting tools into place that will act upon the data passing through them and contain policies and programmability that give you unprecedented automation. The idea here is that we are stepping beyond traditional deployments, to virtualization, remote datacenters, cloud, varied clients, ever-increasing storage (and cloud storage of course), while current service levels and security will be expected to be maintained. That’s a tall order, and stepping up the stack a bit to put strategic points of control into the network helps you manage the change without killing yourself or implementing a million specialized apps, policies, and procedures just to keep order and control costs.
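To make the “policies and programmability” point concrete, here is one small, hedged iRule sketch of the kind of rule you might push to a local strategic point of control. The URI, subnet, pool, and redirect target are all hypothetical, not taken from the slides:

    when HTTP_REQUEST {
        # Hypothetical policy: only the corporate network may reach the admin
        # application; everyone else is redirected to a friendly page.
        if { [HTTP::uri] starts_with "/admin" } {
            if { [IP::addr [IP::client_addr] equals 10.0.0.0/8] } {
                pool admin_pool
            } else {
                HTTP::redirect "https://www.example.com/not-authorized"
            }
        }
    }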
At the Global Strategic Point of Control, you can direct users to a working instance of your application, even if the primary application is unavailable and users must be routed to a remote instance. At this same place, you can control access to restricted applications, and send unauthorized individuals to a completely different server than the application they were trying to access. That’s the tip of the iceberg, with load balancing to local strategic points of control being one of the other uses that is beyond the scope of this blog.

The Local Strategic Point of Control offers performance, reliability, and availability in the guise of load balancing, security in the form of content-based routing and application security – before the user has hit the application server – and encryption of sensitive data flowing internally and/or externally, without placing encryption burdens on your servers.

The Storage Strategic Point of Control offers up tiering and storage consolidation through virtual directories, heterogeneous security administration, and abstraction of the NAS heads. By utilizing this point of control between the user and the file services, automation can act across vendors and systems to balance load and consolidate data access. It also reduces management time for endpoint staff, as the device behind a mount/map point can be changed without impacting users.

Remote site VPN extension and DMZ rules consolidation can happen at the global strategic point of control at the remote site, offering a more hands-off approach to satellite offices. Note that WAN Optimization occurs across the WAN, over the local and global strategic points of control. Web Application Optimization also happens at the global or local strategic point of control, on the way out to the end point device.

What’s not shown is a large unknown in cloud usage – how to extend the control you have over the LAN out to the cloud via the WAN. Some things are easy enough to cover by sending users to a device in your datacenter and then redirecting to the cloud application, but this can be problematic if you’re not careful about redirection and bookmarks. Also, it has not been possible for symmetric tools like WAN Optimization to be utilized in this environment. Virtual appliances like BIG-IP LTM VE are resolving that particular issue, extending much of the control you have in the datacenter out to the cloud.

I’ve said before, the times are still changing; you’ll have to stay on top of the new issues that confront you as IT transforms yet again, trying to stay ahead of the curve.

Related Blogs: Like Load Balancing WAN Optimization is a Feature of Application ... Is it time for a new Enterprise Architect? Virtual Infrastructure in Cloud Computing Just Passes the Buck The Cloud Computing – Application Acceleration Connection F5 Friday: Secure, Scalable and Fast VMware View Deployment Smart Energy Cloud? Sounds like fun. WAN Optimization is not Application Acceleration The Three Reasons Hybrid Clouds Will Dominate F5 Friday: BIG-IP WOM With Oracle Products Oracle Fusion Middleware Deployment Guides Introducing: Long Distance VMotion with VMWare Load Balancers for Developers – ADCs Wan Optimization Functionality Cloud Control Does Not Always Mean 'Do it yourself' Best Practices Deploying IBM Web Sphere 7 Now Available
Load Balancing For Developers: Improving Application Performance With ADCs

If you’ve never heard of my Load Balancing For Developers series, it’s a good idea to start here. There are quite a few installments behind us, and I’m not going to look back in this post any more than I must to make it readable without going back… Meaning there’s much more detail back there than I’ll relate here.

Again after a lengthy sojourn covering other points of interest, I return to Load Balancing For Developers with a more holistic view – application performance. Lori has talked a bit about this topic, and I’ve talked about it in the form of Load Balancing benefits and algorithms, but I’d like to look more architecturally again, and talk about those difficult to uncover performance issues that web apps often face.

You’re the IT manager for the company’s Zap-n-Go website, it has grown nearly exponentially since launch, and you’re the one responsible for keeping it alive. Lately it’s online, but your users are complaining of sluggishness. Following the advice of some guy on the Internet, you put a load balancer in about a year ago, and things were better, but after you put in a redundant data center and Global Load Balancing services, things started to degrade again. Time to rethink your architecture before your product gets known as Zap-N-Gone… Again.

Thus far you have a complete system with multiple servers behind an ADC in your primary data center, and a complete system with multiple servers behind an ADC in your secondary data center. Failover tests work correctly when you shut down the primary web servers, and the database at the remote location is kept up to date with something like Data Guard for Oracle or Merge Replication Services for SQL Server. This meets the business requirement that the remote database is up-to-date except for those transactions in-progress at the moment of loss. This makes you highly HA, and if your ADCs are running as an HA pair and your Global DNS – like our GTM product – is smart enough to switch when it notices your primary site is down, most users won’t even know they’ve been shoved off to the backup datacenter. The business is happy, you’re sleeping at night, all is well.

Except that slowly, as usage for the site has grown, performance has suffered. What started as a slight lag has turned into a dragging sensation. You’ve put more web servers into the pool of available resources – or better yet, used your management tools (in the ADC and on your servers) to monitor all facets of web server performance – disk and network I/O, CPU and memory utilization. And still, performance lags.

Then you check on your WAN connection and database, and find the problem. Either the WAN connection is overloaded, or the database is waiting long periods of time for responses from the secondary datacenter. If you have things configured so that the primary doesn’t wait for acknowledgment from the secondary database, then your problem might be even more sinister – some transactions may never get deposited in the secondary datacenter, causing your databases to be out of synch. And that’s a problem because you need the secondary database to be as up to date as possible, but buying more bandwidth is a monthly overhead expense, and sometimes it doesn’t help – because the problem isn’t always about bandwidth, sometimes it is about latency. In fact, with synchronous real-time replication, it is almost always about latency.
Latency, for those who don’t know, is a combination of how far your connection must travel over the wire and the number of “bumps in the wire” that have been inserted. Not actually the number of devices, but the number and their performance. Each device that touches your data – packet inspection, load balancing, security, whatever the reason – adds time to the delivery window. So does traveling over the wires/fiber. Synchronous replication is very time sensitive. If it doesn’t hear back in time, it doesn’t commit the changes, and then the primary and secondary databases don’t match up.

So you need to cut down the latency and improve the performance of your WAN link. Conveniently, your ADC can help. Out-of-the-box it should have TCP optimizations that cut down the impact of latency by reducing the number of packets going back and forth over the wire. It may have compression too – which cuts down the amount of data going over the wire, reducing the number of packets required, which improves the “apparent” performance and the amount of data on your WAN connection. They might offer more functionality than that too. And you’ve already paid for an HA pair – putting one in each datacenter – so all you have to do is check what they do “out of the box” for WAN connections, and then call your sales representative to find out what other functionality is available.

F5 includes some functionality in our LTM product, and has more in our add-on WAN Optimization Module (WOM) that can be bought and activated on your BIG-IP. Other vendors have a variety of architectures to offer you similar functionality, but of course I work for and write for F5, so my view is that they aren’t as good as our products… Certainly check with your incumbent vendor before looking for other solutions to this problem.

We have seen cases where replication was massively improved with WAN Optimization. More on that in the coming days under a different topic, but just consider the thought that you can increase the speed and reliability of transaction-based replication (and indeed, file/storage replication, but again, that’s another blog), and you as a manager or a developer do not have to do a thing to your code. That implies the other piece – that this method of improvement is applicable to applications that you have purchased and do not own the source code for. So check it out… At worst you will lose a few hours tracking down your vendor’s options, at best you will be able to go back to sleep at night.

And if you’re shifting load between datacenters, as I’ve mentioned before, Long Distance vMotion is improved by these devices too. F5’s architecture for this solution is here – PDF deployment guide. This guide relies upon the WOM functionality mentioned above. And encryption is supported between devices. That means if you are not encrypting your replication, you can start without impacting performance, and if you are encrypting, you can offload the work of encryption to a device designed to handle it. And bandwidth allocation means you can guarantee your replication has enough bandwidth to stay up to date by giving it priority.

But you won’t care too much about that, you’ll be relaxing and dreaming of beaches and stock options… Until the next emergency crops up anyway.
Cloud Computing: Location is important, but not the way you think

The debate this week is on location; specifically we're back arguing over whether there exist such things as "private" clouds. Data Center Knowledge has a good recap of some of the opinions out there on the subject, and of course I have my own opinion. Location is, in fact, important to cloud computing, but probably not in the way most people are thinking right now. While everyone is concentrating on defining cloud computing based on whether it's local or remote, folks have lost sight that location is important for other reasons.

It is the location of data centers that is important to cloud computing. After all, a poor choice in physical location can incur additional risk for enterprises trusting their applications to a cloud computing provider. Enterprises residing physically in high risk areas - those prone to natural disasters, primarily - understand this and often try to mitigate that risk by building out a secondary data center in a less risky location, just in case.

But it's not only the physical and natural risk factors that need to be considered. The location of a data center can have a significant impact on the performance of applications delivered out of a cloud computing environment. If a cloud computing provider's primary data center is in India, or Russia, for example, and most of your users are in the U.S., the performance of that application will be adversely affected by the speed of light problem - the one that says packets can only travel so fast, and no faster, due to the laws of physics. While there are certainly ways to ameliorate the effects of the speed-of-light problem - acceleration and optimization techniques, for example - they are not a cure-all. The recent loss of 3 of 4 undersea cables that transport most of the Internet data between continents proves that accidents are not only naturally occurring, but man-made as well, and the effects can be devastating on applications and their users.

If you're using a cloud computing provider such as Blue Lock as a secondary or tertiary data center for disaster recovery, but their primary data center is merely a few miles from your primary data center, you aren't gaining much protection against a natural disaster, are you?

Location is, in fact, important in the choice of a cloud computing provider. You need to understand where their primary and secondary data centers are located in order to ensure that the business justification for using a cloud computing provider is actually valid. If your business case is built on the reduction of CapEx and OpEx maintaining a disaster recovery site, you should make certain that in the event of a local disaster the cloud computing provider's data center is unlikely to be affected as well, or you risk wasting your investment in that disaster recovery plan. Waste, whether large or small, of budgets today is not looked upon favorably by those running your business.

Given that portability across cloud computing providers today is limited, despite the claims of providers, it is difficult to simply move your applications from one cloud to another quickly. So choose your provider carefully, based not only on matching your business and technological needs to the model they support but on the physical location and distribution of their data centers. Location is important; not to the definition of cloud computing but in its usage.
Related articles by Zemanta: The Case for "Private Clouds" Economic and Environmental advantages of Cloud Computing Cloud Costing: Fixed Costs vs Variable Costs & CAPEX Vs OPEX Sun feeds data center pods to credit crunched Microsoft 2.0 feels data center pinch 3 steps to a fast, secure, and reliable application infrastructure Cloud API Propagation and the Race to Zero (Cloud Interoperability) Data Center Knowledge: Amazon's Cloud Computing Data Center Locations
F5 Friday: Cookie Cutter vApps Realized

An architectural solution to the challenge of IP-address dependency.

A rarely mentioned obstacle when attempting to duplicate or migrate enterprise-class applications is IP-dependency. Not just topological dependencies that are easily addressed with dynamic routing and switching protocols in conjunction with a boot script, but internal dependencies – the ones so deeply embedded in the application’s “identity” that to change the IP address is to break the installation and render it useless. These are the applications that, upon asking for an exported image for testing purposes, virtualization experts will tell you it is far more efficient to start from scratch, because the IP dependency issue will cause more trouble in the long term than simply starting over.

Moving such an application to a public cloud is nearly impossible due to this restriction, and any bursting or data center extension model is out of the question. This is also a problem locally, when attempting to build out a private cloud and IT services, particularly in production environments in which a multi-tenant model is employed by launching multiple instances of the same application with each designated for use by a specific logical group, i.e. a department, project, or business unit.

Ultimately what we want is the ability to create cookie cutter applications as a foundational element for IT as a Service. This requires network, security, and application policies – as well as the application – be encapsulated as templates, associated with the application, and applied on a per-instance basis. This ultimately enables application instance sizing and chargeback per logical group, and lays the foundation for push-button IT services in which a department can be one click away from an automated deployment of an application.

What’s standing in the way in many cases is the IP address dependency. Applications can’t be packaged up neatly into a holistic service along with their requisite network, security, and delivery policies because all of these services are tightly bound to the IP address of the application – and vice-versa. When an application is deployed, if it is reassigned a new IP address, every policy will also need to be updated, making the process not only lengthy but fraught with potential for misconfiguration due to stalls or human error. The dependency on IP addresses within these applications is not going away. To achieve the goal of a more mobile and service-focused data center, then, we need a way to work around the problem. Many see VMware vApps as the solution. But while vApps were designed with mobility and portability in mind, they do not address the IP address dependency obstacle. A solution to this seemingly unsolvable problem can be found in a collaborative architecture incorporating both global and local application delivery services.

A COLLABORATIVE F5–VMWARE ARCHITECTURAL SOLUTION

To avoid complexity in multi-DC topologies (and ultimately inter-cloud deployments), it is necessary to reduce the need for coordination between different teams by abstracting network addressing, rules and service names. Bridging networks is not enough – an application and protocol specific approach is needed. VLAN stretching approaches do not differentiate traffic ingress and egress for each datacenter. This means that application traffic can enter one datacenter, traverse the bridged network to the application in the other datacenter, and then return following the same path.
As the distance between data centers increases (as is desired for disaster recovery purposes), this “trombone routing” incurs heavy performance penalties due to latency. What we want is not single valid addresses for applications across datacenters, but rather portable addresses which can then be selected by a global abstraction based on best-path and best-performance for a given client in the context of their locality and the available resources in each datacenter.

Such an architecture is made possible by a rarely mentioned but very powerful feature of BIG-IP systems: route domains. Route domains give you the ability to segment (isolate) network traffic for different applications on the network. The BIG-IP system can process traffic for each application within its own route domain. Because route domains segment network traffic, they can also be used to assign the same IP address or subnet to more than one node on a network. Two nodes on the network can have the same IP address as long as each instance of the IP address resides in a separate routing domain. The ability to essentially duplicate IP address space in the same environment opens up the ability to create cookie cutter vApps complete with the appropriate network, security, and delivery policies required – an isolated, operationally consistent deployment. The problem then becomes ensuring that the right users are routed to the right application instance at the right time.

Using a phased implementation, IT organizations can resolve the issues that prevent the repeatable deployment of enterprise applications locally and globally.

PHASE 1

The focus of phase 1 is the elimination of re-addressing applications at the IP layer in multi-site deployments. This phase relies on BIG-IP Local Traffic Manager (LTM) and in particular route domains to allow the co-existence of architectures utilizing the same IP address space, and BIG-IP Global Traffic Manager (GTM) to determine which site is currently in use as the primary data center. In an active-standby deployment, this provides site-resilience by ensuring a secondary site is available to assume responsibility for delivering applications in the event of an outage at the primary site. In an active-active deployment, BIG-IP GTM leverages context shared by the local application delivery controller, BIG-IP LTM, to ensure better performance and availability without sacrificing fault tolerance. This deployment pattern is based on existing, proven global architectures providing site-resilience and location-based global load balancing.

PHASE 2

This phase also relies on BIG-IP Local Traffic Manager (LTM) and route domains to allow the co-existence of architectures utilizing the same IP address space. Context-awareness is leveraged as a means to properly route users to their designated application deployment. The context can be extracted from the URI or from other variables associated with the user, such as credentials or cookies. Multiple instances of the application architecture can be launched and co-exist within the data center, each serving a particular logical group. Each group can size applications based on usage needs, and chargeback per department becomes a less complex accounting process as it is based on the instance and its supporting architectural components. Application architectures can be successfully repeated at the logical group level, enabling a smoother transition to IT as a Service and preserving the IP-address dependencies on which many applications rely.
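As a rough sketch of the Phase 2 context-based steering (my illustration, not a supported configuration), an iRule at the local strategic point of control could map a tenant identifier pulled from the URI to that logical group's pool. The datagroup and pool names are hypothetical, and each pool could reference members in that tenant's own route domain (for example 10.1.1.10%1 versus 10.1.1.10%2):

    when HTTP_REQUEST {
        # Hypothetical: the first path segment names the logical group
        # (e.g. /hr/... or /finance/...), and a datagroup maps that name to
        # the group's instance-specific pool.
        set tenant [string tolower [getfield [HTTP::path] "/" 2]]
        set tenant_pool [class match -value $tenant equals tenant_pools]
        if { $tenant_pool ne "" } {
            pool $tenant_pool
        }
        # No match simply falls through to the virtual server's default pool.
    }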
ARCHITECTURE is KEY

As is increasingly the case, the solution to many of the challenges arising from multi-site, cloud computing, and highly virtualized data centers is architectural. Because the challenges often span data center domains – security, networking, storage, compute, and applications – the solution requires cross-domain collaboration, not just of teams but of infrastructure. Cloud computing really is an exercise in infrastructure integration. By leveraging the strengths and capabilities of various data center components across various domains, solutions can be architected to address even the seemingly unsolvable problems that will continue to frustrate IT as it moves toward a more distributed and highly dynamic data center.
Cloud vs Cloud

The Battle of the Clouds

Aloha! Welcome ladies and gentlemen to the face-off of the decade: The Battle of the Clouds. In this corner, the up and comer, the phenom that has changed the way IT works, wearing the light shorts - The Cloud! And in this corner, your reigning champ, born and bred of Mother Nature with unstoppable power, wearing the dark trunks - Storm Clouds!

You’ve either read about or lived through the massive storm that hit the Mid-Atlantic coast last week. And, by the way, if you are going through a loss, damage or worse, I do hope you can recover quickly and wish you the best. The weather took out power for millions including a Virginia ‘cloud’ datacenter which hosts a number of entertainment and social media sites. Many folks looking to get thru the candle-lit evenings were without their fix. While there has been confusion and growing pains over the years as to just what ‘cloud computing’ is, this instance highlights the fact that even The Cloud is still housed in a data center, with four walls, with power pulls, air conditioning, generators and many of the features we’ve become familiar with ever since the early days of the dot com boom (and bubble). They are physical structures, like our homes, that are susceptible to natural disasters among other things. Data centers have outages all the time, but a single traditional data center outage might not get attention since it may only involve a couple companies – when a ‘cloud’ data center crashes, it could impact many companies and, like last week, it grabbed headlines.

Business continuity and disaster recovery are among the main concerns for organizations since they rely on their system’s information to run their operations. Many companies use multiple data centers for DR and most cloud providers offer multiple cloud ‘locations’ as a service to protect against the occasional failure. But it is still a data center and most IT professionals have come to accept that a data center will have an outage – it’s just a question of how long and what impact or risk is introduced. In addition, you need the technology in place to be able to swing users to other resources when an outage occurs. A good number of companies don’t have a disaster recovery plan however, especially when backing up their virtual infrastructure in multiple locations. This can be understandable for smaller start-ups if backing up data means doubling their infrastructure (storage) costs, but can be doubly disastrous for a large multi-national corporation.

While most of the data center services have been restored and the various organizations are sifting through the ‘what went wrong’ documents, it is an important lesson in redundancy… or the risk of lack of it. It might be an acceptable risk and a conscious decision since redundancy comes with a cost – dollars and complexity. A good read about this situation is Ben Coe’s My Friday Night With AWS.

The Cloud has been promoting (and proven to some extent) its resilience, DR capabilities and its ability to technologically recover quickly, yet Storm Clouds have proven time and again that their power is unmatched… especially when you need power to turn on a data center.
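That "swing users to other resources" piece is where the application delivery tier earns its keep. As one small, hedged illustration (the pool names are hypothetical, and in practice this is usually handled with GTM plus LTM health monitors rather than a hand-rolled rule), an LTM iRule can push traffic to a disaster recovery pool whenever the primary pool has no members passing health checks:

    when HTTP_REQUEST {
        # Hypothetical pools: fail over to the DR site's pool when the primary
        # pool has no available members.
        if { [active_members primary_pool] < 1 } {
            pool dr_site_pool
        } else {
            pool primary_pool
        }
    }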
ps

Resources
Virginia Storm Knocks Out Popular Websites
Millions without power as heat wave hammers eastern US
Amazon Power Outage Exposes Risks Of Cloud Computing
My Friday Night With AWS
Modern life halted as Netflix, Pinterest, Instagram go down
Storm Blamed for Instagram, Netflix, and Foursquare Outages
(Real) Storm Crushes Amazon Cloud, Knocks out Netflix, Pinterest, Instagram

Mission Impossible: Stateful Cloud Failover
The quest for truly stateful failover continues…

Lightning was the latest cause of an outage at Amazon, this time in its European zones. Lightning, like tornadoes, volcanoes, and hurricanes, is often categorized as an “Act of God” and therefore beyond the sphere of control of, well, anyone other than God. Outages or damages caused by such events are rarely reimbursable, and it’s very hard to blame an organization for not having a “plan” to react to the loss of both primary and secondary power supplies due to intense lightning strikes. The odds of a lightning strike are pretty long in the first place – 576,000 to 1 – and though the results can be disastrous, such risk is often categorized as low enough not to warrant a specific “plan” to redress.

What’s interesting about the analysis of the outage is the focus on what is, essentially, stateful failover capability. The Holy Grail of disaster recovery is to design a set of physically disparate systems in which the secondary system, in the event the primary fails, immediately takes over with no interruption in service or loss of data. Yes, you read that right: it’s a zero-tolerance policy with respect to service and data loss. And we’ve never, ever achieved it. Not that we haven’t tried, mind you, as Charles Babcock points out in his analysis of the outage:

Some companies now employ a form of disaster recovery that stores a duplicate set of virtual machines at a separate site; they're started up in the event of failure at the primary site. But Kodukula said such a process takes several minutes to get systems started at an alternative site. It also results in loss of several minutes worth of data. Another alternative is to set up a data replication system to feed real-time data into the second site. If systems are kept running continuously, they can pick up the work of the failed systems with a minimum of data loss, he said. But companies need to employ their coordination expertise to make such a system work, and some data may still be lost.

-- Amazon Cloud Outage: What Can Be Learned? (Charles Babcock, InformationWeek, August 2011)

Disaster recovery plans are designed and implemented with the intention of minimizing loss. Practitioners are well aware that a zero-tolerance policy toward data loss for disaster recovery architectures is unrealistic. That is in part due to the “weakest link” theory, which says a system is only as good as its weakest component. No application or network component can perform absolutely zero-tolerance failover, a.k.a. stateful failover, on its own. There is always the possibility that a few connections, transactions or sessions will be lost when a system fails over to a secondary system.

Consider it an immutable axiom of computer science that distributed systems can never be data-level consistent. Period. If you think about it, you’ll see why we can deduce, then, that we’ll likely never see stateful failover of network devices. Hint: it’s because ultimately all state, even in network components, is stored in some form of “database” – whether structured, unstructured, or table-based – and distributed systems can never be data-level consistent. And if even a single connection|transaction|session is lost from a component’s table, it’s not stateful, because stateful implies zero-tolerance for loss.

WHY ZERO-TOLERANCE LOSS is IMPOSSIBLE

Now consider what we’re trying to do in a failover situation.
Generally speaking, we’re talking about component-level failure which, in theory and practice, is much easier than a full-scale architectural failover scenario. One device fails, the secondary takes over. As long as data has been synchronized between the two, we should theoretically be able to achieve stateful failover, right?

Except we can’t. One of the realities of high availability architectures is that synchronization is not continuous, and neither is heartbeat monitoring (the mechanism by which redundant pairs periodically check to ensure the primary is still active). These processes occur on a periodic interval as defined by operational requirements, but are generally in the 3-5 second range. Assuming a connection from a client is made at point A, and the primary component fails at point A+1 second, it is unlikely that its session data will be replicated to the secondary before point A+3 seconds, at which time the secondary determines the primary has failed and takes over operation. This “miss” results in data loss. A minute amount, most likely, but it’s still data loss.

Basically, the axiom that zero-tolerance loss is impossible is a manifestation in the network infrastructure of Brewer’s CAP theorem at work, which says you cannot simultaneously have Consistency, Availability and Partition tolerance. This is evident before we even consider the slight delay that occurs on the network – despite the use of gratuitous ARP to ensure a smooth(er) transition between units in the event of a failover – during which time the service may be (rightfully) perceived as unavailable. But we don’t need to complicate things any more than they already are, methinks.

What is additionally frustrating, perhaps, is that the data loss could potentially be caused by a component other than the one that fails. That is, because of the interconnected and therefore interdependent nature of the network, a sort of cascading effect can occur in the event of a failover based on the topological design of the systems. It’s the architecture, silly, that determines whether data loss will be localized to the failing components or cascade throughout the entire architecture. High availability architectures based on a parallel data path design are subject to higher data loss throughout the network than are those based on cross-connected data path designs. Certainly the latter is more complicated and harder to manage, but it’s less prone to a data loss cascade throughout the infrastructure.

Now, add in the fact that cloud-based disaster recovery systems necessarily leverage a network connection instead of a point-to-point serial connection. Network latency can lengthen the process and, if the failure is in the network connection itself, will obviously increase the amount of data lost, because synchronization cannot occur at all during the period when the failed primary is still “active” and before the secondary realizes there’s a problem, Houston, and takes over responsibility.

Now take these potential hiccups and multiply them by every redundant component in the system you are trying to assure availability for, and remember that failing over to a second “site” requires not only data (as in database) replication but also state data replication across the entire infrastructure. Clearly, this is an unpossible task.
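To make the synchronization arithmetic above concrete, here is a deliberately simplified model in Python. It is not how any particular HA pair is implemented; the three-second sync and heartbeat intervals are simply the assumptions from the example, and the point is only to show that state accepted between sync runs is lost no matter how fast the takeover happens.

# Toy model of the synchronization window described above.
# The intervals are assumptions taken from the example, not real defaults.

SYNC_INTERVAL = 3.0       # seconds between state-sync runs (assumed)
HEARTBEAT_INTERVAL = 3.0  # seconds between heartbeat checks (assumed)

def lost_window(failure_time):
    """Seconds of state accepted by the primary but never replicated.

    Anything written after the last completed sync and before the failure
    is gone, regardless of how quickly the secondary takes over.
    """
    last_sync = (failure_time // SYNC_INTERVAL) * SYNC_INTERVAL
    return failure_time - last_sync

def detection_delay(failure_time):
    """Seconds until the next heartbeat check notices the dead primary."""
    next_check = ((failure_time // HEARTBEAT_INTERVAL) + 1) * HEARTBEAT_INTERVAL
    return next_check - failure_time

if __name__ == "__main__":
    # The example from the text: connection at point A = 0, failure at A+1.
    t_fail = 1.0
    print("Unreplicated state window: %.1f seconds" % lost_window(t_fail))
    print("Failover detection delay:  %.1f seconds" % detection_delay(t_fail))
    # Roughly one second of state is lost and two more pass before takeover:
    # a minute amount, but not zero, which is the whole point.

Shrinking the intervals shrinks the window but never closes it, which is exactly why zero-tolerance loss remains out of reach.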
A truly stateful cloud-based failover might occur if the stars were aligned just right, the chickens sacrificed, and the no-fail dance performed. And even then, I’d bet against it happening. The replication of state residing in infrastructure components, regardless of how necessary those components may be, is almost never, ever attempted. The reality is that we have to count on some data loss, and the best strategy we can have is to minimize that loss – in part by minimizing the data that must be replicated and the components that must be failed over. Or is it?

F5 Friday: Elastic Applications are Enabled by Dynamic Infrastructure
The Database Tier is Not Elastic
Greedy (IT) Algorithms
The Impossibility of CAP and Cloud
Brewer’s CAP Theorem
Joe Weinman – Cloud Computing is NP-Complete Proof
Cloud Computing Goes Back to College