on 24-Aug-2012 13:35
#fsi problems are very public, but they offer warnings for all enterprises.
The recent troubles in High Frequency Trading (HFT), including the problems on the NASDAQ during Facebook’s IPO and Knight Trading’s $400 million USD loss, are a clear warning bell to High Frequency Trading organizations. The warning comes in two parts: “Testing is not optional” and “Police yourselves or you will be policed”. Systems glitches happen in every industry, and we’ve all been victims of them, but Knight in particular has been held up as an example of rushing to market and causing financial harm. Not only did Knight’s own investors lose most of their investment (between the drop in the stock price and the share dilution their big-bank bailout entailed, it is estimated that they lost 80% or more in a couple of weeks), but when the price of a stock fluctuates dramatically, particularly a big-name stock like most of the hundred or so they were overtrading, it creates winners and losers in the market. For every person who bought cheap and sold high, there was a seller at the low end and a buyer at the high end. Regulatory agencies and a group of university professors are reportedly looking into an ISO 9000-style quality control system for the HFT market. One more major glitch could be all it takes to send them to the drafting table, so it is time for the industry to band together and put its own controls in place, or let others dictate them.
Quality assurance in every highly competitive industry has this problem. QA is a cost center, and while everyone wants a quality product, many are annoyed by the “interference” QA brings into the software development process. This can be worse in highly complex networks like those HFT firms or large multinationals require, because replicating the network for QA purposes can be a daunting project. This is somewhat less relevant in smaller organizations, but there are certainly mid-sized companies with networks every bit as complex as those of large multinationals.
Luckily, we have reached a point where a QA environment can be quickly configured and reconfigured, so testing can focus on finding quality problems in the software rather than on configuring the environment (running cables and the like). That environment work has traditionally cost QA teams for networked applications a lot of time, or a lot of money spent maintaining a full copy of the production network.
From this point forward, I will mention F5 products by name. Please feel free to insert your favorite vendor’s name if they have a comparable product. F5 is my employer, so I know what our gear is capable of; I know less about competitors’ products, so call your sales folks and ask them whether they support the functionality described. It wouldn’t hurt to do that with F5 either. Our sales people know a ton and can clarify anything that isn’t clear in this blog.
In the 21st century, testing and virtualization go hand-in-hand. There are a couple of different forms of network virtualization that can help with testing, depending upon the needs of your testing team and your organization. I refer to them as QA testing and performance testing; think of them as “low throughput testing” and “high throughput testing”. If you’re not testing performance, you don’t need to see a jillion connections a second, but you do need to see everything the application does and make certain it matches the requirements (more often the design, but what happens when the design doesn’t adequately address the requirements and testing is based on the design is a topic for a different blog post…).
For low throughput testing, virtualization has been king for a good long while, with only cloud even pretending to challenge the benefits of a virtualized environment. Since “cloud” in this case is simply IaaS running VMs, I see no difference for QA purposes; this example could run in the cloud or in VMs on your desktop. Dropping a Virtual Application Delivery Controller (vADC) into the VM environment lets you provision network objects the same way they are provisioned in the production network.
This is very useful for testing how multiple-instance applications behave when load balanced. It doesn’t take much of a mistake to turn the database into the bottleneck in a multiple-instance web application. Really. I’ve seen it happen. QA testing can reveal this type of behavior without the throughput of a production network, as long as the test network is designed to handle load-balanced copies of the application.
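To put rough numbers on the database-bottleneck point, here is a toy capacity model. It assumes (my assumption, not a measurement from any real system) that the app tier scales linearly with the number of load-balanced instances, while every request sends a fixed number of queries to one shared database with a fixed ceiling. The function names and figures are hypothetical.

```python
"""Toy capacity model: why adding load-balanced app instances stops helping
once a shared database saturates. All numbers are hypothetical."""


def tier_limits(instances: int, per_instance_rps: float,
                db_queries_per_request: float, db_max_qps: float) -> tuple[float, float]:
    """Return (app-tier limit, database limit), both in requests/second."""
    app_limit = instances * per_instance_rps            # scales with instance count
    db_limit = db_max_qps / db_queries_per_request      # fixed by the shared database
    return app_limit, db_limit


def effective_throughput(instances, per_instance_rps, db_queries_per_request, db_max_qps):
    """Sustainable requests/second: the tighter of the two limits wins."""
    return min(tier_limits(instances, per_instance_rps, db_queries_per_request, db_max_qps))


def bottleneck(instances, per_instance_rps, db_queries_per_request, db_max_qps) -> str:
    app_limit, db_limit = tier_limits(instances, per_instance_rps,
                                      db_queries_per_request, db_max_qps)
    return "database" if db_limit < app_limit else "app tier"


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        rps = effective_throughput(n, 200, 5, 3000)
        print(f"{n} instance(s): {rps} req/s, limited by {bottleneck(n, 200, 5, 3000)}")
```

With 200 requests/second per instance, 5 queries per request, and a database that tops out at 3,000 queries/second, going from one to two instances doubles throughput, but at four instances the database caps the whole system at 600 requests/second, and eight instances buy nothing more. A QA network with just a few load-balanced copies is enough to expose that plateau.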
It is also very useful for security screening, assuming the vADC supports a Web Application Firewall (WAF) like the BIG-IP with its Application Security Manager. While testing security through a WAF is useful, the power of the WAF really comes into play when a security flaw is discovered in QA testing. Many times, that flaw can be compensated for with a WAF, and having one on the QA network allows staff to test with and without the WAF. Should the WAF provide cover for the vulnerability, an informed decision can then be made about whether the application deployment must be delayed for a bug fix, or if the WAF will be allowed to handle protection of that particular vulnerability until the next scheduled update. In many cases, this saves both time and money.
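The with-and-without-WAF decision above can be written down as a simple triage rule. This is a minimal sketch, assuming each finding has been reproduced twice in QA, once with the WAF bypassed and once with traffic routed through it; the `Finding` fields and outcome labels are hypothetical and not part of any F5 product.

```python
from dataclasses import dataclass

# Hypothetical triage outcomes.
NO_ACTION = "no action"
MITIGATED = "ship; WAF covers it until the next scheduled update"
BLOCKER = "delay deployment for a bug fix"


@dataclass(frozen=True)
class Finding:
    """One security finding from QA, tested with and without the WAF in path."""
    name: str
    exploitable_direct: bool  # attack succeeded with the WAF bypassed
    blocked_by_waf: bool      # same attack was blocked when routed through the WAF


def triage(finding: Finding) -> str:
    if not finding.exploitable_direct:
        return NO_ACTION
    if finding.blocked_by_waf:
        return MITIGATED
    return BLOCKER


def release_decision(findings: list[Finding]) -> str:
    """Delay only if some exploitable finding is NOT covered by the WAF."""
    return "delay" if any(triage(f) == BLOCKER for f in findings) else "ship"


if __name__ == "__main__":
    findings = [
        Finding("reflected XSS on /search", exploitable_direct=True, blocked_by_waf=True),
        Finding("verbose error page", exploitable_direct=False, blocked_by_waf=False),
    ]
    for f in findings:
        print(f"{f.name}: {triage(f)}")
    print("release:", release_decision(findings))
```

The code itself is trivial; the value is that the decision becomes explicit and repeatable, and the `blocked_by_waf` answer only exists if there is a WAF on the QA network to test through.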
In cases of heavy backend transport impacting the performance of web applications – like mirroring database calls to a remote datacenter – the use of a WAN Optimization manager can be evaluated in test to see if it helps performance without making changes to the production network.
Testing network object configurations is easier too. If the test environment is set up to mirror the production network, the only difference being that testing is 100% virtualized, then the exact network objects (load balancing, WAN optimization, application acceleration, security, and WAF) can all be configured in QA exactly as they will be configured in production. This allows thorough testing of the entire infrastructure, not just the application being deployed.
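One cheap way to keep QA honest about mirroring production is to diff the two sets of network object definitions. The sketch below assumes both configurations have already been exported into plain Python dictionaries keyed by object name; the object names, setting names, and the set of environment-specific keys to ignore are all hypothetical, and real exports would need their own parsing.

```python
from typing import Any

# Settings expected to differ between QA and production (hypothetical names).
ENV_SPECIFIC = frozenset({"address", "pool_members"})

Config = dict[str, dict[str, Any]]


def config_drift(qa: Config, prod: Config, ignore: frozenset[str] = ENV_SPECIFIC) -> list[str]:
    """List every difference between QA and production network objects,
    skipping settings that are legitimately environment-specific."""
    problems: list[str] = []
    for name in sorted(qa.keys() | prod.keys()):
        if name not in qa:
            problems.append(f"{name}: missing in QA")
            continue
        if name not in prod:
            problems.append(f"{name}: missing in production")
            continue
        for key in sorted((qa[name].keys() | prod[name].keys()) - ignore):
            qa_val, prod_val = qa[name].get(key), prod[name].get(key)
            if qa_val != prod_val:
                problems.append(f"{name}.{key}: QA={qa_val!r} prod={prod_val!r}")
    return problems


if __name__ == "__main__":
    qa = {"vs_web": {"lb_method": "least-connections", "waf_policy": "strict",
                     "address": "10.0.0.10"}}
    prod = {"vs_web": {"lb_method": "round-robin", "waf_policy": "strict",
                       "address": "203.0.113.10"},
            "vs_api": {"lb_method": "round-robin"}}
    for line in config_drift(qa, prod):
        print(line)
```

An empty result means QA and production agree on everything except the settings you chose to ignore; anything else is a test that was run against a network that doesn’t match the one the app will live on.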
For high-throughput testing, the commodity hardware that runs VMs can be a limiting factor in the sense that the throughput in test needs to match the expected usage of the application at peak times. For these scenarios, organizations with high-volume, mission-critical applications to test can run the same exact testing scenario using a hardware chassis capable of multi-tenancy. As always, I work for F5 so my experience is best couched in F5 terms. Our VIPRION systems are capable of running multiple different BIG-IP instances per blade. That means that in test, the exact same hardware that will be used in production can be used for performance evaluation.
Everything said above about QA testing (WAF, application acceleration, testing for bottlenecks) still applies. The biggest difference is that the tests run on a physical machine, which can make testing against the cloud more difficult, since the machine cannot be moved into the cloud environment.
To resolve this particular issue, a hybrid model can be adopted: in F5’s case, VIPRION on the datacenter side and BIG-IP VE on the cloud side.
Management tools like the iApps Analytics built into F5 Enterprise Manager (EM) allow testers to see which portion of the architecture is limiting performance, saving man-hours otherwise spent hunting for problems.
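The kind of question those analytics answer (which tier is eating the time?) can be sketched in a few lines. This assumes per-request timings broken down by tier are already available from whatever monitoring you use; the tier names and sample numbers are made up for illustration, and this is not how Enterprise Manager computes anything.

```python
from statistics import median


def slowest_tier(samples: list[dict[str, float]]) -> tuple[str, float]:
    """Given per-request timings broken down by tier (in ms), return the tier
    with the highest median time along with that median."""
    if not samples:
        raise ValueError("need at least one timing sample")
    by_tier: dict[str, list[float]] = {}
    for sample in samples:
        for tier, ms in sample.items():
            by_tier.setdefault(tier, []).append(ms)
    # Median rather than mean, so a few outlier requests don't decide the answer.
    medians = {tier: median(times) for tier, times in by_tier.items()}
    worst = max(medians, key=medians.get)
    return worst, medians[worst]


if __name__ == "__main__":
    # Made-up timings for three requests.
    samples = [
        {"network": 4, "app": 20, "db": 90},
        {"network": 5, "app": 25, "db": 110},
        {"network": 6, "app": 22, "db": 95},
    ]
    print(slowest_tier(samples))  # ('db', 95)
```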
It’s Still About the App and the Culture
In the end, the primary point of testing is to catch coding errors that would cause the organization real pain and get them fixed before the application goes live. Including network resources in testing reflects the growing complexity of the infrastructure supporting many web-based applications. Just as you wouldn’t test a mainframe app on a PC, testing a networked app outside of its target environment is not conclusive.
But the story at Knight Trading does not appear to be about testing so much as culture. In a rush to meet an artificial deadline, they appear to have cut corners and pushed changes in the night before.
You can’t fix problems with testing if you aren’t testing. Many IT shops need to take that to heart. The testers I have worked with over the years are astounding folks with a lot of smarts, but all of them suffer from the same problem: their organization doesn’t value testing as highly as it values other IT functions. Dedicated testing time is often in short supply and is the first thing to go when deadlines slip. Quite often, testers are developers who have been handed the added responsibility of testing. But as many of us have said over the years, and will continue to say, testing your own code is not a good way to find bugs. Think about it, really think about it: if a developer thinks of a case while testing, he or she probably thought of it during development too. Problems not considered during development can certainly still be caught, but a fresh set of eyes setting up tests outside the context of the developer’s assumptions is always a good idea.
And Yeah, It’s Happened to Me
Early in my career, I was called upon to make a change to a software package used by some of the largest banks in the US. I ran the change out in a couple of hours and tested it, but there was pressure to get it out the door, so the rockstars in our testing department didn’t even know about the change, and we delivered it electronically to our largest banking customer, which was one of the orgs demanding the change. In their environment, the change overwrote the database our application used. It literally destroyed a year’s worth of sensitive data. Thankfully, they had followed our advice and backed up the database first. While they were restoring, I pawed through the affected change line by line and found that the error destroying their database was also occurring in my code; it just wasn’t doing any real harm (it was writing over a different file on my system), so I didn’t notice it. Testing would have saved all of us a ton of pain, because it would have put the app in a whole new environment. But this was a change “that they’re demanding NOW!” The bank in question lost the better part of a day restoring things to normal, and my company took an integrity hit with several major customers. I learned then that a few hours of testing on a change that took a few hours to write is worth the investment, no matter how much pressure there is to deliver.
Since then, I have definitely pushed to have a test phase with individuals not involved with development running the app. And of course, the more urgent the change, the more I want at least one person to install and test outside of my dev machine.
And you should too.