Patch Management - Bits, Bad Guys, and Bucks!
(This article was originally published in 2003 by Secure Business Quarterly, a now-defunct publication. Not having an original copy handy and not being able to refer people to the original site, I have retrieved a copy from the Internet Archive Wayback Machine (dated 2006 in their archive). The text of the original article is reproduced here for convenience.)
After the flames from Slammer's attack were doused and the technology industry caught up on its lost sleep, we started asking questions. Why did this happen? Could we have prevented it? What can we do to keep such a thing from happening again?
These are questions we ask after every major security incident, of course. We quickly learned that the defect in SQL Server had been identified and patches prepared for various platforms more than six months before, so attention turned to system administrators. Further inquiry, however, shows that things are more complex.
Several complicating factors conspired to make successful patching of this system problematic. First of all, there were several different patches out, none of which had been widely or well publicized. In addition, there were confusing version incompatibilities that turned the patching of some systems into a much larger endeavor, as chains of dependencies had to be unraveled and entire sets of patches applied and tested. And finally, to add insult to injury, at least one patch introduced a memory leak into SQL Server.
As if that weren't enough, MSDE includes an invisible SQL Server. MSDE comes with a component of Visual Studio, which made that product vulnerable even though it included neither an explicit SQL Server license nor any DBA visibility. That shouldn't have added risk, except that some pieces of software were shipped with MSDE and other no-longer-needed parts of the development environment included. As we all know, many software products are shipped with development artifacts intertwined with the production code because disk space is cheaper than keeping track of and subsequently removing all of the trash lying around in the development tree. And those development tools are really useful when tech support has to diagnose a problem.
To compound the challenge, patches in general can't be trusted without testing. A typical large environment runs multiple versions of desktop operating systems, say NT4, 2000, XP Home, and XP Pro. If the patch addresses multiple issues across several versions of a common application, you're talking about a product that has about fifty configuration permutations. Testing that many cases represents significant time and cost.
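To see how quickly the test matrix grows, here is a minimal sketch. The four operating systems come from the paragraph above; the count of four application versions and three issues per patch is purely an assumption, chosen only to land near the "about fifty" figure.

```python
from itertools import product

# Desktop operating systems named in the article.
os_versions = ["NT4", "2000", "XP Home", "XP Pro"]

# Hypothetical: the article says only "several versions" and "multiple issues".
app_versions = ["app 1.0", "app 2.0", "app 3.0", "app 4.0"]
issues_fixed = ["issue A", "issue B", "issue C"]

# Every combination is a case that, in principle, needs its own test.
test_matrix = list(product(os_versions, app_versions, issues_fixed))

print(len(test_matrix))  # 48
```

Adding one more operating system or application version multiplies the total rather than adding to it, which is why simplifying configurations pays off so quickly.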
Finally, there's the sheer volume of patches flowing from the vendor community. There's no easy way for an administrator to tell whether a particular patch is 'really serious,' 'really, really, serious,' or 'really, really, really, serious.' The industry hasn't yet figured out how to normalize all of the verbiage. Even so, knowing that a weakness exists doesn't give any insight into how virulent a particular exploit of that weakness might prove to be. Slammer was remarkably virulent, but its patch went out to the systems community along with thousands of remedies for other weaknesses that haven't been exploited nearly so effectively.
Unfortunately, an aggressive strategy of applying every patch is uneconomical under the industry's current operating model. Let's look at some numbers. These are benchmark numbers, broadly typical of costs and performance across the industry, not specific to any individual company.
The state of automation in the desktop OS world has improved dramatically in the last ten years. A decade ago an upgrade to a large population of desktop machines required a human visit to each machine. Today the automated delivery of software is dramatically superior, but not where it ultimately needs to be.
Let's say, for the purpose of argument, that automated patch installation for a large network is 90% successful. (A colleague suggests that today a more realistic number is 80%, "even assuming no restriction on network capacity and using the latest version of SMS"; he characterized 90% as the "go out and get drunk" level of success.) A person must visit each of the 10% of machines for which the automated installation failed, figure out what went wrong, and install the patch by hand. For a large corporate network with, say, 50,000 machines to be patched, that translates to 5,000 individual visits. A benchmark number for human support at the desktop is about $50 per visit. Thus, a single required patch in a modern environment translates into a $250,000 expenditure. That's not a trivial amount of money, and it makes the role of a system manager, who faces tight budgets and skeptical customers, even more challenging.
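The arithmetic above can be written out as a short sketch. The figures are the article's benchmark numbers; the function names are illustrative only.

```python
def manual_visits(machines: int, success_pct: int) -> int:
    """Machines that still need a human visit after automated installation."""
    return machines * (100 - success_pct) // 100


def patch_cost(machines: int, success_pct: int, cost_per_visit: int) -> int:
    """Cost of hand-installing one patch wherever automation failed."""
    return manual_visits(machines, success_pct) * cost_per_visit


# Benchmark figures from the article: 50,000 machines, $50 per desk-side visit.
print(manual_visits(50_000, 90))    # 5000
print(patch_cost(50_000, 90, 50))   # 250000
print(patch_cost(50_000, 80, 50))   # 500000
```

At the colleague's more pessimistic 80% success rate, the same patch costs twice as much.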
The costs aside, how close to 100% is required to close a loophole? Informal comments from several CISOs suggest that Slammer incapacitated corporate networks with roughly 50,000 hosts by infecting only about two hundred machines. That's 0.4%. With NIMDA, it was worse: one enterprise disconnected itself from the Internet for two weeks because it had two copies of NIMDA that were actually triggered.
What can we do to make our systems less vulnerable and reduce both the probability of another Slammer incident and, more importantly, the harm that such an incident threatens?
We can work together in the industry to improve the automated management of systems. Every one-percentage-point improvement in the performance of automated patch installation systems translates directly into $25,000 cash savings for each required patch in our example. An improvement of nine percentage points, from 90% to 99%, translates into a savings of $225,000 for each patch that must be distributed. Do that a few times, and, as Everett Dirksen noted, "pretty soon you're talking about real money."
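A small sketch of the savings figures, using the same benchmark numbers as the example (50,000 machines, $50 per visit). The function names are illustrative, and the last comparison is simple arithmetic on the article's figures rather than a claim made in the original.

```python
MACHINES = 50_000
COST_PER_VISIT = 50  # dollars, the article's benchmark


def manual_visits(machines: int, success_pct: int) -> int:
    """Machines left for a human to patch after automated installation."""
    return machines * (100 - success_pct) // 100


def savings(machines: int, old_pct: int, new_pct: int, cost_per_visit: int) -> int:
    """Dollars saved per patch when automation improves from old_pct to new_pct."""
    avoided = manual_visits(machines, old_pct) - manual_visits(machines, new_pct)
    return avoided * cost_per_visit


print(savings(MACHINES, 90, 91, COST_PER_VISIT))  # 25000
print(savings(MACHINES, 90, 99, COST_PER_VISIT))  # 225000

# Share of hosts infected in the Slammer anecdote: about 200 of 50,000.
infected_pct = 200 * 100 / MACHINES
print(infected_pct)  # 0.4

# Even at 99% automated success, more machines remain unpatched than the
# roughly two hundred infections that incapacitated those networks.
print(manual_visits(MACHINES, 99))  # 500
```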
Improving the effectiveness of patch application requires that we improve both our software packaging and our patch distribution and execution automation. After we improve packaging and distribution, we can figure out a way to easily and quickly tell what components reside on each system, ideally by asking the system to tell us. Databases get out of date but the system itself usually won't lie. We can build automated techniques to identify all of the patches from all relevant vendors that need to be applied to a given system -- and then apply them. We can either simplify our system configurations, which has obvious benefits, or figure out ways to ensure that components are better behaved, which reduces the combinatorial complexity of applying and testing the applied patches.
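As a rough illustration of asking the system what it has and then working out which patches apply, consider the sketch below. Everything in it is hypothetical: the component names, the patch identifiers, the version tuples, and the idea that a patch is described by the first version containing its fix. A real inventory and vendor catalog would be considerably messier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    patch_id: str     # hypothetical identifier
    component: str    # component the patch applies to
    fixed_in: tuple   # first version that already contains the fix


def missing_patches(inventory: dict, catalog: list) -> list:
    """Return IDs of patches for components that are installed but too old."""
    needed = []
    for patch in catalog:
        installed = inventory.get(patch.component)
        # Components that are not present need no patch at all.
        if installed is not None and installed < patch.fixed_in:
            needed.append(patch.patch_id)
    return sorted(needed)


# Hypothetical inventory, as reported by the machine itself rather than
# looked up in a possibly stale database.
inventory = {"sql-engine": (8, 0, 1), "web-server": (5, 0, 2)}

catalog = [
    Patch("P-001", "sql-engine", (8, 0, 3)),   # installed and out of date
    Patch("P-002", "web-server", (5, 0, 2)),   # installed and already current
    Patch("P-003", "mail-server", (2, 1, 0)),  # not installed on this machine
]

print(missing_patches(inventory, catalog))  # ['P-001']
```

The point of the design is that an embedded component such as an invisible database engine only shows up if the inventory comes from the machine itself.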
The methods and practices of the security industry, particularly the high technology security world, have for years been derived from those developed for national security problems. These are problems for which the cost of failure is so enormous, as in the theft or misuse of nuclear weapons, that failure is not an option. For this class of problem, the commercial practice of balancing risk and cost is impossible, except in the vacuous limiting case in which cost is infinite.
Now we are working through practical security management in a commercial environment, and we are beginning to get our hands around some of the quantitative aspects of the problems we face. If getting a patch out to all computers in our environment will cost us a quarter of a million dollars and we have X patches per year, then we have to weigh that cost against the cost of the harm suffered and the cleanup expense incurred if we don't distribute all of the patches. That comparison tells us to spend some money, though not an arbitrarily large amount, on improving the quality of our automated patch distribution and application processes, with an objective of 100% coverage for automatic updates. It tells us to work for a better system of quantification of threat severity. It tells us to spend effort on strengthening our incident response capabilities.
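The weighing described above can be sketched in a few lines. The $250,000 per patch comes from the earlier example; the number of patches per year and the harm and cleanup figures below are placeholders invented purely for illustration, not data from any real environment.

```python
COST_PER_PATCH = 250_000  # dollars, from the article's example


def annual_patching_cost(patches_per_year: int, cost_per_patch: int) -> int:
    """Yearly cost of distributing every patch to every machine."""
    return patches_per_year * cost_per_patch


def patching_is_cheaper(patches_per_year: int, cost_per_patch: int,
                        expected_harm: int, expected_cleanup: int) -> bool:
    """True when patching everything costs less than the expected loss from not doing so."""
    loss_if_unpatched = expected_harm + expected_cleanup
    return annual_patching_cost(patches_per_year, cost_per_patch) < loss_if_unpatched


# Placeholder inputs: X = 12 patches per year, $2,000,000 expected harm,
# $500,000 expected cleanup. None of these values comes from the article.
print(annual_patching_cost(12, COST_PER_PATCH))                         # 3000000
print(patching_is_cheaper(12, COST_PER_PATCH, 2_000_000, 500_000))      # False

# If better automation cut the per-patch cost to $25,000 (99% success in the
# earlier example), the same comparison would tip the other way.
print(patching_is_cheaper(12, 25_000, 2_000_000, 500_000))              # True
```

The interesting lever is the per-patch cost: with these placeholder inputs, improving automation changes the answer without any change in the threat.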